Hao Zhang
My research interests are in natural language processing, including machine translation and parsing. This page lists my publications at Google. For the complete publication list, please check out my Google Scholar page.
Research Areas
Authored Publications
Text Injection for Capitalization and Turn-taking Prediction In ASR Models
Weiran Wang
Zhong Meng
Interspeech 2023 (2023)
Preview abstract
Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall.
Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model
You-Chi Cheng
IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, IEEE, pp. 6097-6101
Preview abstract
Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.
Preview abstract
Truecasing is the task of restoring the correct case (uppercase or lowercase) of noisy text generated either by an automatic system for speech recognition or machine translation or by humans. It improves the performance of downstream NLP tasks such as named entity recognition and language modeling. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model, the first of its kind for this problem. Using sequence distillation, we also address the problem of truecasing while ignoring token positions in the sentence, i.e. in a position-invariant manner.
Preview abstract
Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) using characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names in a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%.
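The tagging formulation mentioned above can be illustrated with a minimal sketch. It assumes a simple B/I scheme (B marks the first character of a word, I marks a continuation); the abstract does not state which tag set the model uses, and the function names here are hypothetical. The sketch covers only the conversion between tags and segmentations, not the RNN that predicts the tags.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Encode a segmentation as one B/I tag per character (assumed tag scheme)."""
    tags: List[str] = []
    for word in words:
        if word:  # skip empty strings, which carry no characters to tag
            tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Decode per-character B/I tags back into component words."""
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    words: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


tags = words_to_tags(["open", "research"])
print(tags_to_words("openresearch", tags))  # ['open', 'research']
```

Under this framing, a tagger only has to predict one label per character, which is the same setup commonly used for Chinese word segmentation.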
Neural Models of Text Normalization for Speech Applications
Felix Stahlberg
Ke Wu
Richard Sproat
Xiaochang Peng
Computational Linguistics, 45(2) (2019) (to appear)
Preview abstract
Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that "123" is verbalized as "one hundred twenty three" in "123 pages" but "one twenty three" in "123 King Ave". For this task, state-of-the-art industrial systems depend heavily on hand-written language-specific grammars.
In this paper we present neural network models which treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context, and the output is the verbalization of that token. We find that the most effective model (in terms of efficiency and accuracy) is a model where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to compute the verbalization. This model allows for a great deal of flexibility in terms of representing the context, and also allows us to integrate tagging and segmentation into the process.
The neural models perform very well overall, but there is one problem, namely that occasionally they will predict inappropriate verbalizations, such as reading "3cm" as "three kilometers". While rare, such verbalizations are a major issue for TTS applications. To deal with such cases, we develop an approach based on finite-state "covering grammars", which can be used to guide the neural models (either during training and decoding, or just during decoding) away from such "silly" verbalizations. These covering grammars can also largely be learned from data.
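As a rough illustration of using a covering constraint at decoding time, the sketch below filters scored candidate verbalizations against a table of allowed unit readings. This is an assumption-laden toy: the paper uses finite-state covering grammars, whereas `ALLOWED_UNITS`, `is_covered`, and `pick_verbalization` are hypothetical names and a plain dictionary lookup stands in for the grammar. The candidate scores are made up for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy stand-in for a covering grammar: for each written unit,
# the spoken forms considered acceptable.
ALLOWED_UNITS: Dict[str, Set[str]] = {
    "cm": {"centimeter", "centimeters"},
    "km": {"kilometer", "kilometers"},
}


def is_covered(written_unit: str, verbalization: str) -> bool:
    """True if the last word of the verbalization is an allowed reading of the unit."""
    allowed = ALLOWED_UNITS.get(written_unit)
    if allowed is None:
        return True  # no constraint known for this unit, so do not block
    words = verbalization.split()
    return bool(words) and words[-1] in allowed


def pick_verbalization(written_unit: str, candidates: List[Tuple[str, float]]) -> str:
    """Return the best-scoring covered candidate, else the best candidate overall."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for text, _score in ranked:
        if is_covered(written_unit, text):
            return text
    return ranked[0][0]


# Illustrative scores: the model wrongly prefers "three kilometers" for "3cm".
candidates = [("three kilometers", 0.6), ("three centimeters", 0.4)]
print(pick_verbalization("cm", candidates))  # three centimeters
```

The fallback to the top-scoring candidate is a design choice of this sketch, so that an incomplete constraint table never leaves a token without a verbalization.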
Preview abstract
Neural text normalization systems achieve high accuracy, but the errors they do make can include not only “acceptable” errors (such as reading $3 as three dollar) but also unacceptable errors (reading $3 as three euros). We explore ways of training dual encoder classifiers with both positive and negative data to then use as soft constraints in neural text normalization in order to decrease the number of unacceptable errors. Already-low error rates and high variability in performance on the evaluation set make it difficult to determine when improvement is significant, but qualitative analysis suggests that certain types of dual encoder constraints yield systems that make fewer unacceptable errors.
Preview abstract
Recognizing written domain numeric utterances (e.g., I need $1.25.) can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g., I need one dollar and twenty five cents.), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR models are not suitable for the low memory setting of on-device speech recognition. E2E models such as RNN-T are attractive for on-device ASR, as they fold the AM, PM and LM of a conventional model into one neural network. However, in the on-device setting the large memory footprint of an FST denormer makes spoken domain training more difficult and forces training back into the written domain, resulting in poor model performance on numeric sequences. In this paper, we investigate techniques to improve E2E model performance on numeric data. We find that using a text-to-speech system to generate additional training data that emphasizes difficult numeric utterances, as well as using an independently trained small-footprint neural network to perform spoken-to-written domain denorming, yields improvement in several numeric classes. In the case of the longest numeric sequences, for which the OOV issue is most prevalent, we see reduction of WER by up to a factor of 7.
Preview abstract
Attention-based sequence-to-sequence neural network models learn to jointly align and translate. The quadratic-time attention mechanism is powerful as it is capable of handling arbitrary long-distance reordering, but computationally expensive. In this paper, towards making neural translation both accurate and efficient, we follow the traditional pre-reordering approach to decouple reordering from translation. We add a reordering RNN that shares the input encoder with the decoder. The RNNs are trained jointly with a multi-task loss function and applied sequentially at inference time. The task of the reordering model is to predict the permutation of the input words following the target language word order. After reordering, the attention in the decoder becomes more peaked and monotonic. For reordering, we adopt the Inversion Transduction Grammars (ITG) and propose a transition system to parse input to trees for reordering. We harness the ITG transition system with RNN. With the modeling power of RNN, we achieve superior reordering accuracy without any feature engineering. In experiments, we apply the model to the task of text normalization. Compared to a strong baseline of attention-based RNN, our ITG RNN reordering model can reach the same reordering accuracy with only 1/10 of the training data and is 2.5x faster in decoding.
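To make the link between an ITG tree and a permutation concrete, here is a minimal sketch of how a binary tree with straight and inverted nodes yields a reordering of the input. It does not implement the transition system or the RNN from the paper; the tuple-based tree encoding and the names `itg_order` and `reorder` are assumptions made for illustration.

```python
from typing import List, Tuple, Union

# A tree is either a leaf (the index of an input word) or a triple
# (orientation, left subtree, right subtree). This encoding is hypothetical.
Tree = Union[int, Tuple[str, "Tree", "Tree"]]


def itg_order(tree: Tree) -> List[int]:
    """Read off the word order implied by an ITG-style binary tree."""
    if isinstance(tree, int):
        return [tree]
    orientation, left, right = tree
    left_order, right_order = itg_order(left), itg_order(right)
    if orientation == "straight":
        return left_order + right_order   # keep the children in source order
    if orientation == "inverted":
        return right_order + left_order   # swap the two children
    raise ValueError(f"unknown orientation: {orientation!r}")


def reorder(words: List[str], tree: Tree) -> List[str]:
    """Apply the permutation encoded by the tree to the input words."""
    return [words[i] for i in itg_order(tree)]


tree = ("straight", 0, ("inverted", ("straight", 1, 2), 3))
print(itg_order(tree))                        # [0, 3, 1, 2]
print(reorder(["a", "b", "c", "d"], tree))    # ['a', 'd', 'b', 'c']
```

Because every internal node either keeps or swaps two adjacent spans, only a restricted family of permutations can be expressed this way, which is what lets reordering be treated as a parsing problem.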
Knowledge Exploration using Tables on the Web
Fernando Chirigati
Cong Yu
Proceedings of the VLDB Endowment, 10 (2017), pp. 193-204
Preview abstract
The increasing popularity of mobile device usage has ushered in many features in modern search engines that help users with various information needs. One of those needs is Knowledge Exploration, where related documents are returned in response to a user query, either directly through right-hand side knowledge panels or indirectly through navigable sections underneath individual search results. Existing knowledge exploration features have relied on a combination of Knowledge Bases and query logs. In this paper, we propose Knowledge Carousels of two modalities, namely sideways and downwards, that facilitate exploration of IS-A and HAS-A relationships, respectively, with regard to an entity-seeking query, based on leveraging the large corpus of tables on the Web. This brings many technical challenges, including associating correct carousels with the search entity, selecting the best carousel from the candidates, and finding titles that best describe the carousel. We describe how we address these challenges and also experimentally demonstrate through user studies that our approach produces better result sets than baseline approaches.