Shankar Kumar

Authored Publications
    Measuring Re-identification Risk
    Travis Dick
    Adel Javanmard
    Josh Karlin
    Andres Munoz Medina
    Gabriel Henrique Nunes
    Peilin Zhong
    Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show how our framework is general enough to model important real-world applications such as Chrome's Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
    Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES
    Felix Stahlberg
    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 4950-4961
    The softmax layer in neural machine translation is designed to model the distribution over mutually exclusive tokens. Machine translation, however, is intrinsically uncertain: the same source sentence can have multiple semantically equivalent translations. Therefore, we propose to replace the softmax activation with a multi-label classification layer that can model ambiguity more effectively. We call our loss function Single-label Contrastive Objective for Non-Exclusive Sequences (SCONES). We show that the multi-label output layer can still be trained on single reference training data using the SCONES loss function. SCONES yields consistent BLEU score gains across six translation directions, particularly for medium-resource language pairs and small beam sizes. By using smaller beam sizes and avoiding the expensive softmax partition function we can speed up inference by a factor of X without any degradation in BLEU score. Furthermore, we demonstrate that SCONES can be used to train NMT models that assign the highest probability to adequate translations, thus mitigating the "beam search curse". Additional experiments on synthetic language pairs with varying levels of uncertainty suggest that the improvements from SCONES can be attributed to better handling of ambiguity.
    Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model
    Hao Zhang
    You-Chi Cheng
    IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, IEEE, pp. 6097-6101
    Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.
    Most studies in cross-device federated learning focus on small models, due to the server-client communication and on-device computation bottlenecks. In this work, we leverage various techniques for mitigating these bottlenecks to train larger language models in cross-device federated learning. With systematic applications of partial model training, quantization, efficient transfer learning, and communication-efficient optimizers, we are able to train a 21M parameter Transformer that achieves the same perplexity as that of a similarly sized LSTM with ~10x smaller client-to-server communication cost and 11% lower perplexity than smaller LSTMs commonly studied in literature.
    Text normalization, or the process of transforming text into a consistent, canonical form, is crucial for speech applications such as text-to-speech synthesis (TTS). In TTS, the system must decide whether to verbalize "1995" as "nineteen ninety five" in "born in 1995" or as "one thousand nine hundred ninety five" in "page 1995". We present an experimental comparison of various Transformer-based sequence-to-sequence (seq2seq) models of text normalization for speech and evaluate them on a variety of datasets of written text aligned to its normalized spoken form. These models include variants of the 2-stage RNN-based tagging/seq2seq architecture introduced by Zhang et al. (2019) where we replace the RNN with a Transformer in one or more stages. We evaluate the performance when initializing the encoder with a pre-trained BERT model. We compare these model variants with a vanilla Transformer that outputs string representations of edit sequences. Of our approaches, using Transformers for sentence context encoding within the 2-stage model proved most effective, with the fine-tuned BERT model yielding the best performance.
    Text-editing models have recently become a prominent alternative to seq2seq models for monolingual natural language generation (NLG) tasks such as grammatical error correction, text simplification, and style transfer. These tasks exhibit a large amount of textual overlap between the source and target texts. Text-editing models take advantage of this trait and learn to generate the output by predicting edit operations applied to the source sequence, in contrast to seq2seq models that generate the output from scratch. Text-editing models provide several benefits over seq2seq models including faster inference speed, higher sample efficiency, and better control and interpretability of the outputs. This tutorial provides a comprehensive overview of the text-edit based approaches and current state-of-the-art models, analyzing the pros and cons of different methods. We discuss challenges related to productionization and how these models can help mitigate hallucination and bias, both pressing challenges in the field of text generation.
    Conciseness: An Overlooked Language Task
    Aashish Kumar
    Felix Stahlberg
    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Abu Dhabi
    We report on novel investigations into training models that make sentences concise. We define the task and show that it is different from related tasks such as summarization and simplification. For evaluation, we release two test sets, consisting of 2000 sentences each, that were annotated by two and five raters, respectively. We demonstrate that conciseness is a difficult task for which zero-shot setups with giant neural language models often do not perform well. Given the limitations of these approaches, we propose a synthetic data generation method based on round-trip translations. Using this data to either train Transformers from scratch or fine-tune T5 models yields our strongest baselines that can be further improved by fine-tuning on an artificial conciseness dataset that we derived from multi-annotator machine translation test sets.
    Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models
    Felix Stahlberg
    Ilia Kulikov
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2022), pp. 8634-8645
    A widely used approach for neural machine translation (NMT) is to train an autoregressive model by maximizing the probability of training sentence pairs in conjunction with a mode-seeking decoding strategy for inference. The ultimate goal is to reduce the system error, i.e. to achieve a high translation quality of unseen sentences. However, this high-level perspective is oblivious to potential pitfalls within the training and decoding pipeline. In this work we propose to measure mode and search errors in addition to the system error in order to better understand the connections amongst them. We study how these errors change when we vary both the decoding strategy and the degree of sparsity of the learned distribution. First, we empirically confirm the high prevalence of modeling errors in NMT, and that the relation between search error and system error is highly non-monotonic. Second, we show that adding sparsity to the model can effectively reduce both mode and search error. Analyzing the mode translations shows that the qualitative improvements are partially due to better length modeling. However, the overall system error slowly increases as we make the decoder sparse, suggesting that the current choice of decoding strategy can be further improved in the context of sparse models.
    Language model fusion can help smart assistants recognize tail words which are rare in acoustic data but abundant in text-only corpora. However, large-scale text corpora sourced from typed chat or search logs are often (1) prohibitively expensive to train on, (2) beset with content that is mismatched to the voice domain, and (3) heavy-headed rather than heavy-tailed (e.g., too many common search queries such as "weather"), hindering downstream performance gains. We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high frequency (head) sentences. Second, to encourage rare-word accuracy, we explicitly filter for sentences with words which are rare in the acoustic data. Finally, we tackle domain-mismatch by applying perplexity-based contrastive selection to filter for examples which are matched to the target domain. We downselect a large corpus of web search queries by a factor of over 50x to train an LM, achieving better perplexities on the target acoustic domain than without downselection. When used with shallow fusion on a production-grade speech engine, it achieves a WER reduction of up to 24% on rare-word sentences (without changing the overall WER) relative to a baseline LM trained on an unfiltered corpus.
    In many natural language processing (NLP) tasks the same input (e.g. source sentence) can have multiple possible outputs (e.g. translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models, we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence- and the task-level, intrinsic uncertainty has major implications for various aspects of search such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with a high level of ambiguity such as MT but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact n-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences.
    Truecasing is the task of restoring the correct case (uppercase or lowercase) of noisy text generated either by an automatic system for speech recognition or machine translation or by humans. It improves the performance of downstream NLP tasks such as named entity recognition and language modeling. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model, the first of its kind for this problem. Using sequence distillation, we also address the problem of truecasing while ignoring token positions in the sentence, i.e. in a position-invariant manner.
    Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models
    Felix Stahlberg
    Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications (2021)
    Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to realistically generate the broad range of grammatical errors made by human writers in practice. In this work, we use explicit error-type tags from automatic annotation tools like ERRANT to guide synthetic data generation. We compare several models that can produce ungrammatical sentences given a clean sentence and an error type tag, and use these models to build a new large synthetic pre-training set that matches the tag frequency distributions in a development set. Our synthetic data set yields large and consistent gains, leading to state-of-the-art performance on the BEA-test and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system that has been trained on mixed native and non-native English to a native English test set, even surpassing real training data consisting of high-quality sentence pairs.
    Data Strategies for Low-Resource Grammatical Error Correction
    Simon Flachs
    Felix Stahlberg
    Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, ACL
    Grammatical Error Correction (GEC) is a task that has been extensively investigated for the English language. However, for other, low-resource languages, the best practices for training GEC systems have not yet been systematically determined. We investigate how best to take advantage of existing data sources for improving GEC systems for languages with limited quantities of high quality training data. In particular, we compare methods for generating artificial error data to train GEC systems, and show that these methods can benefit from including morphological errors. We then look into the usefulness of noisy error correction data gathered from Wikipedia and the language learning website Lang8, and demonstrate that despite their inherent noise, these are valuable data sources. Finally, we show that GEC systems pre-trained on the noisy data sources can be fine-tuned effectively using small amounts of high quality, human-annotated data.
    We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table to be scaled up arbitrarily -- with a commensurate increase in performance -- without changing the token vocabulary. Since embeddings are sparsely retrieved from the table via a lookup, increasing the size of the table adds neither extra operations to each forward pass nor extra parameters that need to be stored on limited GPU/TPU memory. We explore scaling n-gram embedding tables up to nearly a billion parameters. When trained on a 3-billion sentence corpus, we find that LookupLM improves long tail log perplexity by 2.44 and long tail WER by 23.4% on a downstream speech recognition task over a standard RNN language model baseline, an improvement comparable to scaling up the baseline by 6.2x the number of floating point operations.
    Sequence Transduction Using Span-level Edit Operations
    Felix Stahlberg
    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 5147-5159
    We propose an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. We represent sequence-to-sequence transduction as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. We test our method on five NLP tasks (text normalization, sentence fusion, sentence splitting and rephrasing, text simplification, and grammatical error correction) and report competitive results across the board. We show that our method has clear speed advantages over full sequence models for grammatical error correction because inference time depends on the number of edits rather than the number of target tokens. For text normalization, sentence fusion, and grammatical error correction, we associate each edit operation with a task-specific tag to improve explainability.
    Corpora Generation for Grammatical Error Correction
    Jared Lichtarge
    Noam Shazeer
    Niki J. Parmar
    Simon Tong
    (2019) (to appear)
    Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics, while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similarly sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpora give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL-2014 benchmark and the JFLEG task. We provide systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.
    For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has begun to be challenged recently by end-to-end models which seek to combine acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since they remove the need for a separate expert-curated pronunciation lexicon to map from phoneme-based units to words. However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or from the joint learning of the various components with grapheme-based units. In this work, we conduct detailed experiments which are aimed at quantifying the value of phoneme-based pronunciation lexica in the context of end-to-end models. We examine phoneme-based end-to-end models, which are contrasted against grapheme-based ones on a large vocabulary English Voice-search task, where we find that graphemes do indeed outperform phoneme-based models. We also compare grapheme and phoneme-based end-to-end approaches on a multi-dialect English task, which once again confirms the superiority of graphemes, greatly simplifying the system for recognizing multiple dialects.
    Recurrent neural network language models (RNNLM) and Long Short Term Memory (LSTM) LMs, a variant of RNN LMs, have been shown to outperform traditional N-gram LMs on speech recognition tasks. However, these models are computationally more expensive than N-gram LMs for decoding, and thus, challenging to integrate into speech recognizers. Recent research has proposed the use of lattice-rescoring algorithms using RNNLMs and LSTMLMs as an efficient strategy to integrate these models into a speech recognition system. In this paper, we evaluate existing lattice rescoring algorithms along with a few of our own novel variants on a YouTube speech recognition task. Lattice rescoring using LSTMLMs reduces the word error rate (WER) for this task by about 6% relative to the WER obtained using an N-gram LM.
    Approaches for Neural-Network Language Model Adaptation
    Michael Alexander Nirschl
    Min Ma
    Interspeech 2017, Stockholm, Sweden (2017)
    Language Models (LMs) for Automatic Speech Recognition (ASR) are typically trained on large text corpora from news articles, books and web documents. These types of corpora, however, are unlikely to match the test distribution of ASR systems, which expect spoken utterances. Therefore, the LM is typically adapted to a smaller held-out in-domain dataset that is drawn from the test distribution. We present three LM adaptation approaches for Deep NN and Long Short-Term Memory (LSTM): (1) Adapting the softmax layer in the NN; (2) Adding a non-linear adaptation layer before the softmax layer that is trained only in the adaptation phase; (3) Training the extra non-linear adaptation layer in pre-training and adaptation phases. Aiming to improve upon a hierarchical Maximum Entropy (MaxEnt) second-pass LM baseline, which factors the model into word-cluster and word models, we build an NN LM that predicts only word clusters. Adapting the LSTM LM by training the adaptation layer in both training and adaptation phases (Approach 3), we reduce the cluster perplexity by 30% compared to an unadapted LSTM model. Initial experiments using a state-of-the-art ASR system show a 2.3% relative reduction in WER on top of an adapted MaxEnt LM.
    Open domain relation extraction systems identify relation and argument phrases in a sentence without relying on any underlying schema. However, current state-of-the-art relation extraction systems are available only for English because of their heavy reliance on linguistic tools such as part-of-speech taggers and dependency parsers. We present a cross-lingual annotation projection method for language independent relation extraction. We evaluate our method on a manually annotated test set and present results on three typologically different languages. We release these manual annotations and extracted relations in ten languages from Wikipedia.
    Large language models have been proven quite beneficial for a variety of automatic speech recognition tasks in Google. We summarize results on Voice Search and a few YouTube speech transcription tasks to highlight the impact that one can expect from increasing both the amount of training data, and the size of the language model estimated from such data. Depending on the task, availability and amount of training data used, language model size and amount of work and care put into integrating them in the lattice rescoring step, we observe reductions in word error rate between 6% and 10% relative, for systems on a wide range of operating points between 17% and 52% word error rate.
    Model Combination for Machine Translation
    John DeNero
    Franz Och
    Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) (2010), pp. 975-983
    Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
    Chris Dyer
    Franz Och
    Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, ACL and AFNLP (2009), pp. 163-171
    Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation
    Franz Och
    Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 620-629
    We present Minimum Bayes-Risk (MBR) decoding over translation lattices that compactly encode a huge number of translation hypotheses. We describe conditions on the loss function that will enable efficient implementation of MBR decoders on lattices. We introduce an approximation to the BLEU score (Papineni et al., 2001) that satisfies these conditions. The MBR decoding under this approximate BLEU is realized using Weighted Finite State Automata. Our experiments show that the Lattice MBR decoder yields moderate, consistent gains in translation performance over N-best MBR decoding on Arabic-to-English, Chinese-to-English and English-to-Chinese translation tasks. We conduct a range of experiments to understand why Lattice MBR improves upon N-best MBR and also study the impact of various parameters on MBR performance.
    Improving Word Alignment with Bridge Languages
    Franz Och
    Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, 209 N. Eighth Street, East Stroudsburg, PA, USA (2007)
    We describe an approach to improve Statistical Machine Translation (SMT) performance using multi-lingual, parallel, sentence-aligned corpora in several bridge languages. Our approach consists of a simple method for utilizing a bridge language to create a word alignment system and a procedure for combining word alignment systems from multiple bridge languages. The final translation is obtained by consensus decoding that combines hypotheses obtained using all bridge language word alignments. We present experiments showing that multilingual, parallel text in Spanish, French, Russian, and Chinese can be utilized in this framework to improve translation performance on an Arabic-to-English task.