Shankar Kumar

Authored Publications
    Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show how our framework is general enough to model important real-world applications such as Chrome's Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
    Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model
    You-Chi Cheng
    IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, IEEE, pp. 6097-6101
    Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reductions in uppercase character error rate and word error rate.
    Language model fusion can help smart assistants recognize tail words which are rare in acoustic data but abundant in text-only corpora. However, large-scale text corpora sourced from typed chat or search logs are often (1) prohibitively expensive to train on, (2) beset with content that is mismatched to the voice domain, and (3) heavy-headed rather than heavy-tailed (e.g., too many common search queries such as "weather"), hindering downstream performance gains. We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high-frequency (head) sentences. Second, to encourage rare-word accuracy, we explicitly filter for sentences with words which are rare in the acoustic data. Finally, we tackle domain mismatch by applying perplexity-based contrastive selection to filter for examples which are matched to the target domain. We downselect a large corpus of web search queries by a factor of over 50x to train an LM, achieving better perplexities on the target acoustic domain than without downselection. When used with shallow fusion on a production-grade speech engine, it achieves a WER reduction of up to 24% on rare-word sentences (without changing the overall WER) relative to a baseline LM trained on an unfiltered corpus.
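The abstract above does not give the exact soft log function, so the following is only a minimal sketch of the general idea of frequency-based downsampling. The function `soft_log_target`, its `threshold` parameter, and the specific formula `threshold * log(1 + count / threshold)` are assumptions chosen for illustration, not the paper's definition. The formula stays close to the raw count for rare sentences and grows only logarithmically for very frequent ones.

```python
import math
from collections import Counter


def soft_log_target(count: int, threshold: float) -> float:
    """Hypothetical soft log: near-linear for count << threshold, logarithmic above it."""
    return threshold * math.log1p(count / threshold)


def downsample_counts(sentences, threshold: float = 10.0) -> dict:
    """Map each distinct sentence to a reduced count (always keeping at least one copy)."""
    counts = Counter(sentences)
    return {
        sentence: max(1, round(soft_log_target(count, threshold)))
        for sentence, count in counts.items()
    }


# Illustrative corpus: one head query repeated many times and one rare query.
corpus = ["weather"] * 1000 + ["play the goldberg variations"]
reduced = downsample_counts(corpus, threshold=10.0)
```

Lowering `threshold` in this sketch compresses head sentences more aggressively, which is one way the reduction could be made tunable.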
    Most studies in cross-device federated learning focus on small models, due to the server-client communication and on-device computation bottlenecks. In this work, we leverage various techniques for mitigating these bottlenecks to train larger language models in cross-device federated learning. With systematic applications of partial model training, quantization, efficient transfer learning, and communication-efficient optimizers, we are able to train a 21M parameter Transformer that achieves the same perplexity as that of a similarly sized LSTM with ~10x smaller client-to-server communication cost and 11% lower perplexity than smaller LSTMs commonly studied in the literature.
    Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES
    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 4950-4961
    The softmax layer in neural machine translation is designed to model the distribution over mutually exclusive tokens. Machine translation, however, is intrinsically uncertain: the same source sentence can have multiple semantically equivalent translations. Therefore, we propose to replace the softmax activation with a multi-label classification layer that can model ambiguity more effectively. We call our loss function Single-label Contrastive Objective for Non-Exclusive Sequences (SCONES). We show that the multi-label output layer can still be trained on single reference training data using the SCONES loss function. SCONES yields consistent BLEU score gains across six translation directions, particularly for medium-resource language pairs and small beam sizes. By using smaller beam sizes and avoiding the expensive softmax partition function we can speed up inference by a factor of X without any degradation in BLEU score. Furthermore, we demonstrate that SCONES can be used to train NMT models that assign the highest probability to adequate translations, thus mitigating the "beam search curse". Additional experiments on synthetic language pairs with varying levels of uncertainty suggest that the improvements from SCONES can be attributed to better handling of ambiguity.
    Text-editing models have recently become a prominent alternative to seq2seq models for monolingual natural language generation (NLG) tasks such as grammatical error correction, text simplification, and style transfer. These tasks exhibit a large amount of textual overlap between the source and target texts. Text-editing models take advantage of this trait and learn to generate the output by predicting edit operations applied to the source sequence, in contrast to seq2seq models that generate the output from scratch. Text-editing models provide several benefits over seq2seq models including faster inference speed, higher sample efficiency, and better control and interpretability of the outputs. This tutorial provides a comprehensive overview of the text-edit based approaches and current state-of-the-art models, analyzing the pros and cons of different methods. We discuss challenges related to productionization and how these models can help mitigate hallucination and bias, both pressing challenges in the field of text generation.
    Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models
    Ilia Kulikov
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2022), pp. 8634-8645
    A widely used approach for neural machine translation (NMT) is to train an autoregressive model by maximizing the probability of training sentence pairs in conjunction with a mode-seeking decoding strategy for inference. The ultimate goal is to reduce the system error, i.e. to achieve a high translation quality of unseen sentences. However, this high-level perspective is oblivious to potential pitfalls within the training and decoding pipeline. In this work we propose to measure mode and search errors in addition to the system error in order to better understand the connections amongst them. We study how these errors change when we vary both the decoding strategy and the degree of sparsity of the learned distribution. First, we empirically confirm the high prevalence of modeling errors in NMT, and that the relation between search error and system error is highly non-monotonic. Second, we show that adding sparsity to the model can effectively reduce both mode and search error. Analyzing the mode translations shows that the qualitative improvements are partially due to better length modeling. However, the overall system error slowly increases as we make the decoder sparse suggesting that the current choice of decoding strategy can be further improved in the context of sparse models.
    Conciseness: An Overlooked Language Task
    Aashish Kumar
    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Abu Dhabi
    We report on novel investigations into training models that make sentences concise. We define the task and show that it is different from related tasks such as summarization and simplification. For evaluation, we release two test sets, consisting of 2000 sentences each, that were annotated by two and five raters, respectively. We demonstrate that conciseness is a difficult task for which zero-shot setups with giant neural language models often do not perform well. Given the limitations of these approaches, we propose a synthetic data generation method based on round-trip translations. Using this data to either train Transformers from scratch or fine-tune T5 models yields our strongest baselines that can be further improved by fine-tuning on an artificial conciseness dataset that we derived from multi-annotator machine translation test sets.
    Text normalization, or the process of transforming text into a consistent, canonical form, is crucial for speech applications such as text-to-speech synthesis (TTS). In TTS, the system must decide whether to verbalize "1995" as "nineteen ninety five" in "born in 1995" or as "one thousand nine hundred ninety five" in "page 1995". We present an experimental comparison of various Transformer-based sequence-to-sequence (seq2seq) models of text normalization for speech and evaluate them on a variety of datasets of written text aligned to its normalized spoken form. These models include variants of the 2-stage RNN-based tagging/seq2seq architecture introduced by Zhang et al. (2019) where we replace the RNN with a Transformer in one or more stages. We evaluate the performance when initializing the encoder with a pre-trained BERT model. We compare these model variants with a vanilla Transformer that outputs string representations of edit sequences. Of our approaches, using Transformers for sentence context encoding within the 2-stage model proved most effective, with the fine-tuned BERT model yielding the best performance.
    In many natural language processing (NLP) tasks the same input (e.g. source sentence) can have multiple possible outputs (e.g. translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence- and the task-level, intrinsic uncertainty has major implications for various aspects of search such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with a high level of ambiguity such as MT but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact n-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences.
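The abstract above describes measuring sentence-level uncertainty through the degree of overlap between references, but does not name the overlap metric. As a rough sketch only, the snippet below uses mean pairwise Jaccard overlap of whitespace-tokenized n-grams; the metric, the function names `ngrams` and `reference_overlap`, and the tokenization are all assumptions made for illustration. Under this sketch, lower overlap among references would indicate higher intrinsic uncertainty for that sentence.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def reference_overlap(references, n=1):
    """Mean pairwise Jaccard overlap of n-gram sets across references (assumed metric)."""
    if len(references) < 2:
        raise ValueError("need at least two references to measure overlap")
    sets = [ngrams(ref.split(), n) for ref in references]
    scores = []
    for a, b in combinations(sets, 2):
        union = a | b
        # Two empty n-gram sets are treated as identical.
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Illustrative references: near-identical corrections versus divergent translations.
low_uncertainty = reference_overlap(["she goes home", "she goes home"])
high_uncertainty = reference_overlap(["put jam on first", "cream goes before jam"])
```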