Kishore Papineni
Kishore leads the Coauthor team, whose objective is cross-lingual, cross-modal access to dynamically organized information. His team aims to make content consumption and creation a richer experience by surfacing relevant and diverse information from the web, possibly synthesized dynamically across different sources and types of content such as text, images, charts, and videos. The Coauthor team powers the web content suggestions shown in Google Docs while users write a document, and is working on additional content recommendation applications. His work at Google includes the veracity of information on the web, the depth of discourse on a topic in a document, the drift of discourse on a topic on the web, identifying concepts peculiar to a collection of documents and the relationships among those concepts, and identifying different perspectives in content. His past work spans automatic control theory, natural language understanding, dialog management, machine translation, and display advertising. Prior to joining Google, he led machine learning at Yahoo! Research and machine translation at IBM Research. He is a coauthor of the BLEU metric for automatic evaluation of machine translation quality, which received a 2018 Test-of-Time paper award in computational linguistics. He was a founding Editor-in-Chief of ACM Transactions on Speech and Language Processing from 2003 to 2007.
Authored Publications
Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines
Yuchen Li
Alexandre Kirchmeyer
Aashay Mehta
Yilong Qin
Andrej Risteski
International Conference on Machine Learning (2024) (to appear)
Autoregressive language models are the currently dominant paradigm for text generation; however, they have some fundamental limitations that cannot be remedied by scale, for example inherently sequential and unidirectional generation. While alternative classes of models have been explored, we have limited mathematical understanding of their fundamental power and limitations. In this paper we focus on Generative Masked Language Models (GMLMs), a non-autoregressive paradigm in which we train a model to fit conditional probabilities of the data distribution via masking; these conditionals are subsequently used as the transitions of a Markov chain to draw samples from the model. These models empirically strike a promising speed-quality trade-off, as each step can typically be parallelized by decoding the entire sequence at once. We develop a mathematical framework for analyzing and improving such models that sheds light on questions of sample complexity, inference speed, and quality. Empirically, we adapt the T5 model for iteratively refined parallel decoding, achieving a 2-3x speedup in machine translation with minimal sacrifice in quality compared with autoregressive models. We run careful ablation experiments to give recommendations on key design choices, and make fine-grained observations on the common error modes in connection with our theory. Our mathematical analyses and empirical observations characterize both the potential and the limitations of this approach, and can be applied to future work on improving the understanding and performance of GMLMs.
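As an illustration of the decoding scheme the abstract describes, here is a minimal sketch of iterative parallel decoding with a masked language model: every position is re-predicted in parallel from the previous iterate, a Markov-chain-style refinement step, rather than left to right as in autoregressive decoding. The `toy_predict` stand-in and all names here are hypothetical illustrations, not the paper's T5-based implementation.

```python
MASK = "<mask>"

def iterative_parallel_decode(seq, predict_fn, num_steps=3):
    """Refine a (partially) masked sequence by repeatedly re-predicting
    all positions in parallel from the previous iterate."""
    seq = list(seq)
    for _ in range(num_steps):
        prev = list(seq)  # all positions condition on the *previous* step
        seq = [predict_fn(prev, i) for i in range(len(prev))]
    return seq

# Toy stand-in for a trained masked LM's per-position prediction:
# a real model would return the argmax of p(token_i | rest of sequence).
TARGET = ["the", "cat", "sat"]

def toy_predict(seq, i):
    return TARGET[i]

print(iterative_parallel_decode([MASK] * 3, toy_predict))
# -> ['the', 'cat', 'sat']
```

Because each refinement step touches the whole sequence at once, a fixed number of steps replaces the length-proportional number of steps an autoregressive decoder would need, which is the source of the speedup the abstract reports.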
It is generally believed that robust training of extremely large networks is critical to their success in real-world applications. However, when taken to the extreme, methods that promote robustness can hurt the model's sensitivity to rare or underrepresented patterns. In this paper, we discuss this trade-off between sensitivity and robustness to natural (non-adversarial) perturbations by introducing two notions: contextual feature utility and contextual feature sensitivity. We propose Feature Contrastive Learning (FCL) that encourages a model to be more sensitive to the features that have higher contextual utility. Empirical results demonstrate that models trained with FCL achieve a better balance of robustness and sensitivity, leading to improved generalization in the presence of noise on both vision and NLP datasets.
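One way to make the notion of feature sensitivity in the abstract concrete is to measure how much a model's output changes when a single input feature is perturbed. The sketch below is only an illustrative finite-difference version of that idea under assumed toy inputs; it is not the paper's definition of contextual feature sensitivity or the FCL objective.

```python
def feature_sensitivity(model_fn, x, i, eps=1e-4):
    """Finite-difference estimate of how strongly model_fn's output
    responds to a small perturbation of feature i of input x."""
    xp = list(x)
    xp[i] += eps
    return abs(model_fn(xp) - model_fn(x)) / eps

# Toy linear model: heavily weights feature 0, barely weights feature 1.
model = lambda x: 2.0 * x[0] + 0.1 * x[1]

s0 = feature_sensitivity(model, [1.0, 1.0], 0)  # close to 2.0
s1 = feature_sensitivity(model, [1.0, 1.0], 1)  # close to 0.1
```

A training objective in the spirit of the abstract would push sensitivities like `s0` and `s1` to track each feature's contextual utility, rather than flattening them uniformly as aggressive robustness training can.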
Document and discourse segmentation are two fundamental NLP tasks that involve breaking text up into constituents and are commonly used to help downstream tasks such as information retrieval or text summarization. In this work, we propose three transformer-based architectures and provide comprehensive comparisons with previously proposed approaches on three standard datasets. We establish a new state of the art, in particular reducing error rates by a large margin in all cases. We further analyze model sizes and find that we can build models with far fewer parameters while maintaining good performance, thus facilitating real-world applications.
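Segmentation tasks like those in the abstract are often framed as per-sentence binary boundary prediction. The sketch below shows only this generic framing, with made-up boundary probabilities standing in for a model's output; it does not reproduce the transformer architectures the abstract evaluates.

```python
def segment(sentences, boundary_probs, threshold=0.5):
    """Split a list of sentences into segments wherever the predicted
    probability of a segment boundary exceeds a threshold."""
    segments, current = [], []
    for sent, p in zip(sentences, boundary_probs):
        current.append(sent)
        if p >= threshold:       # model says a segment ends here
            segments.append(current)
            current = []
    if current:                  # flush any trailing partial segment
        segments.append(current)
    return segments

print(segment(["s1", "s2", "s3", "s4"], [0.1, 0.9, 0.2, 0.3]))
# -> [['s1', 's2'], ['s3', 's4']]
```

Under this framing, a segmentation model reduces to a sentence-level classifier, and error rates can be reported per boundary decision, which is how comparisons across datasets are typically made.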