Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 235 publications
    Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech
    Fei He
    Shan Hui Cathy Chu
    Supheakmungkol Sarin
    Knot Pipatsrisawat
    Alena Butryna
    Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, pp. 6504-6513
    Preview abstract In this paper we present a multidialectal corpus approach for building a text-to-speech voice for a new dialect in a language with existing resources, focusing on various South American dialects of Spanish. We first present public speech datasets for Argentinian, Chilean, Colombian, Peruvian, Puerto Rican and Venezuelan Spanish specifically constructed with text-to-speech applications in mind using crowd-sourcing. We then compare the monodialectal voices built with minimal data to a multidialectal model built by pooling all the resources from all dialects. Our results show that the multidialectal model outperforms the monodialectal baseline models. We also experiment with a ``zero-resource'' dialect scenario where we build a multidialectal voice for a dialect while holding out target dialect recordings from the training data. View details
    Preview abstract This paper describes a machine learning algorithm for document (re)ranking, in which queries and documents are firstly encoded using BERT [1], and on top of that a learning-to-rank (LTR) model constructed with TF-Ranking (TFR) [2] is applied to further optimize the ranking performance. This approach is proved to be effective in a public MS MARCO benchmark [3]. Our submissions achieve the best performance for the passage re-ranking task as of March 30, 2020 [4], and the second best performance for the passage full-ranking task as of April 10, 2020 [5], demonstrating the effectiveness of combining ranking losses with BERT representations for document ranking. View details
    Tapas: Weakly Supervised Table Parsing via Pre-training
    Thomas Müller
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Seattle, Washington, United States (2020)
    Preview abstract Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art. View details
    Preview abstract We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repeatedly translated from scratch as it grows. This approach naturally exhibits very low latency and high final quality, but at the cost of incremental instability as the output is continuously refined. We experiment with a pipeline of industry-grade speech recognition and translation tools, augmented with simple inference heuristics to improve stability. We use TED Talks as a source of multilingual test data, developing our techniques on English-to-German spoken language translation. Our minimalist approach to simultaneous translation allows us to easily scale our final evaluation to six more target languages, dramatically improving incremental stability for all of them. View details
    Preview abstract We present neural machine translation (NMT) models inspired by echo state network (ESN), named Echo State NMT (ESNMT), in which the encoder and decoder layer weights are randomly generated then fixed throughout training. We show that even with this extremely simple model construction and training procedure, ESNMT can already reach 70-80\% performance of fully trainable baselines. We examine how spectral radius of the reservoir, a key quantity that characterizes the model, determines the model behavior. Our findings indicate that randomized networks can work well even for complicated sequence-to-sequence prediction NLP tasks. View details
    Preview abstract Building equitable and inclusive technologies demands paying attention to how social attitudes towards persons with disabilities are represented within technology. Representations perpetuated by NLP models often inadvertently encode undesirable social biases from the data on which they are trained. In this paper, first we present evidence of such undesirable biases towards mentions of disability in two different NLP models: toxicity prediction and sentiment analysis. Next, we demonstrate that neural embeddings that are critical first steps in most NLP pipelines also contain undesirable biases towards mentions of disabilities. We then expose the topical biases in the social discourse about some disabilities which may explain such biases in the models; for instance, terms related to gun violence, homelessness, and drug addiction are over-represented in discussions about mental illness. View details
    Preview abstract Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only used outcome of an expensive human rating evaluation is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to achieve improved captioning models, even when the amount of caption ratings is several orders of magnitude less than the caption training data. We employ a policy gradient method to maximize the human ratings as rewards in an off-policy reinforcement learning setting, using a technique that makes use of a sampling distribution that focuses on the captions that are present in a caption-ratings dataset. We present empirical evidence that indicates that our models learn to generalize the human raters’judgments in the caption-ratings training data to a previously unseen set of images, as judged by a different set of human judges and additionally on a different, multi-dimensional side-by-side human evaluation procedure. View details
    Factoring Fact-Checks: Structured Information Extraction from Fact-Checking Articles
    Shan Jiang
    Abe Ittycheriah
    Cong Yu
    Proceedings of the 2020 Web Conference (WWW 2020)
    Preview abstract Fact-checking, which investigates claims made in public to arrive at a verdict supported by evidence and logical reasoning, has long been a significant form of journalism to combat misinformation in the news ecosystem. Most of the fact-checks share common structured information (called factors) such as claim, claimant, and verdict. In recent years, the emergence of ClaimReview as the standard schema for annotating those factors within fact-checking articles has led to wide adoption of fact-checking features by online platforms (e.g., Google, Bing). However, annotating fact-checks is a tedious process for fact-checkers and distracts them from their core job of investigating claims. As a result, less than half of the fact-checkers worldwide have adopted ClaimReview as of mid-2019. In this paper, we propose the task of factoring fact-checks for automatically extracting structured information from fact-checking articles. Exploring a public dataset of fact-checks, we empirically show that factoring fact-checks is a challenging task, especially for fact-checkers that are under-represented in the existing dataset. We then formulate the task as a sequence tagging problem and fine-tune the pre-trained BERT models with a modification made from our observations to approach the problem. Through extensive experiments, we demonstrate the performance of our models for well-known fact-checkers and promising initial results for under-represented fact-checkers. View details
    Measuring Compositional Generalization: A Comprehensive Method on Realistic Data
    Nathanael Schärli
    Nathan Scales
    Hylke Buisman
    Daniel Furrer
    Nikola Momchev
    Danila Sinopalnikov
    Lukasz Stafiniak
    Tibor Tihon
    Dmitry Tsarkov
    ICLR (2020)
    Preview abstract State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings. View details
    Preview abstract We present a large-scale dataset for the task of rewriting an ill-formed natural language question to a well-formed one. Our multi-domain question rewriting (MQR) dataset is constructed from human contributed Stack Exchange question edit histories. The dataset contains 427,719 question pairs which come from 303 domains. We provide human annotations for a subset of the dataset as a quality estimate. When moving from ill-formed to well-formed questions, the question quality improves by an average of 45 points across three aspects. We train sequence-to-sequence neural models on the constructed dataset and obtain an improvement of 13.2% in BLEU-4 over baseline methods built from other data resources. View details
    Preview abstract We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose new algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for tailoring Sinkhorn Attention for encoding and/or decoding purposes. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our Sinkhorn Attention remains competitive to the vanilla attention, consistently outperforming recently proposed efficient Transformer models such as Sparse Transformers, while retaining memory efficiency. View details
    Preview abstract Recent neural ranking algorithms focus on learning semantic matching between query and document terms. However, practical learning to rank systems typically rely on a wide range of side information beyond query and document textual features, like location, user context, etc. It is common practice to concatenate all of these features and rely on deep models to learn a complex representation. We study how to effectively and efficiently combine textual information from queries and documents with other useful but less prominent side information for learning to rank. We conduct synthetic experiments to show that: 1) neural networks are inefficient at learning the interaction between two prominent features (e.g., query and document embedding features) in the presence of other less prominent features; 2) direct application of a state-of-art method for higher-order feature generation is also inefficient at learning such important interactions. Based on the above observations, we propose a simple but effective matching cross network (MCN) method for learning to rank with side information. MCN conducts an element-wise multiplication matching of query and document embeddings and leverages a technique called latent cross to effectively learn the interaction between matching output and all side information. The approach is easy to implement, adds minimal parameters and latency overhead to standard neural ranking architectures, and can be used for efficient end-to-end training. We conduct extensive experiments using two of the world's largest personal search engines, Gmail and Google Drive search, and show that each proposed component adds meaningful gains against a strong production baseline with minimal latency overhead, thereby demonstrating the practical effectiveness and efficiency of the proposed approach. View details
    Preview abstract This paper seeks to develop a deeper understanding of the fundamental properties of neural text generations models. Concretely, the study of artifacts that emerge in machine generated text as a result of modeling choices is a nascent research area. To this end, the extent and degree to which these artifacts surface in generated text is still unclear. In the spirit of better understanding generative text models and their artifacts, we propose the new task of distinguishing which of several variants of a given model generated some piece of text. Specifically, we conduct an extensive suite of diagnostic tests to observe whether modeling choices (e.g., sampling methods, top-$k$ probabilities, model architectures, etc.) leave detectable artifacts in the text they generate. Our key finding, which is backed by a rigorous set of experiments, is that such artifacts are present and that different modeling choices can be inferred by looking at generated text alone. This suggests that neural text generators may actually be more sensitive to various modeling choices than previously thought. View details
    Processing South Asian languages written in the Latin script: the Dakshina dataset
    Lawrence Wolf-Sonkin
    Christo Kirov
    Sabrina J. Mielke
    Keith Hall
    Proceedings of the 12th Conference on Language Resources and Evaluation (LREC) (2020), 2413–2423
    Preview abstract This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. View details
    Preview abstract Named Entity Recognition (NER) is a fundamental task in Natural Language Processing, concerned with identifying spans of text expressing references to entities. NER research is often focused on flat entities only (flat NER), ignoring the fact that entity references can be nested, as in [Bank of [China]] (Finkel and Manning, 2009). In this paper, we use ideas from graph-based dependency parsing to provide our model a global view on the input via a biaffine model (Dozat and Manning, 2017). The biaffine model scores pairs of start and end tokens in a sentence which we use to explore all spans, so that the model is able to predict named entities accurately. We show that the model works well for both nested and flat NER through evaluation on 8 corpora and achieving SoTA performance on all of them, with accuracy gains of up to 2.2 percentage points. View details