Mandy Guo

Authored Publications
    LongT5: Efficient Text-To-Text Transformer for Long Sequences
    Joshua Ainslie
    David Uthus
    Jianmo Ni
    Yinfei Yang
    Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics
    Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
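    As a rough illustration of the TGlobal idea described above, the sketch below implements a single-head attention in which each token attends to a local window of neighbours plus "global" tokens that are simply block summaries of the input, computed on the fly with no extra side-inputs. The block size, window radius, and mean-pooled summaries are illustrative assumptions, not the exact LongT5 configuration.

```python
# Minimal single-head sketch of Transient Global (TGlobal) attention:
# local windowed attention plus on-the-fly block-summary "global" tokens.
# Block size, radius and mean pooling are illustrative choices only.
import torch
import torch.nn.functional as F

def tglobal_attention(x, radius=4, block_size=8):
    """x: (seq_len, d_model) -> (seq_len, d_model)."""
    seq_len, d_model = x.shape
    scale = d_model ** -0.5

    # Transient global tokens: mean of each fixed-size block of the sequence.
    pad = (-seq_len) % block_size
    xp = F.pad(x, (0, 0, 0, pad))                      # pad the sequence length
    blocks = xp.view(-1, block_size, d_model).mean(1)  # (num_blocks, d_model)

    # Keys/values = the original tokens plus the block-summary tokens.
    kv = torch.cat([x, blocks], dim=0)                 # (seq_len + num_blocks, d)
    scores = (x @ kv.t()) * scale                      # (seq_len, seq_len + num_blocks)

    # Mask: local part visible only within +/- radius; global part always visible.
    pos = torch.arange(seq_len)
    local_mask = (pos[:, None] - pos[None, :]).abs() <= radius
    global_mask = torch.ones(seq_len, blocks.shape[0], dtype=torch.bool)
    mask = torch.cat([local_mask, global_mask], dim=1)
    scores = scores.masked_fill(~mask, float("-inf"))

    return F.softmax(scores, dim=-1) @ kv

out = tglobal_attention(torch.randn(20, 16))
print(out.shape)  # torch.Size([20, 16])
```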
    MURAL (MUltimodal, MUltitask Representations Across Languages)
    Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., 2021), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets; more importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL improves zero-shot mean recall by 14.4% on average for eight under-resourced languages and by 6.6% on average when fine-tuning. Interestingly, we also find that text representations learned from MURAL cluster not only by language genealogy but also by areal linguistics, such as the Balkan sprachbund.
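    The multitask dual-encoder setup described above can be sketched as two in-batch contrastive losses that share a text encoder: one for image-text matching and one for translation-pair matching. The tiny linear "encoders", the temperature, and the equal task weighting below are placeholder assumptions, not MURAL's actual architecture or loss weights.

```python
# Sketch of a dual encoder trained on two contrastive tasks that share a text tower.
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric in-batch softmax loss over a batch of aligned pairs."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Placeholder towers standing in for the image encoder and the shared text encoder.
image_encoder = nn.Linear(2048, 256)   # pooled image features -> embedding
text_encoder = nn.Linear(512, 256)     # pooled text features -> embedding

image_feats = torch.randn(32, 2048)    # batch of image features
caption_feats = torch.randn(32, 512)   # their captions
src_feats = torch.randn(32, 512)       # source sentences of translation pairs
tgt_feats = torch.randn(32, 512)       # their translations

# Task 1: image-text matching; Task 2: translation-pair matching (shared text tower).
loss = (contrastive_loss(image_encoder(image_feats), text_encoder(caption_feats)) +
        contrastive_loss(text_encoder(src_feats), text_encoder(tgt_feats)))
loss.backward()
```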
    Wiki-40B: Multilingual Language Model Dataset
    Zihang Dai
    Denny Vrandecic
    Rami Al-Rfou
    LREC 2020
    We release high-quality processed text of Wikipedia for 40+ languages. We train monolingual causal language models, establishing the first reported baselines for many languages. We also introduce the task of cross-lingual causal modeling: we train our baseline model (Transformer-XL) and report results with varying setups. We release our data and trained models for the community to use as baselines for further research in causal language modeling and cross-lingual learning.
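    The causal language-modeling objective these baselines optimize can be sketched as next-token prediction with shifted targets and cross-entropy. The tiny LSTM, vocabulary size, and batch below are illustrative stand-ins for the Transformer-XL baselines used in the paper.

```python
# Sketch of the causal LM objective: predict each token from its left context only.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)  # left-to-right only
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                    # (batch, seq_len, vocab_size)

model = TinyCausalLM()
tokens = torch.randint(0, vocab_size, (8, 32))

logits = model(tokens[:, :-1])                 # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
print(float(loss))                             # perplexity of the model = exp(loss)
```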
    Character-Level Language Modeling with Deeper Self-Attention
    Rami Al-Rfou
    DK Choe
    Llion Jones
    Thirty-Third AAAI Conference on Artificial Intelligence (2019)
    LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving 1.13 bits per character on text8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
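    The auxiliary-loss idea can be sketched by attaching a shared character-prediction head to intermediate transformer layers and adding their down-weighted losses to the final-layer loss. The depth, the shared head, and the 0.5 auxiliary weight below are illustrative choices; the 64-layer model in the paper uses its own loss schedule.

```python
# Sketch of intermediate-layer auxiliary losses for a character-level transformer LM.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, n_layers = 256, 64, 4     # 256 ~ byte/character vocabulary

embed = nn.Embedding(vocab_size, d_model)
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    for _ in range(n_layers))
head = nn.Linear(d_model, vocab_size)          # shared prediction head

chars = torch.randint(0, vocab_size, (8, 33))  # (batch, seq_len + 1)
inputs, targets = chars[:, :-1], chars[:, 1:]
causal_mask = nn.Transformer.generate_square_subsequent_mask(inputs.shape[1])

x = embed(inputs)
loss = 0.0
for i, layer in enumerate(layers):
    x = layer(x, src_mask=causal_mask)
    layer_loss = F.cross_entropy(head(x).reshape(-1, vocab_size), targets.reshape(-1))
    # Intermediate layers contribute a down-weighted auxiliary loss;
    # the final layer carries the main loss.
    loss = loss + (1.0 if i == n_layers - 1 else 0.5) * layer_loss
loss.backward()
```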
    Purely character-based language models have been lagging in quality on large scale datasets, and state-of-the-art language models currently rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the language model is essential to achieving competitive results. In this paper, we show that, contrary to this conventional wisdom, tokenizer-free language models with sufficient capacity can achieve competitive performance on a large scale dataset. We train a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieve new state of the art results for tokenizer-free language models, pushing these models to be on par with their word-based counterparts.
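    A tokenizer-free model of this kind consumes the raw byte or character sequence directly, so the vocabulary is fixed in advance and no learned word tokenizer is needed. The toy embedding lookup below shows only the input side of such a model and is an illustrative assumption; the paper trains a 40-layer vanilla transformer on top of this kind of representation.

```python
# Sketch of the tokenizer-free input pipeline: raw bytes in, embeddings out.
import torch
import torch.nn as nn

text = "Tokenizer-free language models read raw bytes."
byte_ids = torch.tensor(list(text.encode("utf-8"))).unsqueeze(0)  # (1, len)

embed = nn.Embedding(256, 64)       # one embedding per possible byte value
x = embed(byte_ids)                 # (1, len, 64): ready for a transformer stack
print(byte_ids.shape, x.shape)
```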