Mandy Guo
Authored Publications
LongT5: Efficient Text-To-Text Transformer for Long Sequences
Joshua Ainslie
David Uthus
Jianmo Ni
Yinfei Yang
Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics
Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
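To make the TGlobal idea above concrete, here is a minimal NumPy sketch of how transient global tokens and the two attention masks of a TGlobal-style layer might be formed. The function names, the simple block-averaging, and the mask layout are illustrative assumptions for this sketch, not code from the LongT5 paper.

```python
import numpy as np

def transient_global_tokens(x: np.ndarray, block_size: int) -> np.ndarray:
    """Average each fixed-size block of token embeddings into one 'global' token.

    x: [seq_len, d_model] token embeddings. Block-mean aggregation is the
    assumption this sketch makes for how globals are summarized on the fly.
    """
    seq_len, d = x.shape
    num_blocks = int(np.ceil(seq_len / block_size))
    pad = num_blocks * block_size - seq_len
    x_padded = np.pad(x, ((0, pad), (0, 0)))
    return x_padded.reshape(num_blocks, block_size, d).mean(axis=1)

def tglobal_attention_masks(seq_len: int, local_radius: int, block_size: int):
    """Boolean masks: each input token attends to a local window of inputs
    plus to every transient global token (no extra side inputs required)."""
    positions = np.arange(seq_len)
    local_mask = np.abs(positions[:, None] - positions[None, :]) <= local_radius
    num_blocks = int(np.ceil(seq_len / block_size))
    global_mask = np.ones((seq_len, num_blocks), dtype=bool)
    return local_mask, global_mask

# Example: 10 input tokens, hidden size 4, blocks of 4 -> 3 global tokens.
x = np.random.randn(10, 4)
globals_ = transient_global_tokens(x, block_size=4)
local_mask, global_mask = tglobal_attention_masks(10, local_radius=2, block_size=4)
```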
MURAL: Multimodal, Multitask Retrieval Across Languages
Aashi Jain
Krishna Srinivasan
Ting Chen
Chao Jia
Yinfei Yang
EMNLP (2021)
We release high-quality processed Wikipedia text for 40+ languages. We train monolingual causal language models, establishing the first reported baselines for many of these languages. We also introduce the task of crosslingual causal modeling; we train our baseline model (Transformer-XL) and report results under varying setups. We release our data and trained models for the community to use as baselines for further research in causal language modeling and crosslingual learning.
Character-Level Language Modeling with Deeper Self-Attention
Rami Al-Rfou
DK Choe
Llion Jones
Thirty-Third AAAI Conference on Artificial Intelligence (2019)
LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving 1.13 bits per character on text8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
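As a rough illustration of the auxiliary losses used in the character-level language modeling paper above, the sketch below adds down-weighted next-character prediction losses from intermediate layers to the final-layer loss. The 0.5 weight, the per-layer readout, and the function names are assumptions made for this example, not the paper's exact training recipe.

```python
import numpy as np

def softmax_xent(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy; logits: [T, vocab], targets: [T] integer ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def total_loss(per_layer_logits: list, targets: np.ndarray,
               aux_weight: float = 0.5) -> float:
    """Final-layer loss plus down-weighted losses from intermediate layers.

    per_layer_logits holds next-character logits read out after each stacked
    self-attention layer; only the last entry is the model's real output, the
    rest contribute auxiliary losses (the weighting here is an assumption).
    """
    main = softmax_xent(per_layer_logits[-1], targets)
    aux = sum(softmax_xent(l, targets) for l in per_layer_logits[:-1])
    return main + aux_weight * aux

# Example: 4 layers, a sequence of 8 characters, a 27-symbol vocabulary.
rng = np.random.default_rng(0)
per_layer_logits = [rng.normal(size=(8, 27)) for _ in range(4)]
targets = rng.integers(0, 27, size=8)
print(total_loss(per_layer_logits, targets))
```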
Purely character-based language models have been lagging in quality on large-scale datasets, and state-of-the-art language models currently rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the language model is essential to achieving competitive results. In this paper, we show that, contrary to this conventional wisdom, tokenizer-free language models with sufficient capacity can achieve competitive performance on a large-scale dataset. We train a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieve new state-of-the-art results for tokenizer-free language models, pushing these models to be on par with their word-based counterparts.
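To illustrate what "tokenizer-free" means in practice for the abstract above, the sketch below maps raw text straight to UTF-8 byte ids and back, so the vocabulary is fixed at 256 symbols and no learned tokenizer is involved. This byte-level mapping is chosen here for simplicity; the paper's exact character-level vocabulary and preprocessing may differ.

```python
def encode_utf8_bytes(text: str) -> list:
    """Tokenizer-free input: raw text becomes a sequence of byte ids (0-255)."""
    return list(text.encode("utf-8"))

def decode_utf8_bytes(ids: list) -> str:
    """Inverse mapping; malformed byte sequences are replaced rather than raising."""
    return bytes(ids).decode("utf-8", errors="replace")

# Example round trip.
ids = encode_utf8_bytes("One Billion Word benchmark")
assert decode_utf8_bytes(ids) == "One Billion Word benchmark"
```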