RealFormer: Transformer Likes Residual Attention

Anirudh Ravula
Joshua Ainslie
Ruining He
(2021)

Abstract

The Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique for creating Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants across different model sizes on a wide spectrum of tasks/benchmarks, including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. Qualitatively, RealFormer stabilizes training and leads to models with sparser attention. Code and pre-trained checkpoints will be open-sourced.
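The abstract only names the residual-attention technique; as a rough illustration, below is a minimal sketch of the idea as we read it, assuming each layer adds the previous layer's raw, pre-softmax attention scores to its own scores before the softmax and passes the summed scores forward. The single-head simplification, function names, and shapes are illustrative assumptions, not the paper's implementation.

# Minimal sketch of residual attention on raw scores (assumption-based,
# single-head for brevity; not the authors' released code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """Scaled dot-product attention with a residual edge on the raw scores.

    q, k, v: [seq_len, d] arrays.
    prev_scores: raw (pre-softmax) scores from the previous layer, or None
                 at the first layer.
    Returns (output, scores) so the scores can be threaded to the next layer.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)           # [seq_len, seq_len] raw scores
    if prev_scores is not None:
        scores = scores + prev_scores       # residual edge on attention scores
    weights = softmax(scores, axis=-1)
    return weights @ v, scores

# Toy usage: chain two "layers", passing the raw scores from one to the next.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out1, s1 = residual_attention(x, x, x)                            # first layer
out2, s2 = residual_attention(out1, out1, out1, prev_scores=s1)   # reuses s1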

Research Areas