Character-Level Language Modeling with Deeper Self-Attention

Rami Al-Rfou

DK Choe

Noah Constant

Mandy Guo

Llion Jones

Thirty-Third AAAI Conference on Artificial Intelligence (2019)


Abstract

LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving 1.13 bits per character on text8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
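The two quantities the abstract relies on, bits per character and a loss taken at intermediate sequence positions, can be illustrated with a minimal sketch. The function names and the equal weighting of positions are assumptions made for illustration. The abstract does not state how the auxiliary losses are weighted, and the intermediate-layer losses are not modeled here.

```python
import math

def cross_entropy_nats(probs, target):
    """Negative log-likelihood, in nats, of the target index under probs."""
    return -math.log(probs[target])

def bits_per_character(nats_losses):
    """Average per-character loss, converted from nats to bits."""
    mean_nats = sum(nats_losses) / len(nats_losses)
    return mean_nats / math.log(2)

def sequence_loss(per_position_probs, targets, all_positions=True):
    """Loss for one fixed-length context.

    all_positions=True averages the loss over every position, which is one
    way to realize auxiliary losses at intermediate sequence positions
    (equal weights are an assumption). all_positions=False scores only the
    prediction at the final position.
    """
    losses = [cross_entropy_nats(p, t)
              for p, t in zip(per_position_probs, targets)]
    if all_positions:
        return sum(losses) / len(losses)
    return losses[-1]

# Toy two-symbol vocabulary: predicted distributions at two positions.
toy_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_targets = [0, 1]
loss_all = sequence_loss(toy_probs, toy_targets, all_positions=True)
loss_last = sequence_loss(toy_probs, toy_targets, all_positions=False)
```

A model that assigns probability 0.5 to every correct character scores exactly 1 bit per character, so the reported 1.13 bits per character corresponds to an average probability of roughly 2^-1.13 (about 0.46) on the correct character, in the geometric-mean sense.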

Research Areas

Natural Language Processing
Machine Intelligence
