Faster Transformer Decoding: N-gram Masked Self-Attention

Ciprian Chelba; Mia Chen; Ankur Bapna; Noam Shazeer

ArXiv, Google Research (2020)

Abstract

Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence S = s_1, ..., s_S, we propose truncating the target-side context used for incremental predictions by making a Markov (N-gram) assumption. Experiments on WMT EnDe and EnFr data sets show that the N-gram masked self-attention model loses very little in BLEU score for N values in the range 4, ..., 8, depending on the task.
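
The truncation described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `ngram_causal_mask` and `masked_self_attention` are hypothetical, the attention is single-head with no learned projections, and the window convention (each target position attends to itself and the N-1 positions before it) is an assumption, since the abstract does not spell out the exact boundary.

```python
import numpy as np


def ngram_causal_mask(length, n):
    """Boolean mask; entry [i, j] is True when position i may attend to j.

    Assumed convention: position i sees itself and the n-1 positions before it.
    """
    i = np.arange(length)[:, None]  # query (current) positions
    j = np.arange(length)[None, :]  # key (context) positions
    return (j <= i) & (j > i - n)   # causal AND within the n-token window


def masked_self_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
length, dim, n = 6, 4, 3
x = rng.normal(size=(length, dim))
mask = ngram_causal_mask(length, n)
out = masked_self_attention(x, x, x, mask)
```

Under this convention, incremental decoding only needs the keys and values of the most recent N target positions, so the per-step self-attention cost stays bounded by N instead of growing with the length of the output generated so far.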

Research Areas

Machine intelligence
