Google Research

Faster Transformer Decoding: N-gram Masked Self-Attention

ArXiv, Google Research (2020)


Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence S=s1,...,sS, we propose truncating the target-side context used for incremental predictions by making a Markov (N-gram) assumption. Experiments on WMT EnDe and EnFr data sets show that the N-gram masked self-attention model loses very little in BLEU score for N values in the range 4,...,8, depending on the task.

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work