Accelerating Attention through Gradient-Based Learned Runtime Pruning

Amir Yazdanbakhsh; Hadi Esmaeilzadeh; Mingu Kang; Soroush Ghodrati; Zheng Li

ISCA (2022) (to appear)

Abstract

Self-attention is a key enabler of state-of-the-art accuracy in various transformer-based Natural Language Processing (NLP) models.
This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence.
Commonly, only a small subset of words correlates highly with the word under attention, and this subset is determined only at runtime.
As such, a significant amount of the computation associated with low attention scores is inconsequential and can potentially be pruned at runtime.
The challenge is finding the threshold for attention scores below which the subsequent computation will be inconsequential.
Although the threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the training loss function.
This formulation enables piggybacking on back-propagation training to analytically co-optimize the threshold and the weights.
This analytical approach strikes a formally optimal balance between accuracy and computation pruning.
To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd (Learning thrEsholds for On-the-fly Pruning Acceleration of tRansformer moDels), for transformer language models with a bit-level early-termination microarchitectural mechanism.
We evaluate our proposed mathematics and hardware across 38 target back-end tasks defined for the state-of-the-art transformer models MemN2N, BERT-Base, and BERT-Large.
Post-layout results show that, on average, LeOPArd yields \SpeedupOverBaseline and \EnergyOverBaseline speedup and energy reduction, respectively. These improvements are achieved while keeping the average accuracy virtually intact (at most 0.3% loss).
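
To make the thresholding idea in the abstract concrete, here is a minimal Python sketch. It is illustrative only: the abstract does not give the form of the regularizer, so the sigmoid gate, the `temperature` parameter, and every function name below are assumptions rather than the paper's formulation. The sketch contrasts a hard, inference-time pruning rule with a smooth surrogate whose gradient with respect to the threshold is well defined, which is what lets a threshold be trained by back-propagation alongside the weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hard_pruned_attention(scores, threshold):
    """Inference-time view: scores below the threshold are dropped,
    and softmax is computed over the surviving scores only."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    probs = [0.0] * len(scores)
    if not kept:
        return probs
    for i, p in zip(kept, softmax([scores[i] for i in kept])):
        probs[i] = p
    return probs


def soft_keep_gate(score, threshold, temperature=0.1):
    """Assumed smooth stand-in for the step function [score >= threshold]."""
    return 1.0 / (1.0 + math.exp(-(score - threshold) / temperature))


def pruning_regularizer(scores, threshold, temperature=0.1):
    """Soft fraction of scores that survive pruning (hypothetical penalty term).
    Adding a multiple of this to the task loss would reward higher thresholds."""
    gates = [soft_keep_gate(s, threshold, temperature) for s in scores]
    return sum(gates) / len(scores)


def regularizer_grad_wrt_threshold(scores, threshold, temperature=0.1):
    """Analytic derivative of pruning_regularizer with respect to the threshold.
    Each sigmoid gate k contributes -k * (1 - k) / temperature."""
    grad = 0.0
    for s in scores:
        k = soft_keep_gate(s, threshold, temperature)
        grad += -k * (1.0 - k) / temperature
    return grad / len(scores)


# Example: five raw attention scores for one query, threshold of 1.0.
scores = [4.0, 0.2, -1.5, 3.1, -0.3]
probs = hard_pruned_attention(scores, threshold=1.0)
keep_fraction = pruning_regularizer(scores, threshold=1.0)
grad = regularizer_grad_wrt_threshold(scores, threshold=1.0)
```

In this sketch only the two scores above the threshold receive nonzero probability, and the soft keep fraction comes out close to the hard fraction of 2 out of 5. The gradient is negative, meaning that raising the threshold lowers the penalty; in a real training loop that pull toward more pruning would be balanced against the task loss.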

Research Areas

Machine intelligence
