Accelerating Attention through Gradient-Based Learned Runtime Pruning

Amir Yazdanbakhsh

Hadi Esmaeilzadeh

Mingu Kang

Soroush Ghodrati

Zheng Li

ISCA (2022) (to appear)


Abstract

Self-attention is a key enabler of state-of-the-art accuracy in various transformer-based Natural Language Processing (NLP) models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words correlates highly with the word under attention, and this subset is only determined at runtime. As such, a significant amount of computation due to low attention scores is inconsequential and can potentially be pruned at runtime. The challenge is finding the threshold for attention scores below which the subsequent computation will be inconsequential. Although the threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the training loss function. This formulation enables piggybacking on back-propagation training to analytically co-optimize the threshold and the weights simultaneously. This analytical approach strikes a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd (Learning thrEsholds for On-the-fly Pruning Acceleration of tRansformer moDels), for transformer language models with a bit-level early-termination microarchitectural mechanism. We evaluate our proposed mathematics and hardware across 38 target back-end tasks defined for the MemN2N, BERT-Base, and BERT-Large state-of-the-art transformer models. Post-layout results show that, on average, LeOPArd yields \SpeedupOverBaseline and \EnergyOverBaseline speedup and energy reduction, respectively. These improvements are achieved while keeping the average accuracy virtually intact ($\leq 0.3\%$ loss).
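The idea of replacing a discrete pruning threshold with a soft, differentiable term can be illustrated with a minimal sketch. The abstract does not give the paper's actual regularizer, so everything below is an assumption made for illustration: the sigmoid surrogate, the `temperature` and `lam` parameters, and all function names are hypothetical and are not taken from the paper.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_keep(scores, threshold):
    """Runtime pruning: keep only attention scores at or above the threshold."""
    return [s for s in scores if s >= threshold]


def soft_keep_count(scores, threshold, temperature=0.1):
    """Differentiable surrogate for how many scores survive pruning.

    Each score contributes sigmoid((s - threshold) / temperature), which
    approaches the 0/1 step function as the temperature shrinks.
    (Assumed surrogate, not the paper's formulation.)
    """
    return sum(_sigmoid((s - threshold) / temperature) for s in scores)


def soft_keep_count_grad(scores, threshold, temperature=0.1):
    """Analytic derivative of soft_keep_count with respect to the threshold."""
    total = 0.0
    for s in scores:
        g = _sigmoid((s - threshold) / temperature)
        total += -g * (1.0 - g) / temperature
    return total


def regularized_loss(task_loss, scores, threshold, lam=0.01, temperature=0.1):
    """Task loss plus a penalty on the (soft) number of unpruned scores.

    lam is a hypothetical weight trading accuracy against pruned computation.
    """
    return task_loss + lam * soft_keep_count(scores, threshold, temperature)
```

Because the derivative with respect to the threshold is negative, a gradient-descent step on the penalty term alone raises the threshold and prunes more scores; in a full training loop the task-loss gradient would push back whenever pruning starts to hurt accuracy. At inference time only the hard comparison in `hard_keep` would be needed.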
