# Accelerating Attention through Gradient-Based Learned Runtime Pruning

ISCA (2022) (to appear)

## Abstract

Self-attention is a key enabler of state-of-the-art accuracy in various transformer-based Natural Language Processing (NLP) models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words correlates highly with the word under attention, and this subset is determined only at runtime. As such, a significant amount of the computation, namely that associated with low attention scores, is inconsequential and can potentially be pruned at runtime. The challenge is finding the threshold for attention scores below which the subsequent computation is inconsequential. Although the threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the training loss function. This formulation enables piggybacking on back-propagation training to analytically co-optimize the threshold and the weights simultaneously. This analytical approach strikes a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture for transformer language models, dubbed LeOPArd (Learning thrEsholds for On-the-fly Pruning Acceleration of tRansformer moDels), with a bit-level early-termination microarchitectural mechanism. We evaluate our proposed mathematics and hardware across 38 target back-end tasks defined for the MemN2N, BERT-Base, and BERT-Large state-of-the-art transformer models. Post-layout results show that, on average, LeOPArd yields \SpeedupOverBaseline and \EnergyOverBaseline speedup and energy reduction, respectively. These improvements are achieved while keeping the average accuracy virtually intact ($\leq 0.3\%$ loss).
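To make the idea of a soft differentiable threshold concrete, the following is a minimal sketch and not the paper's actual formulation, which the abstract does not specify. It assumes a sigmoid gate as the smooth stand-in for the hard comparison `score >= threshold`, and it assumes the regularizer is the average gate value, a proxy for the fraction of attention scores that survive pruning. The names `soft_keep`, `keep_regularizer`, `keep_regularizer_grad`, `hard_prune`, and the `temperature` parameter are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_keep(score, threshold, temperature=1.0):
    """Smooth surrogate (assumed form) for the hard test `score >= threshold`."""
    return sigmoid((score - threshold) / temperature)


def keep_regularizer(scores, threshold, temperature=1.0):
    """Average soft gate: a differentiable proxy for the fraction of kept scores."""
    return sum(soft_keep(s, threshold, temperature) for s in scores) / len(scores)


def keep_regularizer_grad(scores, threshold, temperature=1.0):
    """Analytic derivative of keep_regularizer with respect to the threshold."""
    total = 0.0
    for s in scores:
        g = soft_keep(s, threshold, temperature)
        # d/d(threshold) of sigmoid((s - threshold) / T) is -g * (1 - g) / T
        total += -g * (1.0 - g) / temperature
    return total / len(scores)


def hard_prune(scores, threshold):
    """Inference-time behavior: scores below the threshold are dropped (None)."""
    return [s if s >= threshold else None for s in scores]


# One gradient-descent step on the regularizer alone. In real training this
# term would be added to the task loss, whose gradient pushes back whenever
# pruning starts to hurt accuracy; that part is omitted here.
scores = [-2.0, -1.5, -0.3, 0.1, 2.4, 3.0]
threshold = -3.0
learning_rate = 0.5
new_threshold = threshold - learning_rate * keep_regularizer_grad(scores, threshold)
```

Because the gradient of this regularizer with respect to the threshold is always negative, descending on it alone raises the threshold and prunes more scores. The balance described in the abstract would come from optimizing it jointly with the task loss, so that the threshold settles where further pruning would cost accuracy.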