Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask

Sheng-Chun Kao
Shivani Agrawal
Suvinay Subramanian
Tushar Krishna
(2022) (to appear)
Google Scholar


Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attentions due to its efficient execution on modern accelerators. Particularly, N:M sparsity is attractive because there are already hardware accelerator architectures that can leverage few forms of N:M structured sparsity in the model to yield higher compute-efficiency. While there is a large body of work proposing various recipes for N:M structured sparsity training, compute-efficient training recipes for structured sparsity is rather a less explored territory. In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute training cost (FLOPs). Building upon this study, we propose two new decay-based pruning methods, namely “pruning mask decay” and “sparse structure decay”. Our evaluations indicate that these proposed methods consistently deliver SOTA model accuracy, comparable to unstructured sparsity, on a transformer-based model for translate task. The increase in the accuracy of the sparse model using the new training recipes comes at the cost of marginal increase in the total training compute (FLOPs).