ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Haoran You
Yipin Guo
Yichao Fu
Wei Zhou
Huihong Shi
Souvik Kundu
Yingyan Lin
38th Annual Conference on Neural Information Processing Systems (NeurIPS) (2024)

Abstract

Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and significant latency bottlenecks. Shift-and-add reparameterization offers a solution by replacing costly multiplications with efficient hardware primitives in both the attention and multi-layer perceptron (MLP) layers. However, current reparameterization techniques necessitate training from scratch or full-parameter fine-tuning to restore accuracy, which is often impractical for LLMs. To this end, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, yielding efficient multiplication-less LLMs, dubbed ShiftAddLLM. Specifically, we quantize and reparameterize the weight matrices in LLMs into binary matrices of identical shape, coupled with scaling-factor matrices of reduced dimensions. Each scaling factor, corresponding to a group of weights, is quantized to a power of two. This reparameterization transforms the original multiplications between weights and activations into two steps: (1) bitwise shifts between activations and scaling factors, and (2) queries and additions of these shifted results according to the binary matrices. To mitigate accuracy drops, we optimize the reparameterization with multiple objectives. To further reduce memory usage and latency, we develop a mixed and automated bit allocation strategy that enables extreme quantization of LLMs. Moreover, we introduce ShiftAddLoRA to fine-tune the post-training ShiftAddLLM, achieving both fast and accurate inference and fine-tuning. Extensive experiments on various LLMs and downstream language tasks consistently validate the effectiveness of ShiftAddLLM.
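The two-step reparameterization in the abstract can be sketched numerically. The snippet below is a minimal, hedged illustration (not the paper's implementation): each group of weights is approximated by a binary {-1, +1} matrix times one scaling factor rounded to a power of two, so a matrix-vector product reduces to sign-controlled additions plus a per-group power-of-two scaling (a bitwise shift in integer arithmetic, emulated here in floating point). The function names, the one-bit-per-weight choice, and the group size are illustrative assumptions.

```python
import numpy as np

def quantize_pow2(s):
    # Round positive scales to the nearest power of two, so that
    # multiplying by a scale reduces to a bitwise shift on hardware.
    return 2.0 ** np.round(np.log2(np.maximum(s, 1e-12)))

def reparameterize(W, group_size=8):
    # One-bit sketch: approximate each row-wise group of W as alpha * B,
    # with B a binary {-1, +1} matrix of the same shape as W and alpha
    # one power-of-two scaling factor per (row, group) of weights.
    out_dim, in_dim = W.shape
    B = np.sign(W)
    B[B == 0] = 1.0  # avoid zeros in the binary matrix
    alphas = np.abs(W).reshape(out_dim, in_dim // group_size, group_size).mean(-1)
    return B, quantize_pow2(alphas)

def shiftadd_matvec(x, B, alphas, group_size=8):
    # Multiplication-less matvec: accumulate activations with signs from B
    # (queries and additions), then apply the power-of-two scales (shifts).
    out_dim, in_dim = B.shape
    y = np.zeros(out_dim)
    for g in range(in_dim // group_size):
        sl = slice(g * group_size, (g + 1) * group_size)
        # alphas[:, g] is a power of two, so this product is a shift
        # in fixed-point arithmetic; emulated with floats here.
        y += alphas[:, g] * (B[:, sl] @ x[sl])
    return y
```

With one bit per weight and coarse power-of-two scales, the approximation error of `shiftadd_matvec(x, B, alphas)` versus the exact `W @ x` is what the paper's multi-objective optimization and bit allocation are designed to reduce.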