ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Haoran You
Yipin Guo
Yichao Fu
Wei Zhou
Huihong Shi
Souvik Kundu
Yingyan Lin
38th Annual Conference on Neural Information Processing Systems (NeurIPS) (2024)

Abstract

Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and significant latency bottlenecks. Shift-and-add reparameterization offers a solution by replacing costly multiplications with efficient hardware primitives in both the attention and multi-layer perceptron (MLP) layers. However, current reparameterization techniques necessitate training from scratch or full-parameter fine-tuning to restore accuracy, which is often impractical for LLMs. To this end, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, yielding efficient multiplication-less LLMs, dubbed ShiftAddLLM. Specifically, we quantize and reparameterize the weight matrices in LLMs into binary matrices of identical shape, coupled with scaling-factor matrices of reduced dimensions. Each scaling factor, corresponding to a group of weights, is quantized to a power of two. This reparameterization transforms the original multiplications between weights and activations into two steps: (1) bitwise shifts between activations and scaling factors, and (2) queries and additions of these shifted results according to the binary matrices. To mitigate accuracy drops, we optimize the reparameterization with multiple objectives. To further reduce memory usage and latency, we develop a mixed and automated bit allocation strategy that enables extreme quantization of LLMs. Moreover, we introduce ShiftAddLoRA to fine-tune the post-training ShiftAddLLM, achieving both fast and accurate inference and fine-tuning. Extensive experiments on various LLMs and downstream language tasks consistently validate the effectiveness of ShiftAddLLM.
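The two-step reparameterization in the abstract can be sketched numerically. The snippet below is a minimal, hedged illustration (not the paper's implementation): each group of weights is approximated by a binary {-1, +1} matrix times one scaling factor rounded to a power of two, so a matrix-vector product reduces to sign-controlled additions plus a per-group power-of-two scaling (a bitwise shift in integer arithmetic, emulated here in floating point). The function names, the one-bit-per-weight choice, and the group size are illustrative assumptions.

```python
import numpy as np

def quantize_pow2(s):
    # Round positive scales to the nearest power of two, so that
    # multiplying by a scale reduces to a bitwise shift on hardware.
    return 2.0 ** np.round(np.log2(np.maximum(s, 1e-12)))

def reparameterize(W, group_size=8):
    # One-bit sketch: approximate each row-wise group of W as alpha * B,
    # with B a binary {-1, +1} matrix of the same shape as W and alpha
    # one power-of-two scaling factor per (row, group) of weights.
    out_dim, in_dim = W.shape
    B = np.sign(W)
    B[B == 0] = 1.0  # avoid zeros in the binary matrix
    alphas = np.abs(W).reshape(out_dim, in_dim // group_size, group_size).mean(-1)
    return B, quantize_pow2(alphas)

def shiftadd_matvec(x, B, alphas, group_size=8):
    # Multiplication-less matvec: accumulate activations with signs from B
    # (queries and additions), then apply the power-of-two scales (shifts).
    out_dim, in_dim = B.shape
    y = np.zeros(out_dim)
    for g in range(in_dim // group_size):
        sl = slice(g * group_size, (g + 1) * group_size)
        # alphas[:, g] is a power of two, so this product is a shift
        # in fixed-point arithmetic; emulated with floats here.
        y += alphas[:, g] * (B[:, sl] @ x[sl])
    return y
```

With one bit per weight and coarse power-of-two scales, the approximation error of `shiftadd_matvec(x, B, alphas)` versus the exact `W @ x` is what the paper's multi-objective optimization and bit allocation are designed to reduce.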