
Xiaofan Zhang
Authored Publications
SSDTrain: Faster Large Language Model Training Using SSD-Based Activation Offloading
Kun Wu
Jeongmin Brian Park
Mert Hidayetoğlu
Vikram Sharma Mailthody
Sitao Huang
Steven Lumetta
Wen-mei Hwu
Design Automation Conference (DAC) (2025)
The scaling up of Large Language Models (LLMs) demands more memory than current GPUs can provide, hindering the training process. To address this challenge, we propose SSDTrain to efficiently offload activations, the intermediate tensors produced during LLM training, to SSDs. This approach reduces GPU memory usage without impacting performance by adaptively overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication, forwarding, and adaptive offloading to further enhance efficiency. We conduct extensive experiments on Llama, BERT, and T5. Results demonstrate that SSDTrain reduces peak activation memory usage by 45% and fully overlaps I/O with computation without introducing a performance penalty. Compared to the conventional training strategy on the same GPU systems, SSDTrain achieves a performance boost of up to 31%.
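The overlap between offloading and computation described above can be illustrated with a short PyTorch sketch. It shows only the general pattern (a dedicated CUDA copy stream plus a pinned host staging buffer), not the SSDTrain implementation; SSDTrain would additionally persist the staged tensors to SSD.

```python
# Minimal sketch of overlapping activation offloading with computation,
# using a dedicated CUDA stream and a pinned host buffer. Illustrative only,
# not the SSDTrain implementation.
import torch

def offload_activation(act: torch.Tensor, copy_stream: torch.cuda.Stream) -> torch.Tensor:
    """Start an asynchronous device-to-host copy of `act` on `copy_stream`.

    The returned pinned buffer is valid only after `copy_stream` has been
    synchronized (or an event recorded on it has completed).
    """
    host_buf = torch.empty(act.shape, dtype=act.dtype, device="cpu", pin_memory=True)
    copy_stream.wait_stream(torch.cuda.current_stream())  # wait until act is produced
    with torch.cuda.stream(copy_stream):
        host_buf.copy_(act, non_blocking=True)            # async device-to-host copy
    act.record_stream(copy_stream)                        # keep act alive until the copy finishes
    return host_buf

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    x = torch.randn(4096, 4096, device="cuda")
    act = torch.relu(x)                        # activation produced by the forward pass
    staged = offload_activation(act, copy_stream)
    y = act @ x                                # compute continues on the default stream
    copy_stream.synchronize()                  # staged now holds the activation on the host
```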
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
Haoran You
Yipin Guo
Yichao Fu
Wei Zhou
Huihong Shi
Souvik Kundu
Yingyan Lin
38th Annual Conference on Neural Information Processing Systems (NeurIPS) (2024)
Large language models (LLMs) have shown impressive performance in language tasks but face challenges when deployed on devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and significant latency bottlenecks. Shift-and-add reparameterization offers a solution by replacing costly multiplications with efficient hardware primitives in both attention and multi-layer perceptron (MLP) layers. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is often impractical for LLMs. To this end, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, yielding efficient multiplication-less LLMs, dubbed ShiftAddLLM. Specifically, we quantize and reparameterize the weight matrices in LLMs into binary matrices of identical shape, coupled with scaling factor matrices of reduced dimensions. Each scaling factor, corresponding to a group of weights, is quantized to a power of two. This reparameterization transforms the original multiplications between weights and activations into two steps: (1) bitwise shifts between activations and scaling factors, and (2) queries and additions of these results according to the binary matrices. To mitigate accuracy drops, we adopt multiple optimization objectives for the reparameterization. To further reduce memory usage and latency, we develop a mixed and automatic bit allocation strategy that enables extreme quantization of LLMs. Moreover, we introduce ShiftAddLoRA to fine-tune the post-training ShiftAddLLM, achieving both fast and accurate inference and fine-tuning. Extensive experiments on various LLMs and downstream language tasks consistently validate the effectiveness of ShiftAddLLM.
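A toy numeric sketch of the shift-and-add idea follows, assuming a greedy residual binarization with one power-of-two scale per binary matrix (a simplification of the per-group scaling factors described above). It is illustrative only and is not the ShiftAddLLM quantizer.

```python
# Toy illustration of shift-and-add reparameterization (not the ShiftAddLLM
# quantizer): approximate a weight matrix as a sum of {-1, +1} binary matrices,
# each scaled by a power of two, so a matrix-vector product needs only
# activation scaling by powers of two (bit shifts in fixed point) and
# sign-dependent additions.
import numpy as np

def shiftadd_decompose(w: np.ndarray, num_terms: int = 6):
    """Greedy residual decomposition: w ~= sum_k 2**e_k * B_k with B_k in {-1, +1}."""
    residual = w.copy()
    binaries, exponents = [], []
    for _ in range(num_terms):
        scale = np.abs(residual).mean()
        if scale == 0:
            break
        e = int(np.round(np.log2(scale)))        # snap each scale to a power of two
        b = np.where(residual >= 0, 1.0, -1.0)   # binary matrix
        binaries.append(b)
        exponents.append(e)
        residual = residual - (2.0 ** e) * b
    return binaries, exponents

def shiftadd_matvec(binaries, exponents, x: np.ndarray) -> np.ndarray:
    """Multiplication-free product: shifted activations combined by +-1 entries."""
    out = np.zeros(binaries[0].shape[0])
    for b, e in zip(binaries, exponents):
        shifted = x * (2.0 ** e)   # a bit shift in fixed-point hardware
        out += b @ shifted         # entries of b are +-1, so this is adds/subtracts only
    return out

w = np.random.randn(8, 16)
x = np.random.randn(16)
binaries, exponents = shiftadd_decompose(w)
print(np.abs(shiftadd_matvec(binaries, exponents, x) - w @ x).max())  # approximation error
```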
AutoAI2C: An Automated Hardware Generator for DNN Acceleration on both FPGA and ASIC
Yongan Zhang
Pengfei Xu
Yang Zhao
Cong Hao
Deming Chen
Yingyan Lin
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2024)
Recent advancements in Deep Neural Networks (DNNs) and the slowing of Moore’s law have made domain-specific hardware accelerators for DNNs (i.e., DNN chips) a promising means for enabling more extensive DNN applications. However, designing DNN chips is challenging due to (1) the vast and non-standardized design space and (2) different DNN models’ varying performance preferences regarding hardware micro-architecture and dataflows. Therefore, designing a DNN chip often takes a large team of inter-disciplinary experts months to years. To enable flexible and efficient DNN chip design, we propose AutoAI2C: a DNN chip generator that can automatically generate both FPGA- and ASIC-based DNN accelerator implementations (i.e., synthesizable hardware and deployment code) with optimized algorithm-to-hardware mapping, given the DNN model specification from mainstream machine learning frameworks (e.g., PyTorch). Specifically, AutoAI2C consists of two major components: (1) a Chip Predictor, which can efficiently and reliably predict a DNN accelerator’s energy, latency, and resource consumption using a customized graph-based intermediate accelerator representation, and (2) a Chip Builder, which can generate and optimize DNN accelerator designs by automatically exploring the design space based on target metrics and the Chip Predictor’s performance feedback. Extensive experiments show that the Chip Predictor’s predictions differ by less than 10% from real measured results. Furthermore, AutoAI2C-generated accelerators achieve performance comparable to or better than that of state-of-the-art accelerators (1.10x to 2.12x speedup), validating the effectiveness and advantages of AutoAI2C.
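The Chip Predictor / Chip Builder interplay described above amounts to a predictor-guided design-space search. The sketch below is purely hypothetical: the Design fields, the analytical cost model, and the exhaustive grid are illustrative stand-ins, not AutoAI2C's representation or API.

```python
# Hypothetical sketch of a predictor-guided exploration loop: a stand-in
# "Chip Predictor" scores candidate accelerator configurations and a stand-in
# "Chip Builder" keeps the fastest design within a resource budget.
import itertools
from dataclasses import dataclass

@dataclass
class Design:
    pe_rows: int        # processing-element array rows
    pe_cols: int        # processing-element array columns
    buffer_kb: int      # on-chip buffer size

def predict(design: Design, macs: int) -> dict:
    """Stand-in for the Chip Predictor: a crude analytical latency/resource model."""
    throughput = design.pe_rows * design.pe_cols          # MACs per cycle
    latency_cycles = macs / throughput
    resources = throughput * 1.0 + design.buffer_kb * 0.5
    return {"latency": latency_cycles, "resources": resources}

def explore(macs: int, resource_budget: float) -> Design:
    """Stand-in for the Chip Builder: pick the fastest design within budget."""
    best, best_latency = None, float("inf")
    for rows, cols, buf in itertools.product([4, 8, 16], [4, 8, 16], [64, 128, 256]):
        candidate = Design(rows, cols, buf)
        metrics = predict(candidate, macs)
        if metrics["resources"] <= resource_budget and metrics["latency"] < best_latency:
            best, best_latency = candidate, metrics["latency"]
    return best

print(explore(macs=10**9, resource_budget=300))
```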
EH-DNAS: End-to-End Hardware-aware Differentiable Neural Architecture Search
Qian Jiang
Deming Chen
Minh N. Do
Raymond Yeh
ICML 2023 Workshop on Differentiable Almost Everything: Differentiable Relaxations, Algorithms, Operators, and Simulators
In hardware-aware Differentiable Neural Architecture Search (DNAS), it is challenging to integrate hardware metrics into the network architecture search. To handle hardware metrics such as inference latency, existing works mainly rely on linear approximations and lack support for customized hardware. In this work, we propose End-to-end Hardware-aware DNAS (EH-DNAS), a seamless integration of an end-to-end differentiable approximation of hardware performance and a fully automated DNAS, to deliver hardware-efficient deep neural networks on various hardware, including Edge GPUs, Edge TPUs, Mobile CPUs, and customized accelerators. Given a target hardware platform, we learn a differentiable model that predicts the end-to-end hardware performance of candidate neural network architectures during DNAS. We also propose E2E-Perf, a benchmarking tool that extends our design to support customized accelerators. Experiments on CIFAR10 and ImageNet show that EH-DNAS improves hardware performance by an average of 1.5x over state-of-the-art efficient networks on customized accelerators and existing hardware processors, while maintaining highly competitive model inference accuracy.
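A minimal sketch of the core idea, folding a learned differentiable latency predictor into the DNAS objective, is given below. The shapes, the MLP predictor, and the loss weighting are assumptions for illustration; this is not the EH-DNAS or E2E-Perf implementation.

```python
# Minimal sketch: architecture parameters (softmaxed op weights per layer)
# feed a differentiable latency predictor, and the predicted latency is added
# to the task loss so architecture gradients trade accuracy against hardware
# cost. Illustrative only, not the EH-DNAS implementation.
import torch
import torch.nn as nn

num_layers, num_ops = 8, 5
arch_logits = nn.Parameter(torch.zeros(num_layers, num_ops))  # DNAS architecture parameters

latency_predictor = nn.Sequential(      # differentiable surrogate for the target hardware
    nn.Linear(num_layers * num_ops, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

def dnas_loss(task_loss: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    op_probs = torch.softmax(arch_logits, dim=-1)             # per-layer op distribution
    predicted_latency = latency_predictor(op_probs.flatten())
    return task_loss + lam * predicted_latency.squeeze()

# Example step with a placeholder task loss; in real use it would come from the supernet.
loss = dnas_loss(torch.tensor(1.0))
loss.backward()    # gradients flow to arch_logits through the latency predictor
```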
AutoDistill: An End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
Recently, large pre-trained models have significantly improved the performance of various Natural Language Processing (NLP) tasks, but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast-evolving models, considering serving performance, and optimizing for multiple objectives. To solve these problems, we propose AutoDistill, an end-to-end model distillation framework integrating model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. We use Bayesian Optimization to conduct multi-objective Neural Architecture Search for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on target hardware. Experiments on TPUv4i show the discovery of seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT. By running downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score, which is higher than BERT_BASE, DistilBERT, TinyBERT, NAS-BERT, and MobileBERT. The most compact model found by AutoDistill contains only 20.6M parameters but still outperforms BERT_BASE (109M), DistilBERT (67M), TinyBERT (67M), and MobileBERT (25.3M) in average GLUE score. By evaluating on SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, reducing parameters by more than 62% while maintaining higher accuracy than DistilBERT, TinyBERT, and NAS-BERT.
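The multi-objective selection mentioned above (accuracy versus serving latency) can be illustrated with a small Pareto-filtering sketch. This is illustrative only: the candidate fields and values are made up, and AutoDistill's actual search uses Bayesian Optimization, which this sketch does not implement.

```python
# Toy sketch of multi-objective candidate selection: compare student
# architectures on accuracy and serving latency and keep only Pareto-optimal
# ones. Candidate values are made up for illustration.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float     # higher is better (e.g., pre-training accuracy)
    latency_ms: float   # lower is better (e.g., latency on the target hardware)

def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep candidates not dominated on (accuracy, latency) by any other."""
    front = []
    for c in cands:
        dominated = any(
            o.accuracy >= c.accuracy and o.latency_ms <= c.latency_ms
            and (o.accuracy > c.accuracy or o.latency_ms < c.latency_ms)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return front

cands = [
    Candidate("A", accuracy=0.82, latency_ms=3.1),
    Candidate("B", accuracy=0.80, latency_ms=2.0),
    Candidate("C", accuracy=0.79, latency_ms=2.5),  # dominated by B
]
print([c.name for c in pareto_front(cands)])        # ['A', 'B']
```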