Block level verification accelerates speculative decoding

Ziteng Sun
Uri Mendlovic
Yaniv Leviathan
Asaf Aharoni
Ahmad Beirami
2025

Abstract

Speculative decoding has been shown to be an effective method for lossless acceleration of large language models during inference. In each iteration, the algorithm first uses a smaller model to draft a block of tokens. These tokens are then verified by the large model in parallel, and only a subset of them is kept, guaranteeing that the distribution of the final output is identical to that of the large model. In prior speculative decoding work, the draft verification is performed token-by-token, independently. Somewhat surprisingly, we show that this approach is sub-optimal. We propose a simple, easy-to-implement draft verification algorithm that provides additional wall-clock speedup by verifying the entire block jointly, without incurring additional computation cost or draft tokens. We show that the proposed mechanism is never worse than standard token-level verification and is optimal in the expected number of accepted tokens. We also provide another variant of the verification algorithm, which might be of independent interest. We empirically evaluate the proposed block-level verification algorithm on a wide range of tasks and datasets, and observe consistent improvements in wall-clock speedup compared to the standard token-level verification algorithm. While the improvements are modest, the change is minimal and adds no code complexity or other overhead. We recommend that our block verification algorithm be used by default in speculative decoding implementations.
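For context on the baseline the abstract refers to, the sketch below shows the standard token-level verification rule from prior speculative decoding work: each drafted token is accepted with probability min(1, p(x)/q(x)), and on the first rejection a corrective token is resampled from the residual distribution proportional to max(p − q, 0). This is a minimal illustration under assumed inputs (per-position draft and target distributions); the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def token_level_verify(draft_tokens, q_probs, p_probs, rng=None):
    """Sketch of standard token-by-token draft verification.

    draft_tokens: token ids proposed by the small (draft) model.
    q_probs[i]:   draft-model distribution over the vocabulary at position i.
    p_probs[i]:   large-model distribution at position i (computed in parallel);
                  assumed to have one extra entry for the bonus token.
    Returns the accepted prefix plus one corrective/bonus token, so the output
    is distributed exactly as if sampled from the large model alone.
    """
    rng = rng or np.random.default_rng()
    accepted = []
    for i, x in enumerate(draft_tokens):
        # Accept drafted token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p_probs[i][x] / q_probs[i][x]):
            accepted.append(x)
        else:
            # On rejection, resample from the residual distribution
            # proportional to max(p - q, 0), then stop this iteration.
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted
    # All draft tokens accepted: sample one bonus token from the large
    # model's distribution at the next position.
    accepted.append(int(rng.choice(len(p_probs[-1]), p=p_probs[-1])))
    return accepted
```

The paper's contribution replaces this independent per-token accept/reject decision with a joint decision over the whole drafted block; the sketch above only illustrates the token-level baseline it is compared against.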