Speculative cascades — A hybrid approach for smarter, faster LLM inference

September 11, 2025

Hari Narasimhan and Aditya Menon, Research Scientists, Google Research

We introduce “speculative cascades”, a new approach that improves LLM efficiency and computational costs by combining speculative decoding with standard cascades.

LLMs have transformed how we interact with technology, powering everything from advanced search capabilities to creative coding assistants. But this power comes at a cost: inference (the process of generating a response) can be slow and computationally expensive. As we deploy these models to more users, making them faster and less expensive without sacrificing quality is a critical challenge.

One way to accomplish this is to use cascades, which aim to optimize LLM efficiency by strategically using smaller, faster models before engaging a larger, more expensive LLM. This approach relies on a deferral rule, under which the smaller model decides whether it can handle a query or needs to pass the task to a more capable, but costlier, large model. The goal is to process as much as possible cheaply and quickly, only incurring the high cost of the large LLM for complex tasks that truly require its advanced capabilities, potentially yielding favorable cost-quality trade-offs. Cascades prioritize computational cost reduction and efficient resource allocation, while allowing for some variability in quality.
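To make the deferral idea concrete, here is a minimal sketch of a confidence-thresholded cascade in Python. The model wrappers, the generate_with_confidence method, and the 0.8 threshold are illustrative placeholders, not the paper's exact setup.

```python
def cascade_generate(prompt, small_model, large_model, threshold=0.8):
    """Two-model cascade with a simple confidence-based deferral rule."""
    # The small model answers first and reports a confidence score
    # (e.g., its average token log-probability mapped to [0, 1]).
    response, confidence = small_model.generate_with_confidence(prompt)
    if confidence >= threshold:
        return response  # Confident enough: serve the cheap answer.
    # Otherwise defer the entire query to the large model, a sequential
    # "wait-and-see" step that becomes the bottleneck discussed below.
    return large_model.generate(prompt)
```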

Another approach, speculative decoding, optimizes an LLM’s latency and throughput without altering the final result. It achieves this by employing a smaller, faster "drafter" model to predict a sequence of future tokens. These speculated tokens are then quickly verified in parallel by the larger “target” model. If the draft is accepted, the large model effectively generates multiple tokens in a single step, greatly accelerating the process while guaranteeing that the final output is identical to what the large model would have produced on its own. This approach prioritizes speed and latency reduction, potentially at the cost of increased memory usage and less computational savings, since the larger model still performs substantial work.
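As a rough illustration, the loop below sketches the draft-then-verify cycle of speculative decoding under greedy decoding; drafter.greedy_continue and target.greedy_next_tokens are hypothetical wrappers around the two models' forward passes.

```python
def speculative_decode(prompt_tokens, drafter, target, num_draft=4, max_len=256):
    """Greedy-verification sketch of speculative decoding.

    The drafter proposes `num_draft` tokens autoregressively; the target
    scores every draft position in one parallel forward pass and keeps the
    longest prefix that matches its own greedy choices, substituting its
    own token at the first mismatch. The final output is identical to
    greedy decoding with the target model alone.
    """
    tokens = list(prompt_tokens)
    while len(tokens) < max_len:
        # 1) Draft a short continuation with the small model (cheap, sequential).
        draft = drafter.greedy_continue(tokens, num_draft)
        # 2) Verify all draft positions with the large model in parallel;
        #    returns num_draft + 1 greedy predictions.
        target_preds = target.greedy_next_tokens(tokens, draft)
        accepted = []
        for drafted, verified in zip(draft, target_preds):
            if drafted == verified:
                accepted.append(drafted)   # Match: keep the drafted token.
            else:
                accepted.append(verified)  # Mismatch: take the target's token and stop.
                break
        else:
            accepted.append(target_preds[-1])  # Whole draft accepted: free bonus token.
        tokens.extend(accepted)
    return tokens
```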

In “Faster Cascades via Speculative Decoding”, we introduce “speculative cascades”, a new approach that combines the best of both cascades and speculative decoding. It delivers better LLM output quality at a lower computational cost than either technique alone, in part by selectively letting the smaller LLM's output stand for the sake of efficiency. We tested the new speculative cascading techniques against standard cascading and speculative decoding baselines using Gemma and T5 models on various language tasks, including summarization, translation, reasoning, coding, and question answering. The results show that the proposed speculative cascades achieve better cost-quality trade-offs, often yielding higher speed-ups and better quality metrics than the baselines.

A deeper look

To fully understand and appreciate the speculative cascades approach, we first compare cascades and speculative decoding with a simple example. Imagine you ask an LLM a straightforward question:

Prompt: "Who is Buzz Aldrin?"

Let's say we have two models available to answer this: a small, fast "drafter" model and a large, powerful "expert" model.

Here's how they might respond:

  • Small Model: Buzz Aldrin is an American former astronaut, engineer, and fighter pilot, best known as the second person to walk on the Moon.

  • Large Model: Edwin "Buzz" Aldrin, a pivotal figure in the history of space exploration, is an American former astronaut, engineer, and fighter pilot who is best known for being the second human to walk on the Moon.

Both models provide excellent, factually correct answers, but they interpret the user's intent slightly differently. The small model delivers a quick, factual summary, while the large model provides a more formal, encyclopedic-style entry. Depending on the user's need — be it a fast fact or a detailed overview — either response could be considered ideal. The key is that they represent two distinct, equally valid styles.

Now, let's see how the two main speed-up techniques handle this scenario.

With cascades, the small "drafter" model gets the prompt first. If it's confident in its answer, it replies. If not, it defers the entire task to the large "expert" model.

In our example:

  1. The small model generates its concise and correct answer.
  2. It checks its confidence and, finding it high, sends the response to the user.

This works! We get a great answer quickly. But the process is sequential. If the small model hadn't been confident, we would have wasted time waiting for it to finish, only to then start the large model from scratch. This sequential "wait-and-see" approach is a fundamental bottleneck.

With speculative decoding, the small model quickly drafts the first few tokens of the answer, and the large model verifies it in parallel, correcting the first mistake it finds.

In our example:

  1. The small model drafts the beginning of its answer: [Buzz, Aldrin, is, an, ...]
  2. The large model verifies this draft. Its own preferred first token is Edwin.
  3. Since Buzz ≠ Edwin, the very first token is a mismatch.
  4. The entire draft is rejected and the first token is replaced with Edwin. The process then repeats from this corrected point to generate the rest of the answer, but the initial speed advantage has been lost.

Even though the small model produced a good answer, the requirement to match the large model token-by-token forces a rejection. We lose the speed benefit and end up with an answer that is not necessarily superior. While the above example uses a simple token-matching rejection rule, the full paper also considers a “probabilistic match” that provides greater flexibility in the token-by-token comparison.
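For reference, the standard probabilistic acceptance test from the speculative decoding literature looks like the sketch below: a drafted token x is kept with probability min(1, p_large(x) / p_small(x)), and on rejection a replacement is drawn from the residual distribution, which keeps the output distributed exactly as the large model's. This is the textbook rule, not necessarily the precise variant analyzed in the paper; the distributions are assumed to be NumPy probability vectors.

```python
import numpy as np

def probabilistic_verify(x, p_small, p_large, rng):
    """Lossless acceptance test used in standard speculative sampling.

    `x` is the drafted token id; `p_small` and `p_large` are the small and
    large models' next-token distributions at the same position.
    """
    # Accept the drafted token with probability min(1, p_large[x] / p_small[x]).
    if rng.random() < min(1.0, p_large[x] / p_small[x]):
        return x, True
    # Otherwise sample a replacement from the renormalized residual
    # distribution max(0, p_large - p_small).
    residual = np.maximum(p_large - p_small, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False
```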

Different goals, different trade-offs

The "Buzz Aldrin" example reveals a fundamental difference between these two techniques, as summarized below:

SpecCascades-0.5-Table
SpecCascades-1-TradeOffs

A visual representation of the trade-offs offered by standard cascades (left) and speculative decoding (right). In both graphs, the green star is the small, fast model (low cost, lower quality) and the red star is the large, slow model (high cost, higher quality). The dots in the left graph represent different trade-offs offered by cascades by varying its confidence threshold; the blue star in the right graph represents the trade-off offered by speculative decoding.

Speculative cascades: Best of both worlds

Speculative cascades combine the idea of tiered processing from standard cascades with the speedup mechanism of speculative decoding. It involves a smaller model generating a "draft" output that a larger model then quickly verifies in parallel. The key innovation is replacing the strict verification of speculative decoding with a flexible “deferral rule”. This rule dynamically decides, on a token-by-token basis, whether to accept the small model's draft or defer to the large model. This avoids the sequential bottleneck of standard cascades while allowing the system to accept a good answer from the small model even if it doesn't exactly match the large model's preferred output.

In our example:

  1. The small model drafts the beginning of its answer: [Buzz, Aldrin, is, an, ...]
  2. Simultaneously, the large model evaluates the draft, providing its own scores.
  3. The crucial step: A flexible deferral rule looks at both outputs and decides whether a deferral is warranted.
  4. If the system decides not to defer, it accepts the small model's draft tokens. The process then efficiently repeats from this new point, drafting and verifying the next chunk of text until the answer is complete (sketched in code below).
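Putting the pieces together, one drafting-and-verification round might look like the sketch below. The model wrapper methods are illustrative, and for simplicity a deferral is resolved by taking the large model's most likely token; the deferral rule itself is just a function passed in as an argument.

```python
def speculative_cascade_step(tokens, drafter, target, defer_rule, num_draft=4):
    """One round of a speculative cascade (simplified sketch).

    The small model drafts a block of tokens, the large model scores the
    same positions in one parallel pass, and a pluggable `defer_rule`
    decides, token by token, whether to keep the draft or defer. Unlike
    standard speculative decoding, a drafted token can be kept even when
    it differs from the large model's own choice.
    """
    # Draft tokens plus the small model's next-token distributions.
    draft, p_small = drafter.draft_with_probs(tokens, num_draft)
    # Large model's distributions at the same positions, computed in parallel.
    p_large = target.score_positions(tokens, draft)
    accepted = []
    for i, token in enumerate(draft):
        if defer_rule(token, p_small[i], p_large[i]):
            # Defer: take the large model's token here and end this round.
            accepted.append(int(p_large[i].argmax()))
            break
        accepted.append(token)  # Keep the small model's drafted token.
    return tokens + accepted
```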

The power of this method lies in its flexibility, as the deferral rule can be tailored to different needs.

For example, we could tell the system to defer based on any of the following, sketched in code after this list:

  • A simple confidence check: Defer only if the small model isn't very confident in its own prediction.
  • A comparative check: Defer if the large model is significantly more confident than the small model.
  • A cost-benefit analysis: Defer only if the large model's confidence boost outweighs the "cost" of rejecting the small model's draft.
  • A token-specific check: Given an "approved list" of the best next words according to the large model (its top-ranked tokens), we defer if the small model's drafted token is not on this list.
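Here is how those four families of rules might look as plug-in functions, continuing the sketch above. Each takes the drafted token and the two models' next-token distributions (as probability vectors) and returns True to defer; the thresholds are hypothetical tuning knobs, not values from the paper.

```python
def confidence_rule(token, p_small, p_large, tau=0.7):
    # Defer only if the small model is not confident in its own drafted token.
    return p_small[token] < tau

def comparative_rule(token, p_small, p_large, margin=0.2):
    # Defer if the large model's confidence in its top choice exceeds the
    # small model's confidence in the drafted token by a clear margin.
    return p_large.max() - p_small[token] > margin

def cost_benefit_rule(token, p_small, p_large, lam=0.1):
    # Defer only if the large model's probability gain from swapping in its
    # own top token outweighs a fixed deferral "cost" lam.
    return (p_large.max() - p_large[token]) > lam

def top_k_rule(token, p_small, p_large, k=5):
    # Defer if the drafted token is not on the large model's top-k "approved list".
    top_k = p_large.argsort()[-k:]
    return token not in top_k
```

Any of these can be passed as the defer_rule argument in the earlier cascade step, which is exactly the kind of pluggable decision-making logic described above.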

This ability to plug in different decision-making logic is what gives speculative cascades their unique blend of speed, quality, and adaptability.

SpecCascades-2-BlockDiagram

Block diagram illustrating a speculative cascade between a small and large model. As with standard speculative decoding, the drafting process involves auto-regressive sampling from the small drafter model. However, the verification process is different: it considers the combined output distribution of both the small and large models via a deferral rule, rather than solely relying on the large model's output.

Below, we visualize the behavior of speculative cascading versus speculative decoding on a prompt from the GSM8K dataset. The prompt asks, “Mary has 30 sheep. She gets 1 kg of milk from half of them and 2 kg of milk from the other half every day. How much milk does she collect every day?” By carefully leveraging the small model's output on certain tokens, speculative cascading can reach a correct solution faster than regular speculative decoding.

SpecCascades-3-Comparison

Comparison of speculative cascades and speculative decoding on a grade school math question from the GSM8K dataset. The draft tokens are shown in yellow and the verified tokens in red. The speculative cascades approach generates the correct answer, and does so faster than speculative decoding.

Experiments

We tested speculative cascades on a range of benchmarks, including summarization, reasoning, and coding. The results show a clear advantage over speculative decoding. On a standard quality-versus-efficiency graph, speculative cascades consistently provide better trade-offs. This means for the same quality level as speculative decoding, our method is faster, i.e., generates more tokens per call to the larger model.

SpecCascades-4-Performance

Speculative cascades variants (blue and orange) achieve better quality-latency trade-offs compared to standard speculative decoding (green star) on math reasoning and summarization tasks. See paper for details.

Towards faster and smarter AI with speculative cascades

As LLMs become more integrated into daily applications, optimizing their performance isn’t just a technical goal; it’s a practical necessity. By rethinking how cascades and speculative decoding can work together, speculative cascades provide a more powerful and flexible tool for developers. This hybrid approach allows for fine-grained control over the cost-quality balance, paving the way for applications that are both smarter and faster.

Acknowledgements

This work is a collaborative effort with Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta and Sanjiv Kumar. We are grateful to Ananda Theertha Suresh and Ziteng Sun for their insightful discussions, and Yale Cong, Mark Simborg, and Kimberly Schwede for their help in crafting this blog.