Language Model Cascades: Token-Level Uncertainty And Beyond

Neha Gupta

Hari Narasimhan

Ankit Singh Rawat

Wittawat Jitkrittum

Aditya Menon

Sanjiv Kumar

International Conference on Learning Representations (2024)

Download Google Scholar

Abstract

Recent advances in language model (LM) design has yielded a series of models with remarkably improved quality on complex NLP tasks, but significantly in-creased inference cost. A simple strategy to achieve more favourable cost-quality tradeoffs is cascading: here, a small model is invoked for most “easy” instances, while a large model is invoked for a few “hard” instances. Typically, “easy” in-stances are those where the small model has high confidence in its prediction.While the principles underpinning effective cascading are well-studied for classification problems, a similar understanding is lacking for generative tasks. The ex-tension of simple ”Chow” rule which defers based on the probability of predicting an answer is not straightforward for generative tasks where the number of output tokens is variable. Moreover, LMs are known to suffer from length bias where longer answers are penalized more as compared to shorter answers which complicates things further. In this work, we initiate a systematic study of deferral rules for cascades for language models. For example, how does one best summarise model confidence across a variable number of output tokens? We show experimentally that there is no one straight forward extension of probability based uncertainty for LMs which works well across all tasks. Via experiments on a range of bench-marks with FLAN-T5 models, we find that incorporating token-level uncertainty can significantly improve the cost-quality tradeoff of cascades. We further show that incorporating embeddings from the smaller model and intermediate layer embeddings from the larger model can further boost performance

Defining the technology of today and tomorrow.

Philosophy

People

Research areas

Foundational ML & Algorithms

Computing Systems & Quantum AI

Science, AI & Society

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Language Model Cascades: Token-Level Uncertainty And Beyond

Abstract

Meet the teams driving innovation