What Are The Odds? Language Models are Capable of Probabilistic Reasoning
Abstract
Language models (LMs) are capable of remarkably complex linguistic tasks; however, numerical reasoning is an area in which they frequently struggle. An important but rarely evaluated form of reasoning is understanding probability distributions. In this paper, we focus on evaluating the probabilistic reasoning capabilities of LMs using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities. We find that zero-shot performance varies dramatically across different families of distributions and that performance can be improved significantly by using anchoring examples (shots) from within a distribution or, to a lesser extent, from other distributions within the same family. For real-world distributions, the absence of in-context examples can be compensated for by providing context from which the LM can retrieve relevant statistics. Finally, we show that simply providing the mean and standard deviation of a real-world distribution improves performance. To conduct this work, we developed a comprehensive benchmark dataset of distributions with associated question-answer pairs, covering population health, climate, and finance, which we release publicly.