Are ever larger octopi still influenced by reporting biases?

Abstract

Language models (LMs) trained on raw text have no access to the physical world. Gordon and Van Durme (2013) argue that they therefore suffer from reporting bias: texts rarely state commonsensical facts about the world and far more often mention non-commonsensical facts and events. If LMs naively overfit to the co-occurrence statistics of their training corpora, they learn a biased view of the physical world. While prior studies have repeatedly verified that smaller-scale LMs (e.g. RoBERTa, GPT-2) amplify reporting bias, it remains unknown whether this trend persists when models are scaled up. We investigate reporting bias in larger language models (LLMs) such as PaLM and GPT-3. Specifically, we query LLMs for the colour of objects, using colour as a representative property of visual commonsense. Surprisingly, we find that LLMs significantly outperform smaller LMs at answering queries about an object's typical colour. Their predictions deviate from corpus co-occurrence statistics induced from resources such as Google Books Ngram and are closer to human judgement. We take this as evidence that larger LMs can overcome reporting bias, rather than exhibiting the inverse scaling behaviour previously suggested.
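
To make the setup concrete, the following is a minimal sketch of the kind of colour probing described above; the prompt template, colour inventory, and sentence_logprob interface are illustrative assumptions, not the exact protocol used in the paper.

import math
from typing import Callable, Dict

COLOURS = ["red", "orange", "yellow", "green", "blue", "purple",
           "brown", "black", "white", "grey", "pink"]

def lm_colour_distribution(sentence_logprob: Callable[[str], float],
                           obj: str) -> Dict[str, float]:
    """Score each colour as a cloze completion and softmax-normalise.

    sentence_logprob(text) is assumed to return the log-probability that
    the probed LM assigns to the full sentence `text`.
    """
    logps = {c: sentence_logprob(f"Most {obj}s are {c}.") for c in COLOURS}
    peak = max(logps.values())
    unnorm = {c: math.exp(lp - peak) for c, lp in logps.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def ngram_colour_distribution(cooc_counts: Dict[str, int]) -> Dict[str, float]:
    """Normalise raw '<colour> <object>' co-occurrence counts (e.g. hits
    from Google Books Ngram) into a distribution over the same colours."""
    total = sum(cooc_counts.get(c, 0) for c in COLOURS) or 1
    return {c: cooc_counts.get(c, 0) / total for c in COLOURS}

Comparing both distributions against human colour judgements for each object (for instance, by rank correlation) then indicates whether an LM's predictions track corpus co-occurrence statistics or human intuition more closely.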