How Does Code Pretraining Affect Language Model Task Performance?
Abstract
Large language models are typically pretrained on a corpus of natural language text. In recent years, the desire to create language models that can interpret and generate code in different programming languages has led to the inclusion of non-linguistic code in pretraining corpora. Beyond aiding programming-related tasks, anecdotal evidence suggests that including code in pretraining may also improve performance on other, unrelated tasks. To study this, we pretrain suites of language models on parameterized "code mixture" datasets that interleave natural language and code under two settings: competitive, in which the total volume of data seen during pretraining is held constant, and additive, in which the volume of language data is held constant. We study how the pretraining mixture affects (a) general reasoning, measured on BigBench tasks, and (b) compositionality, measured by generalization accuracy on compositional benchmarks after finetuning. We find that larger proportions of code in the pretraining mixture improve performance on compositional and reasoning tasks involving structured formal outputs (such as semantic parsing and arithmetic) and, conversely, that code harms performance on purely linguistic and world-knowledge tasks.
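To make the two mixture settings concrete, the sketch below shows one way the token budgets could be parameterized; the function name, the interpretation of the code fraction, and the token counts are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the competitive vs. additive mixture regimes described above.
# All names and numbers here are hypothetical and for illustration only.

def build_mixture(lang_tokens: int, code_fraction: float, setting: str) -> dict:
    """Return token budgets for natural language and code under one regime.

    competitive: the total token budget is fixed, so code displaces language data.
    additive:    the language token budget is fixed, and code is added on top.
    """
    assert 0.0 <= code_fraction < 1.0, "code_fraction is the share of code in the final mixture"
    if setting == "competitive":
        total = lang_tokens
        code = int(total * code_fraction)
        language = total - code
    elif setting == "additive":
        language = lang_tokens
        # Choose the code budget so that code / (language + code) == code_fraction.
        code = int(language * code_fraction / (1.0 - code_fraction))
        total = language + code
    else:
        raise ValueError(f"unknown setting: {setting}")
    return {"language_tokens": language, "code_tokens": code, "total_tokens": total}


# Example: a 10B-token language budget with a 25% code share under each regime.
print(build_mixture(10_000_000_000, 0.25, "competitive"))  # code displaces 2.5B language tokens
print(build_mixture(10_000_000_000, 0.25, "additive"))     # ~3.3B code tokens added on top
```

The design choice this illustrates is that the competitive setting isolates the effect of replacing language data with code at a fixed compute/data budget, while the additive setting isolates the effect of extra code given the same amount of language data.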