A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints

Susan Wei
2025

Abstract

This paper investigates the theoretical underpinnings of the widely successful
pretrain-then-adapt strategy for foundation models. We introduce a Bayesian
model selection criterion, termed the downstream free energy, which quantifies
the adaptability of a pretrained checkpoint by measuring, under the downstream
data distribution, the concentration of favorable solutions near the checkpoint.
However, minimizing this downstream free energy is infeasible without access to
downstream data. To address this, we show that, under certain conditions,
minimizing the upstream free energy – which can be estimated using only upstream
data – can serve as a reliable proxy. We validate this theoretical insight through
preliminary experiments, showing that commonly used pretraining heuristics
effectively lower upstream free energy, leading to better downstream performance.