On deconstructing ensemble models

Bill Heavlin
American Statistical Association, Alexandria, VA (2015), pp. 1294-1305 (to appear)


Consider a prediction problem with correlated predictors. In such a case, the best model specification, that is, the best subset of active predictors, can be ambiguous. In spite of this ambiguity, a forecast that informs a high-stakes decision warrants a compact, informative description of the model that produces it. For forecasts based on ensemble models, such descriptions are not straightforward. Our example considers searches on google.com; each observation consists of one experiment that changes the details of how the system responds to user queries. Our predictors measure the changes in short-term metrics relative to a contemporaneous control. Our response measures a shift in user behavior observable only after a longer term, also calculated relative to the control. Our ensemble of models comes from a spike-and-slab regression. We represent each member of the ensemble, that is, each model, by its specification, a vector of booleans denoting the active predictors. For each such model we calculate its associated goodness of fit. Applying logic regression to predict goodness of fit as a function of the specification booleans, we obtain a metamodel. As a weighted sum of boolean expressions, the metamodel provides a description that is both parsimonious and illuminating.
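
The metamodel step can be illustrated with a minimal sketch. It is not the paper's implementation: true logic regression searches over logic trees (typically by simulated annealing), whereas this stand-in greedily picks from a small, fixed set of boolean expressions (single predictors, pairwise ANDs, pairwise ORs) by least squares. The specification vectors and goodness-of-fit scores below are synthetic, and the names `candidate_terms` and `fit_metamodel` are hypothetical.

```python
import itertools
import numpy as np


def candidate_terms(n_predictors):
    """Enumerate simple boolean expressions over the specification bits.

    Assumption: only single predictors and pairwise AND/OR are considered,
    a much smaller search space than real logic regression explores.
    """
    terms = []
    for j in range(n_predictors):
        terms.append((f"x{j}", lambda S, j=j: S[:, j]))
    for j, k in itertools.combinations(range(n_predictors), 2):
        terms.append((f"(x{j} AND x{k})", lambda S, j=j, k=k: S[:, j] & S[:, k]))
        terms.append((f"(x{j} OR x{k})", lambda S, j=j, k=k: S[:, j] | S[:, k]))
    return terms


def fit_metamodel(specs, fit_scores, max_terms=2):
    """Greedy forward selection of boolean terms by least squares.

    Returns the chosen term names and the weights (intercept first), so the
    metamodel reads as a weighted sum of boolean expressions.
    """
    specs = np.asarray(specs, dtype=bool)
    y = np.asarray(fit_scores, dtype=float)
    features = [(name, f(specs).astype(float))
                for name, f in candidate_terms(specs.shape[1])]
    chosen = []
    design = np.ones((len(y), 1))      # start with an intercept-only model
    coef = np.array([y.mean()])
    for _ in range(max_terms):
        best = None
        for name, col in features:
            if name in chosen:
                continue
            X = np.column_stack([design, col])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            rss = float(np.sum((y - X @ beta) ** 2))
            if best is None or rss < best[0]:
                best = (rss, name, X, beta)
        _, name, design, coef = best   # keep the term that lowers RSS most
        chosen.append(name)
    return chosen, coef


# Synthetic ensemble: 200 models over 4 predictors (illustrative data only).
rng = np.random.default_rng(0)
specs = rng.random((200, 4)) < 0.5
# Hypothetical goodness of fit: high when x0 and x1 are both active,
# with a smaller bonus for x3, plus a little noise.
scores = (1.0 + 2.0 * (specs[:, 0] & specs[:, 1]) + 0.5 * specs[:, 3]
          + rng.normal(0.0, 0.05, 200))

terms, weights = fit_metamodel(specs, scores, max_terms=2)
print("intercept:", round(float(weights[0]), 2))
for name, w in zip(terms, weights[1:]):
    print(name, round(float(w), 2))
```

In a real analysis the rows of `specs` would be the inclusion indicators sampled by the spike-and-slab regression, and `scores` would be each sampled model's goodness of fit; the greedy search is only a convenient substitute for the logic-regression fit described above.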