In this paper we construct a data set for semi-supervised acoustic model training by selecting spoken utterances from a massive collection of anonymized Google Voice Search utterances. Semi-supervised training usually retains high-confidence utterances which are presumed to have an accurate hypothesized transcript, a necessary condition for successful training. Selecting high confidence utterances can however restrict the diversity of the resulting data set. We propose to introduce a constraint enforcing that the distribution of the context-dependent state symbols obtained by running forced alignment of the hypothesized transcript matches a reference distribution estimated from a curated development set. The quality of the obtained training set is illustrated on large scale Voice Search recognition experiments and outperforms random selection of high-confidence utterances.