Query Language Modeling for Voice Search

Johan Schalkwyk
Thorsten Brants
Vida Ha
Boulos Harb
Will Neveitt
Peng Xu
Proceedings of the 2010 IEEE Workshop on Spoken Language Technology, IEEE, pp. 127-132


This paper presents an empirical exploration of language modeling for the google.com query stream. We describe the normalization of the typed query stream, which results in out-of-vocabulary (OoV) rates below 1% for a one-million-word vocabulary. We present a comprehensive set of experiments that guided the design decisions for a voice search service. In the process we rediscovered a lesser-known interaction between Kneser-Ney smoothing and entropy pruning, and found empirical evidence that hints at non-stationarity of the query stream, as well as a strong dependence on the English locale (USA, Britain, and Australia).
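
To make the OoV figure concrete, the following is a minimal sketch of how an out-of-vocabulary rate could be measured at the token level. It is not the paper's pipeline: the function names `build_vocabulary` and `oov_rate`, the whitespace tokenization, the frequency-based vocabulary selection, and the toy queries are all assumptions made for illustration, and the queries are taken to be already normalized.

```python
from collections import Counter


def build_vocabulary(train_queries, max_size):
    """Keep the max_size most frequent whitespace-separated words (assumed selection rule)."""
    counts = Counter(word for query in train_queries for word in query.split())
    return {word for word, _ in counts.most_common(max_size)}


def oov_rate(test_queries, vocabulary):
    """Fraction of test tokens that do not appear in the vocabulary."""
    tokens = [word for query in test_queries for word in query.split()]
    if not tokens:
        return 0.0
    missing = sum(1 for word in tokens if word not in vocabulary)
    return missing / len(tokens)


# Toy data, invented for this example only.
train = ["weather in new york", "new york pizza", "pizza near me"]
test = ["weather near me", "sushi in new york"]

vocab = build_vocabulary(train, max_size=1_000_000)
rate = oov_rate(test, vocab)
print(f"OoV rate: {rate:.2%}")
```

In this toy setup only "sushi" is missing from the vocabulary, so one of the seven test tokens is out of vocabulary. On real query data the rate would depend heavily on the normalization applied before counting.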