Maximum Entropy (MaxEnt) Language Models (LMs) are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework by optimizing a convex loss function. In addition to their flexibility, a key advantage is their scalability, both in model size and in the amount of data that can be used during training. We present two contributions to MaxEnt training: (1) we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted, using relatively small amounts of transcribed data, to better match the test distribution of speech recognition; (2) we propose a novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model.
We test the impact of these approaches on Google's state-of-the-art speech recognizer for the tasks of voice-search transcription and dictation. Training 10B-parameter models on corpora of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Human evaluations also show strong and significant improvements across a wide range of domains when non-linguistic signals are used. For example, adapting to geographical domains (e.g., US states and cities) affects about 4% of test utterances, with a 2:1 win-to-loss ratio.