SENTENCE-SELECT: LARGE-SCALE LANGUAGE MODEL DATA SELECTION FOR RARE-WORD SPEECH RECOGNITION

Cal Peyser
Ruoming Pang
Submitted to Interspeech 2022 (to appear)

Abstract

Language model fusion can help smart assistants recognize tail words which are rare in acoustic data but abundant in text-only corpora.
However, large-scale text corpora sourced from typed chat or search logs are often (1) prohibitively expensive to train on, (2) beset with content that is mismatched to the voice domain, and (3) heavy-headed rather than heavy-tailed (e.g., too many common search queries such as ``weather''), hindering downstream performance gains.
We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance.
First, to address heavy-headedness, we downsample the data according to a soft log function, which tunably downweights high-frequency (head) sentences.
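As a concrete illustration, a soft-log scheme could keep roughly logarithmically many copies of each repeated sentence; the exact function and the alpha parameter below are assumptions for the sketch, not the paper's formulation.

import math
from collections import Counter

def soft_log_downsample(sentences, alpha=1.0):
    """Sketch of soft-log head downsampling (illustrative, not the paper's exact function)."""
    counts = Counter(sentences)
    kept = []
    for sent, n in counts.items():
        # A sentence seen n times is kept about alpha * log(1 + n) times, so
        # very frequent (head) sentences shrink sharply while singletons survive.
        target = min(n, max(1, round(alpha * math.log1p(n))))
        kept.extend([sent] * target)
    return kept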
Second, to encourage rare-word accuracy, we explicitly filter for sentences with words which are rare in the acoustic data.
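A minimal sketch of this filter, assuming whitespace tokenization and an illustrative rarity threshold max_count that is not taken from the paper:

from collections import Counter

def filter_rare_word_sentences(candidates, acoustic_transcripts, max_count=5):
    """Keep candidate sentences containing at least one word that is rare
    (seen at most max_count times, including never) in the acoustic transcripts."""
    acoustic_counts = Counter(
        word for sent in acoustic_transcripts for word in sent.split()
    )
    selected = []
    for sent in candidates:
        if any(acoustic_counts[word] <= max_count for word in sent.split()):
            selected.append(sent)
    return selected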
Finally, we tackle domain mismatch by applying perplexity-based contrastive selection to filter for examples that are matched to the target domain.
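One standard way to realize perplexity-based contrastive selection is Moore-Lewis-style scoring with an in-domain LM and a general-domain LM; the log_prob interface and the threshold below are assumptions for illustration, not the paper's implementation.

def contrastive_select(candidates, in_domain_lm, general_lm, threshold=0.0):
    """Keep sentences that look more like the target (voice) domain than the
    general corpus they were drawn from. Both LMs are assumed to expose
    log_prob(sentence) returning a total log-probability (hypothetical API)."""
    selected = []
    for sent in candidates:
        n_tokens = max(1, len(sent.split()))
        # Per-token log-likelihood difference approximates the gap in
        # cross-entropy (i.e., log-perplexity) between the two LMs.
        score = (in_domain_lm.log_prob(sent) - general_lm.log_prob(sent)) / n_tokens
        if score > threshold:
            selected.append(sent)
    return selected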
We downselect a large corpus of web search queries by a factor of over 50x to train an LM, achieving better perplexity on the target acoustic domain than an LM trained without downselection.
When used with shallow fusion on a production-grade speech engine, it achieves a WER reduction of up to 24\% on rare-word sentences (without changing the overall WER) relative to a baseline LM trained on an unfiltered corpus.
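For reference, shallow fusion conventionally combines the speech model and the external LM log-linearly at decoding time with a tuned interpolation weight; the formulation below is the standard one rather than a detail stated in this abstract:

\[
y^{*} \;=\; \arg\max_{y}\; \bigl[\, \log P_{\mathrm{ASR}}(y \mid x) \;+\; \lambda \, \log P_{\mathrm{LM}}(y) \,\bigr],
\]

where $x$ is the acoustic input, $P_{\mathrm{ASR}}$ is the speech model's posterior, $P_{\mathrm{LM}}$ is the language model trained on the selected data, and $\lambda$ is the fusion weight.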