SENTENCE-SELECT: LARGE-SCALE LANGUAGE MODEL DATA SELECTION FOR RARE-WORD SPEECH RECOGNITION

W. Ronny Huang; Cal Peyser; Tara N Sainath; Ruoming Pang; Trevor Deatrick Strohman; Shankar Kumar

Submitted to Interspeech 2022 (to appear)

Abstract

Language model fusion can help smart assistants recognize tail words which are rare in acoustic data but abundant in text-only corpora.
However, large-scale text corpora sourced from typed chat or search logs are often (1) prohibitively expensive to train on, (2) beset with content that is mismatched to the voice domain, and (3) heavy-headed rather than heavy-tailed (e.g., too many common search queries such as "weather"), hindering downstream performance gains.
We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance.
First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces the number of high-frequency (head) sentences.
Second, to encourage rare-word accuracy, we explicitly filter for sentences with words which are rare in the acoustic data.
Finally, we tackle domain mismatch by applying perplexity-based contrastive selection to filter for examples that are matched to the target domain.
We downselect a large corpus of web search queries by a factor of over 50x to train an LM, achieving better perplexities on the target acoustic domain than without downselection.
When used with shallow fusion on a production-grade speech engine, the resulting LM achieves a WER reduction of up to 24% on rare-word sentences (without changing the overall WER) relative to a baseline LM trained on an unfiltered corpus.
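
The three selection strategies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the form of the soft log function, so `soft_log_target` uses a hypothetical one (`tau * log(1 + count / tau)`) that leaves rare sentences roughly untouched and compresses frequent ones logarithmically. The rare-word threshold `max_count`, the score threshold `min_score`, and the add-one-smoothed unigram models standing in for real language models are likewise assumptions made only for illustration.

```python
import math
from collections import Counter


def soft_log_target(count, tau):
    """Hypothetical soft log: about `count` when count << tau,
    about tau * log(count / tau) when count >> tau."""
    return tau * math.log1p(count / tau)


def downsample_counts(sentence_counts, tau=10.0):
    """Step 1: shrink the count of each unique sentence, keeping at least one copy."""
    return {s: max(1, round(soft_log_target(c, tau)))
            for s, c in sentence_counts.items()}


def has_rare_word(sentence, acoustic_word_counts, max_count=1):
    """Step 2: true if any word occurs at most `max_count` times in the acoustic data."""
    return any(acoustic_word_counts.get(w, 0) <= max_count
               for w in sentence.split())


def unigram_scorer(corpus):
    """Toy stand-in for an LM: add-one-smoothed unigram, average log-prob per word."""
    counts = Counter(w for s in corpus for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def avg_logprob(sentence):
        words = sentence.split()
        return sum(math.log((counts[w] + 1) / (total + vocab))
                   for w in words) / len(words)

    return avg_logprob


def contrastive_score(sentence, target_lp, background_lp):
    """Step 3: per-word log-prob gap; higher means closer to the target domain.
    This is equivalent to a difference of log-perplexities."""
    return target_lp(sentence) - background_lp(sentence)


def select(sentence_counts, acoustic_word_counts, target_corpus,
           background_corpus, tau=10.0, max_count=1, min_score=0.0):
    """Apply the three steps in the order the abstract lists them."""
    target_lp = unigram_scorer(target_corpus)
    background_lp = unigram_scorer(background_corpus)
    selected = {}
    for sentence, count in downsample_counts(sentence_counts, tau).items():
        if not sentence.split():
            continue  # skip empty sentences
        if not has_rare_word(sentence, acoustic_word_counts, max_count):
            continue
        if contrastive_score(sentence, target_lp, background_lp) < min_score:
            continue
        selected[sentence] = count
    return selected
```

In this sketch a head query such as "weather" is first compressed from thousands of copies to a few dozen and then dropped by the rare-word filter, whereas a sentence containing a word unseen in the acoustic data survives if it also scores as closer to the target domain than to the background corpus.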

Research Areas

Natural language processing
