SENTENCE-SELECT: LARGE-SCALE LANGUAGE MODEL DATA SELECTION FOR RARE-WORD SPEECH RECOGNITION

W. Ronny Huang

Cal Peyser

Tara N Sainath

Ruoming Pang

Trevor Deatrick Strohman

Shankar Kumar

Submitted to Interspeech 2022 (to appear)


Abstract

Language model fusion can help smart assistants recognize tail words which are rare in acoustic data but abundant in text-only corpora.
However, large-scale text corpora sourced from typed chat or search logs are often (1) prohibitively expensive to train on, (2) beset with content that is mismatched to the voice domain, and (3) heavy-headed rather than heavy-tailed (e.g., too many common search queries such as "weather"), hindering downstream performance gains.
We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance.
First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces the number of high-frequency (head) sentences.
Second, to encourage rare-word accuracy, we explicitly filter for sentences with words which are rare in the acoustic data.
Finally, we tackle domain mismatch by applying perplexity-based contrastive selection to filter for examples which are matched to the target domain.
We downselect a large corpus of web search queries by a factor of over 50x to train an LM, achieving better perplexities on the target acoustic domain than without downselection.
When used with shallow fusion on a production-grade speech engine, the resulting LM achieves a WER reduction of up to 24% on rare-word sentences (without changing the overall WER) relative to a baseline LM trained on an unfiltered corpus.
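
The three selection strategies can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact soft log function, the rarity threshold, or the language models used for contrastive selection. The sketch assumes a soft log of the form threshold * log(1 + count / threshold), treats a word as rare when its acoustic-data count is at or below a hypothetical `max_count`, and uses add-one-smoothed unigram models as stand-ins for the target-domain and background LMs in a cross-entropy-difference score. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def soft_log_target(count, threshold):
    # Assumed soft log: close to `count` when count << threshold,
    # and growing only logarithmically when count >> threshold.
    return threshold * math.log1p(count / threshold)


def downsample_head(sentences, threshold):
    # Keep a reduced number of copies of each distinct sentence so that
    # head sentences are compressed while tail sentences survive.
    kept = []
    for sentence, count in Counter(sentences).items():
        n_keep = min(count, max(1, round(soft_log_target(count, threshold))))
        kept.extend([sentence] * n_keep)
    return kept


def has_rare_word(sentence, acoustic_word_counts, max_count):
    # A word is treated as rare if it occurs at most `max_count` times
    # in the acoustic transcripts (unseen words count as zero).
    return any(acoustic_word_counts.get(w, 0) <= max_count
               for w in sentence.split())


def unigram_nll_fn(corpus, vocab_size):
    # Stand-in LM: add-one-smoothed unigram model that returns the
    # per-word negative log-likelihood (log perplexity) of a sentence.
    counts = Counter(w for s in corpus for w in s.split())
    total = sum(counts.values())

    def per_word_nll(sentence):
        words = sentence.split()
        nll = -sum(math.log((counts.get(w, 0) + 1) / (total + vocab_size))
                   for w in words)
        return nll / max(len(words), 1)

    return per_word_nll


def contrastive_select(sentences, target_nll, background_nll, max_diff=0.0):
    # Keep sentences that the target-domain LM finds more likely than
    # the background LM (log-perplexity difference below `max_diff`).
    return [s for s in sentences
            if target_nll(s) - background_nll(s) < max_diff]
```

In this sketch the three filters are independent functions, so they can be applied in sequence to a corpus; the order and the thresholds are tuning choices that the abstract does not specify.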
