- Kevin Hu
- Rohit Prabhavalkar
- Sepand Mavandadi
- Tara N Sainath
- Trevor Deatrick Strohman
- Weiran Wang
- Yanzhang (Ryan) He
Abstract
Text-only training and semi-supervised training on audio-only data have recently gained popularity because unlabeled text and speech data are widely available. In this work, we propose text-only and semi-supervised training for attention-decoder based deliberation. We incorporate text-only data by training a Bidirectional Encoder Representations from Transformers (BERT) model for the deliberation text encoder and by joint acoustic and text decoder (JATD) training, and we apply semi-supervised training with a conventional model as the teacher. Together, these techniques achieve up to 11.7% word error rate (WER) reduction compared to the baseline deliberation model. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces WER by 8% relative on Google Voice Search with reasonable endpointing latencies. We also show that the deliberation model is preferred over LM rescoring in a human side-by-side evaluation.