Google Research

Improving Deliberation by Text-Only and Semi-Supervised Training

Interspeech 2022 (to appear)


Text-only training and semi-supervised training on audio-only data have gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose text-only and semi-supervised training for attention-decoder-based deliberation. By incorporating text-only data to train a Bidirectional Encoder Representations from Transformers (BERT) model for the deliberation text encoder, applying joint acoustic and text decoder (JATD) training, and using semi-supervised training with a conventional model as a teacher, we achieve up to 11.7% word error rate (WER) reduction compared to the baseline deliberation model. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the WER by 8% relative for Google Voice Search with reasonable endpointing latencies. We also show that the deliberation model achieves a positive human side-by-side evaluation compared to LM rescoring.
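
The results above are stated as WER reductions, including one explicitly relative figure. As a rough illustration of how such figures are typically computed, the sketch below scores a hypothesis against a reference by word-level edit distance and then derives a relative reduction. The function names, the example utterances, and the WER values of 10.0% and 9.2% are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical utterance: one substitution and one deletion over four words.
wer = word_error_rate("play some jazz music", "play sum jazz")

# Hypothetical systems: a baseline at 10.0% WER and a new model at 9.2% WER.
reduction = relative_reduction(0.100, 0.092)
```

Under these assumed numbers, a drop from 10.0% to 9.2% WER is 0.8 points absolute but 8% relative, which is why it matters whether a reported reduction is absolute or relative.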
