Semi-supervised training for End-to-End models via weak distillation

Bo Li; Ruoming Pang; Tara Sainath; Zelin Wu

Proc. ICASSP 2019 (to appear)

Abstract

End-to-end (E2E) models are a promising research direction in speech recognition, as a single all-neural E2E system offers a much simpler and more compact solution than a conventional model, which has separate acoustic (AM), pronunciation (PM), and language (LM) models.
However, it has been noted that E2E models perform poorly on tail words and proper nouns, likely because the training requires joint audio-text pairs, and does not take advantage of a large amount of text-only data used to train the LMs in conventional models.
There have been numerous efforts to train an RNN-LM on text-only data and fuse it into the end-to-end model.
In this work, we contrast this approach with training the E2E model on audio-text pairs generated from unsupervised speech data.
To target the proper noun issue specifically, we use a part-of-speech (POS) tagger to filter the unsupervised data, keeping only utterances that contain proper nouns.
We show that training with filtered unsupervised data provides up to a 13% relative reduction in word error rate (WER), and when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.
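
The proper-noun filtering step described above can be illustrated with a minimal sketch. It assumes each machine-generated transcript has already been tagged by some POS tagger that emits Penn Treebank tags (NNP, NNPS); the abstract does not say which tagger or tag set the authors used, so the tag set, the function names, and the sample utterances here are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumption: Penn Treebank tags, where NNP/NNPS mark singular/plural proper nouns.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}

# One utterance is a list of (word, POS tag) pairs produced by an external tagger.
TaggedUtterance = List[Tuple[str, str]]


def has_proper_noun(tagged: TaggedUtterance) -> bool:
    """Return True if any token in the utterance is tagged as a proper noun."""
    return any(tag in PROPER_NOUN_TAGS for _, tag in tagged)


def filter_proper_noun_utterances(
    utterances: Iterable[TaggedUtterance],
) -> List[str]:
    """Keep only transcripts that contain at least one proper noun."""
    return [
        " ".join(word for word, _ in tagged)
        for tagged in utterances
        if has_proper_noun(tagged)
    ]


# Hypothetical tagged transcripts, for illustration only.
examples = [
    [("play", "VB"), ("Taylor", "NNP"), ("Swift", "NNP")],
    [("set", "VB"), ("a", "DT"), ("timer", "NN")],
]
kept = filter_proper_noun_utterances(examples)
```

In this sketch only the first utterance survives the filter, since the second contains no token tagged NNP or NNPS. A real pipeline would pair each kept transcript with its audio before adding it to the training set.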
