Semi-supervised Training for End-to-End Models via Weak Distillation

Ruoming Pang
Proc. ICASSP 2019 (to appear)

Abstract

End-to-end (E2E) models are a promising research direction in speech recognition, as a single all-neural E2E system offers a much simpler and more compact solution than a conventional system, which has separate acoustic (AM), pronunciation (PM), and language models (LM).
However, E2E models have been noted to perform poorly on tail words and proper nouns, likely because training requires joint audio-text pairs and therefore cannot take advantage of the large amount of text-only data used to train the LM in conventional systems.
There have been numerous efforts to train an RNN-LM on text-only data and fuse it into the end-to-end model.
In this work, we contrast this approach with training the E2E model on audio-text pairs generated from unsupervised speech data.
To target the proper-noun issue specifically, we adopt a Part-of-Speech (POS) tagger to filter the unsupervised data, keeping only utterances whose transcripts contain proper nouns (a sketch of this step follows the abstract).
We show that training with the filtered unsupervised data provides up to a 13% relative reduction in word error rate (WER), and, when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.
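The abstract does not specify the filtering pipeline, so the following is only a minimal sketch under stated assumptions: pseudo-transcripts come from an already-trained teacher model (the teacher object and its transcribe method are hypothetical stand-ins for whatever decoder the teacher ASR system exposes), and proper nouns are detected with NLTK's off-the-shelf Penn Treebank tagger, where the NNP/NNPS tags mark proper nouns; the paper itself may use a different tagger.

    # Minimal sketch (not the paper's implementation): pseudo-label
    # unlabeled audio with a teacher model, then keep only utterances
    # whose hypothesis transcript contains a proper noun.
    import nltk

    nltk.download("punkt", quiet=True)                       # tokenizer model
    nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

    PROPER_NOUN_TAGS = {"NNP", "NNPS"}  # Penn Treebank proper-noun tags

    def has_proper_noun(transcript: str) -> bool:
        """True if the POS tagger finds at least one proper noun."""
        tokens = nltk.word_tokenize(transcript)
        return any(tag in PROPER_NOUN_TAGS for _, tag in nltk.pos_tag(tokens))

    def make_filtered_pairs(unlabeled_audio, teacher):
        """Build (audio, pseudo-transcript) pairs and keep only those
        whose transcript contains at least one proper noun.

        `teacher.transcribe(audio)` is a hypothetical stand-in for the
        teacher system's decoding call.
        """
        pairs = ((audio, teacher.transcribe(audio)) for audio in unlabeled_audio)
        return [(audio, text) for audio, text in pairs if has_proper_noun(text)]

Filtering this way biases the added semi-supervised training data toward the proper-noun-heavy utterances that the abstract identifies as the E2E model's weak point.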
