JOINT PHONEME-GRAPHEME MODEL FOR END-TO-END SPEECH RECOGNITION

Yotaro Kubo; Michiel Bacchiani

Proc. ICASSP 2020 (to appear)

Abstract

This paper proposes methods to improve a commonly used end-to-end speech recognition model, Listen-Attend-Spell (LAS).
The methods we propose use multi-task learning to improve generalization of the model by leveraging information from multiple labels.
The focus in this paper is on multi-task models for simultaneous signal-to-grapheme and signal-to-phoneme conversions while sharing the encoder parameters.
Since phonemes are designed to be a precise description of the linguistic aspects of the speech signal, using phoneme recognition as an auxiliary task can help guide the early stages of training and make them more stable.
In addition to conventional multi-task learning, we obtain further improvements by introducing a method that exploits dependencies between labels in different tasks, specifically between phoneme and grapheme sequences. Conventional multi-task learning assumes these sequences to be independent. This paper instead proposes a joint model based on "iterative refinement", in which dependency modeling is achieved by a multi-pass strategy.
The proposed method is evaluated on a 28,000-hour corpus of Japanese speech data. The performance of a conventional multi-task approach is contrasted with that of the joint model with iterative refinement.
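
As a rough illustration of the conventional multi-task setup the abstract describes, the sketch below combines a grapheme loss and a phoneme loss into one training objective. It is only a sketch under assumptions: the function names, the weighted-sum form, and the `phoneme_weight` parameter are hypothetical and not taken from the paper. The logits are assumed to come from two decoders that read the same shared encoder output, and summing the two losses treats the label sequences as independent given the audio, which is the assumption the joint model is meant to relax.

```python
import math


def softmax_cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def sequence_loss(step_logits, targets):
    """Mean per-step cross-entropy over one label sequence."""
    total = sum(softmax_cross_entropy(l, t) for l, t in zip(step_logits, targets))
    return total / len(targets)


def multitask_loss(grapheme_logits, grapheme_targets,
                   phoneme_logits, phoneme_targets,
                   phoneme_weight=0.5):
    """Weighted sum of the main (grapheme) and auxiliary (phoneme) losses.

    `phoneme_weight` is a hypothetical hyperparameter; the weighted-sum
    form is an assumption made for this illustration.
    """
    g = sequence_loss(grapheme_logits, grapheme_targets)
    p = sequence_loss(phoneme_logits, phoneme_targets)
    return g + phoneme_weight * p


# Toy usage: one decoding step per task, with uniform scores.
loss = multitask_loss([[0.0, 0.0]], [0], [[0.0, 0.0, 0.0, 0.0]], [1])
```

Because both terms are back-propagated through the same encoder, the auxiliary phoneme term influences the encoder parameters even though only the grapheme output is used at recognition time.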

Research Areas

Machine intelligence
