Multitask Training with Text Data for End-to-End Speech Recognition

Peidong Wang

Tara N Sainath

Ron J. Weiss

Interspeech (2021) (to appear)


Abstract

We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder in a listen, attend, and spell model by multitask training on both audio-text and text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method leads to an 11% relative performance improvement over the baseline and is comparable to language model shallow fusion, without requiring an additional neural network during decoding. We observe a similar trend on the whole 960-hour LibriSpeech training set. Analyses of sample output sentences demonstrate that the proposed method can incorporate language-level information, suggesting its effectiveness in real-world applications.
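
As a rough illustration of what training a decoder on both audio-text and text-only data could look like, the sketch below combines two token-level cross-entropy losses into one objective. The abstract does not say how the two tasks are combined, so the weighted sum, the `text_weight` parameter, and the function names are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cross_entropy(step_probs, targets):
    """Mean negative log-likelihood of the target token ids.

    step_probs: one probability distribution (list of floats) per decoder step.
    targets: the reference token id for each step.
    """
    total = sum(-math.log(probs[t]) for probs, t in zip(step_probs, targets))
    return total / len(targets)


def multitask_loss(paired_probs, paired_targets,
                   text_probs, text_targets, text_weight=0.5):
    """Hypothetical combined objective (an assumption, not from the paper).

    paired_*: decoder outputs and references for an audio-text batch.
    text_*:   decoder outputs and references for a text-only batch.
    text_weight: assumed scalar that balances the text-only task.
    """
    paired = cross_entropy(paired_probs, paired_targets)
    text_only = cross_entropy(text_probs, text_targets)
    return paired + text_weight * text_only


# Toy usage with a two-token vocabulary and uniform decoder outputs.
loss = multitask_loss([[0.5, 0.5]], [0], [[0.5, 0.5]], [1], text_weight=0.5)
```

In this toy setup both tasks share the same decoder outputs format, which mirrors the idea that the text-only task regularizes the decoder without adding a separate network at decoding time.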

Research Areas

Speech Processing
