Google Research

Conformer: Convolution-augmented Transformer for Speech Recognition

Interspeech 2020

Abstract

Recently, Transformer and convolutional neural network (CNN) based end-to-end models have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). In this work, we study how to combine convolutions and Transformers to model both the global interactions and the local patterns of an audio sequence in a parameter-efficient way. We propose the convolution-augmented Transformer for speech recognition, named Conformer. Conformer achieves state-of-the-art accuracies while being parameter-efficient, outperforming all previous models in ASR. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% (test/test-other) without a language model and 1.9%/3.9% with an external language model. Our small model with only 10M parameters achieves 2.7%/6.3%.
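The combination the abstract describes is realized in the paper as a Conformer block: two half-step ("macaron-style") feed-forward modules sandwiching a self-attention module and a convolution module, each with a residual connection. The sketch below shows that composition only; the module internals (the feed-forward, attention, and convolution placeholders, and the layer-norm placement) are simplified stand-ins, not the paper's exact layers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each time step over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x):
    # Placeholder for the paper's Linear -> Swish -> Linear module.
    return np.tanh(x)

def self_attention(x):
    # Placeholder single-head self-attention over the time axis
    # (models global interactions).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def conv_module(x, kernel=3):
    # Placeholder depthwise 1-D convolution: a per-channel moving
    # average over time (models local patterns).
    pad = kernel // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + kernel].mean(axis=0) for i in range(len(x))])

def conformer_block(x):
    # Macaron-style composition with residual connections.
    x = x + 0.5 * feed_forward(layer_norm(x))   # first half-step FFN
    x = x + self_attention(layer_norm(x))       # self-attention module
    x = x + conv_module(layer_norm(x))          # convolution module
    x = x + 0.5 * feed_forward(layer_norm(x))   # second half-step FFN
    return layer_norm(x)

x = np.random.default_rng(0).normal(size=(6, 8))  # (time, features)
y = conformer_block(x)  # same shape as the input sequence
```

The half-weighted feed-forward modules are the "macaron" design the paper adopts; stacking many such blocks over a convolutional subsampling front-end gives the full encoder.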
