Conformer: Convolution-augmented Transformer for Speech Recognition

Anmol Gulati
Chung-Cheng Chiu
James Qin
Jiahui Yu
Niki Parmar
Ruoming Pang
Shibo Wang
Wei Han
Yu Zhang
Zhengdong Zhang
Interspeech (2020)

Abstract

Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). In this work, we study how to combine convolutions and transformers to model both the global interactions and the local patterns of an audio sequence in a parameter-efficient way. We propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer achieves state-of-the-art accuracies while being parameter-efficient, outperforming the previous Transformer- and CNN-based models. On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% on test/test-other without using a language model, and 1.9%/3.9% with an external language model. Our small model with only 10M parameters achieves 2.7%/6.3%.
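For concreteness, below is a minimal PyTorch-style sketch of one convolution-augmented (Conformer-style) encoder block: a self-attention module for global interactions and a depthwise-convolution module for local patterns, sandwiched between two half-step feed-forward modules with a final layer normalization. The dimensions, kernel size, dropout rates, and the use of standard (rather than relative-positional) multi-head attention are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class ConvModule(nn.Module):
    """Convolution module: pointwise conv + GLU, depthwise conv, BatchNorm, Swish, pointwise conv."""
    def __init__(self, dim, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()  # Swish activation
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)       # -> (batch, dim, time) for Conv1d
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y)).transpose(1, 2)
        return x + y                           # residual connection


class FeedForward(nn.Module):
    """Pre-norm feed-forward module with Swish activation."""
    def __init__(self, dim, expansion=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, expansion * dim),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(expansion * dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class ConformerBlock(nn.Module):
    """Half-step FFN, self-attention, convolution module, half-step FFN, final LayerNorm."""
    def __init__(self, dim, heads=4, kernel_size=31, dropout=0.1):
        super().__init__()
        self.ff1 = FeedForward(dim, dropout=dropout)
        self.attn_norm = nn.LayerNorm(dim)
        # NOTE: the paper uses relative-positional multi-head self-attention;
        # standard nn.MultiheadAttention is used here for brevity.
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.conv = ConvModule(dim, kernel_size, dropout)
        self.ff2 = FeedForward(dim, dropout=dropout)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)              # half-step feed-forward residual
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)                       # conv module applies its own residual
        x = x + 0.5 * self.ff2(x)              # second half-step feed-forward residual
        return self.final_norm(x)


# Example usage: a batch of 2 utterances, 100 frames, 144-dim encoder features
x = torch.randn(2, 100, 144)
block = ConformerBlock(dim=144)
print(block(x).shape)                          # torch.Size([2, 100, 144])
```

In practice, a stack of such blocks would sit on top of a convolutional subsampling front end, with the half-step residual weighting on the two feed-forward modules following the "macaron-style" sandwich structure described in the paper.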

Research Areas