Google Research

Conformer: Convolution-augmented Transformer for Speech Recognition

(2020) (to appear)


Recently, end-to-end Transformer and convolutional neural network (CNN) models have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). In this work, we study how to combine convolutions and Transformers to model both the global interactions and the local patterns of an audio sequence in a parameter-efficient way. We propose a convolution-augmented Transformer for speech recognition, named Conformer. Conformer achieves state-of-the-art accuracy while being parameter-efficient, outperforming all previous models in ASR. On the widely used LibriSpeech benchmark, our model achieves a word error rate (WER) of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on the test-clean/test-other sets. Our small model, with 10M parameters, achieves 2.7%/6.3%.
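For readers unfamiliar with the metric, the results above are reported as word error rate: the word-level edit distance between the reference transcript and the recognizer output (substitutions, deletions, and insertions), divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and this is not the scoring tool used in the paper, which the abstract does not specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no text normalization), which real scoring pipelines usually add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the example call, two errors over six reference words give a WER of 2/6, or about 33.3%. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.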
