Convolutional Transformer for Neural Speech Coding

Hong-Goo Kang; Jan Skoglund; Bastiaan Kleijn; Michael Chinen

Audio Engineering Society Convention 155 (2023)

Abstract

In this paper, we propose a Convolutional-Transformer speech codec (ConvT-SC) that uses stacks of convolution and self-attention layers to remove redundant information in the downsampling and upsampling blocks of a U-Net-style encoder-decoder neural codec architecture.
We design the Transformers to use channel and temporal attention with any number of attention stages and heads while maintaining causality.
This design lets the model account for the characteristics of the input vectors and flexibly exploit temporal and channel-wise relationships at different scales when encoding the salient information in speech.
As a result, the model can reduce the dimensionality of its latent embeddings and improve its quantization efficiency while maintaining quality.
Experimental results demonstrate that our approach achieves significantly better performance than convolution-only baselines.
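
To make the causality constraint on temporal attention concrete, here is a minimal sketch, not the ConvT-SC implementation. It assumes a single attention head, identity query/key/value projections, and a hypothetical function name `causal_temporal_attention`; none of these details come from the paper. Each output frame attends only to the current and earlier frames, so the result at time t never depends on future input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_temporal_attention(frames):
    """Single-head self-attention over time with a causal restriction.

    `frames` is a list of equal-length feature vectors, one per time step.
    Assumption for brevity: queries, keys, and values are the frames
    themselves (identity projections), unlike a trained Transformer layer.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        # Causality: only frames 0..t are visible when computing output t.
        past = frames[: t + 1]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in past]
        weights = softmax(scores)
        # Weighted sum of the visible frames, computed channel by channel.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, past)) for c in range(dim)]
        )
    return outputs


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_temporal_attention(frames)
# Replacing the last frame leaves the earlier outputs untouched.
changed = causal_temporal_attention(frames[:2] + [[5.0, -3.0]])
```

Because the loop restricts each query to `frames[: t + 1]`, the first two rows of `out` and `changed` are identical even though the third input frame differs. Attention across channels within a single frame would be causal by construction, since it never looks across time.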