Ultra Low-Bitrate Speech Coding with Pretrained Transformers

Ali Siahkoohi
Bastiaan Kleijn
Michael Chinen
Tom Denton
Interspeech 2022

Abstract

Speech coding facilitates the transmission of speech over low-bandwidth
networks with minimal distortion. Neural-network-based speech codecs
have recently demonstrated significant improvements in performance over
traditional approaches. While this new generation of codecs is capable
of synthesizing high-fidelity speech, their use of recurrent or
convolutional layers often restricts their effective receptive fields,
which prevents them from compressing speech efficiently. We propose to
further reduce the bitrate of neural speech codecs through the use of
pretrained Transformers, capable of exploiting long-range dependencies
in the input signal due to their inductive bias. Our numerical
experiments show that supplementing the encoder of a neural speech codec
with Transformer speech embeddings yields a speech codec with a bitrate
of $600\,\mathrm{bps}$ that outperforms the original neural speech codec
in synthesized speech quality when trained at the same bitrate. The
subjective human evaluations also suggest that the perceived quality of
the resulting codec is comparable to or better than that of conventional
codecs operating at 3--4 times the rate.
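
To put the quoted figures in perspective, the short sketch below works through the arithmetic implied by the abstract: the bitrate range that "3--4 times the rate" corresponds to, and the payload size of one minute of speech at $600\,\mathrm{bps}$. It is only an illustration; the function and variable names are hypothetical, and the one-minute duration is an assumed example rather than a setting from the paper.

```python
# Illustrative arithmetic only; names and the 60 s duration are assumptions,
# not values or code from the paper.

PROPOSED_BPS = 600  # bitrate of the proposed codec, as stated in the abstract


def payload_bytes(bitrate_bps: float, seconds: float) -> float:
    """Return the number of bytes needed to carry `seconds` of speech."""
    return bitrate_bps * seconds / 8


# "3--4 times the rate" of the proposed codec, in bits per second.
conventional_bps = [PROPOSED_BPS * factor for factor in (3, 4)]

# Payload for one minute of speech at the proposed bitrate.
bytes_per_minute = payload_bytes(PROPOSED_BPS, 60)

print(conventional_bps)
print(bytes_per_minute)
```

Under these assumptions, the conventional codecs in the comparison would operate at roughly 1800 to 2400 bps, while one minute of speech at 600 bps occupies 4500 bytes before any transport overhead.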