REPLACING HUMAN-RECORDED AUDIO WITH SYNTHETIC AUDIO FOR ON-DEVICE UNSPOKEN PUNCTUATION PREDICTION

Balint Miklos; Bogdan Prisacari; Daniel Valcarce; Daria Soboleva; Felix Weissenberger; Julia Proskurnia; Justin Lu; Márius Šajgalík; Ondrej Skopek; Rohit Prabhavalkar; Victor Carbune

ICASSP 2021: International Conference on Acoustics, Speech and Signal Processing (2021) (to appear)

Abstract

We present a novel multi-modal unspoken punctuation prediction system for the English language, which relies on Quasi-Recurrent Neural Networks (QRNNs) applied jointly to the text output of automatic speech recognition and to acoustic features. We show significant improvements from adding acoustic features compared to the text-only baseline. Because annotated acoustic data is hard to obtain, we demonstrate that relying on only 20% of the human-annotated audio and replacing the rest with synthetic text-to-speech (TTS) predictions causes no quality loss on the LibriTTS corpus. Furthermore, we demonstrate that, through data augmentation using TTS models, we can remove human-recorded audio completely and still outperform models trained on it.
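
The 20% human / 80% synthetic mix mentioned in the abstract could be set up roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the function name `mix_human_and_tts`, the `seed` parameter, and the random choice of which utterances keep their human audio are all assumptions, since the abstract does not say how the subset was selected.

```python
import random


def mix_human_and_tts(utterance_ids, human_fraction=0.2, seed=0):
    """Label each utterance with the audio source to train on.

    Hypothetical helper: a fraction of the utterances keeps its human
    audio and the remainder is marked for TTS-synthesized audio.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")

    # Sort first so the selection does not depend on input order.
    shuffled = sorted(utterance_ids)
    random.Random(seed).shuffle(shuffled)

    # Keep human audio for the first `n_human` shuffled utterances.
    n_human = round(len(shuffled) * human_fraction)
    human_ids = set(shuffled[:n_human])

    return {
        uid: ("human" if uid in human_ids else "tts")
        for uid in utterance_ids
    }


# Example: ten placeholder utterance IDs, 20% of which keep human audio.
example_ids = [f"utt{i}" for i in range(10)]
example_sources = mix_human_and_tts(example_ids, human_fraction=0.2)
```

Setting `human_fraction=0.0` in this sketch corresponds to the fully synthetic setting in the last sentence of the abstract, where no human-recorded audio is used.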
