Ankur Bapna
I am a Staff Software Engineer on the Brain team. My current research interests include multimodal representation learning for speech and text, massively multilingual modeling and applications of these approaches to translation, ASR, TTS and tasks involving end-to-end speech understanding and generation.
Authored Publications
Sort By
Multimodal Modeling for Spoken Language Identification
Shikhar Bharadwaj
Sriram (Sri) Ganapathy
Sid Dalmia
Wei Han
Yu Zhang
Proceedings of 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024) (2024)
Preview abstract
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.
View details
Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech Representation and Linguistic Features
Yifan Ding
Kohei Yatabe
Nobuyuki Morioka
Yu Zhang
Wei Han
WASPAA 2023 (2023) (to appear)
Preview abstract
Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the web to studio-quality. To make our SR model robust against various degradation, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) linguistic features extracted from transcripts and PnG-BERT for conditioning features. Experiments show that the proposed model (i) is robust against various audio degradation, (ii) can restore samples in the LJspeech dataset and improves the quality of text-to-speech (TTS) outputs without changing the model and hyper-parameters, and (iii) enable us to train a high-quality TTS model from restored speech samples collected from the web.
View details
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech
Takaaki Saeki
Zhehuai Chen
Nobuyuki Morioka
Yu Zhang
ICASSP (2023)
Preview abstract
This paper proposes Virtuoso, a massive multilingual speech–text joint learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech–text paired data in low-resource languages. This study extends Maestro, which is a speech–text semi-supervised joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline TTS models in seen languages, and 2) these models can synthesize reasonably good speech for unseen languages where no paired TTS data is available.
View details
Preview abstract
We present Mu2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on un-labeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition(ASR), Automatic Speech Translation (AST)and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu2SLAM trains ona sequence-to-sequence masked denoising objective similar to T5 on both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoSTAST, Mu2SLAM establishes a new state-of-the-art for models trained on public datasets, improv-ing on xx-en translation over the previous best by 1.9 Bleu points and on en-xx translation by 0.9 Bleu points. On Voxpopuli ASR, our model matches the performance of a mSLAM model finetuned with a RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.
View details
LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus
Yifan Ding
Kohei Yatabe
Nobuyuki Morioka
Yu Zhang
Wei Han
Interspeech 2023 (2023)
Preview abstract
This paper introduces a new speech dataset called ``LibriTTS-R'' designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the LibriTTS-R ground-truth samples showed significantly improved sound quality compared to those in LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from [URL-HERE]
View details
Label Aware Speech Representation Learning For Language Identification
Shikhar Bharadwaj
Sriram Ganapathy
Wei Han
Proceedings of Interspeech 2023, pp. 5351-5355
Preview abstract
The speech representation learning approaches, for nonsemantic tasks like language recognition, have either explored supervised embedding extraction methods using a classifier model or the self-supervised representation learning approach using raw data. In this paper, we propose a novel framework of combining the self-supervised representation learning with the language label information for the pre-training task. This framework, termed as label aware speech representation learning (LASR), uses a triplet based objective function to incorporate the language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the identification task. The language recognition experiments are performed on two public datasets - FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over the state-of-art systems in terms of recognition performance. We also report an analysis of the robustness of the LASR approach to noisy/missing labels as well as the application of the LASR model for downstream multi-lingual speech recognition tasks.
View details
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau
Simran Khanuja
Yu Zhang
Siddharth Dalmia
Clara Rivera
IEEE Spoken Language Technology Workshop (SLT) (2022)
Preview abstract
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.
View details
Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR
Zhehuai Chen
Yu Zhang
Pedro Moreno Mengibar
Nanxin Chen
IEEE SLT (2022)
Preview abstract
Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in~\cite{zhehuai2021} can be leveraged to train a massively multilingual ASR model without any transcribed speech. In most zero resource conditions, lack of transcribed speech also implies lack of lexicons. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero transcribed speech, real-world setting to expand the set of languages covered by ASR models with only unlabeled speech and text in the target languages. We define the task to cover $102$ languages, where transcribed speech is available in $52$ of these languages and can be used to improve end-to-end ASR quality on the remaining $50$. First, we show that by combining speech representations with byte-level text representations coupled with the effective use of language embeddings, we can dramatically reduce the resource requirements for deploying an ASR model to a new language. On the FLEURS dataset, this approach is able to reduce the CER on languages with no transcribed speech from 64.1\% to 29.6\%, a relative reduction of 54\%. Second, using a subset of Indic languages we show that the proposed method can learn effectively from languages with transcribed speech even when there is limited to no graphemeic overlap with the target languages, reducing the average CER of the target languages from 56.3 to 17.2. We believe this is the first demonstration that competitive ASR performance can be achieved for an unseen language using no language resources other than text and untranscribed speech.
View details
Joint Unsupervised and Supervised Training for Multilingual ASR
Yu Zhang
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2022), pp. 6402-6406
Preview abstract
Self-supervised training has been showing promising gains in pretraining models and facilitating the downstream finetuning for speech recognition. Effective self-supervised losses designed for large-scale unlabeled data can help learn the useful latent structures. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised finetuning resumes in the second stage. However, the pretrained checkpoint selection is known to be tricky and tedious, and pure finetuning can cause catastrophic forgetting of the learnt representations. To address these concerns, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling (MLM) losses. We apply our method to a challenging multilingual automatic speech recognition (ASR) task and validate its performance on the public dataset \textit{Multilingual LibriSpeech} (MLS), which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST can consistently outperform other existing state-of-the-art (SOTA) methods by 10\%, and beat the monolingual baseline by a significant margin, demonstrating JUST's capability of handling low-resource languages in multilingual ASR. Our average WER of all languages outperforms monolingual baselines by 33.3\%, and the state-of-the-art 2-stage XLSR by 32\%. On low-resource language like Polish, our WER is less than half of the monolingual WER baseline and even beats the supervised transfer learning method using external supervision.
View details
XTREME-S: Evaluating Cross-lingual Speech Representations
Clara E. Rivera
Mihir Sanjay Kale
Sebastian Ruder
Simran Khanuja
Ye Jia
Yu Zhang
Proc. Interspeech 2022
Preview abstract
We introduce \xtremes, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, retrieval and speech-to-text translation. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in ``universal'' speech representation learning. This paper describes the new benchmark and establishes the first speech-only and speech-text baselines using XLS-R and mSLAM on all downstream tasks. We motivate the design choices and detail how to use the benchmark. The code and pre-processing scripts will be made publicly available.\footnote{\small\url{https://huggingface.co/datasets/google/xtreme_s}}
View details