 
                Min Ma
            My work focuses on research and development in automatic speech recognition, large language modeling, and multimodal multilingual modeling.
          
        
Authored Publications
Multimodal Modeling for Spoken Language Identification
Shikhar Bharadwaj, Ankur Bapna, Sriram (Sri) Ganapathy, Vera Axelrod, Sid Dalmia, Wei Han, Yu Zhang, Sandy Ritchie
Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)
          
          
        
        
        
          
          
          
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however, in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.
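The late-fusion idea described above, combining an utterance-level speech embedding with embeddings of textual metadata such as the title, description and location, can be sketched roughly as follows. This is an illustrative sketch only; the encoder dimensions, number of metadata fields and classifier shape are assumptions, not the MuSeLI architecture.

```python
import torch
import torch.nn as nn

class LateFusionLangID(nn.Module):
    """Illustrative late-fusion language-ID head (not the MuSeLI architecture).

    Assumes precomputed utterance-level speech embeddings and pooled text
    embeddings for metadata fields such as title, description and location.
    """

    def __init__(self, speech_dim=1024, text_dim=768,
                 num_metadata_fields=3, num_languages=100):
        super().__init__()
        fused_dim = speech_dim + num_metadata_fields * text_dim
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_languages),
        )

    def forward(self, speech_emb, metadata_embs):
        # speech_emb: (batch, speech_dim)
        # metadata_embs: list of (batch, text_dim) tensors, one per field
        fused = torch.cat([speech_emb, *metadata_embs], dim=-1)
        return self.classifier(fused)  # (batch, num_languages) logits
```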
              
  
          
        
      
    
        
          
            
Label Aware Speech Representation Learning For Language Identification
Ankur Bapna, Shikhar Bharadwaj, Sriram Ganapathy, Vera Axelrod, Wei Han
Proceedings of Interspeech 2023, pp. 5351-5355
          
          
        
        
        
          
          
          
Speech representation learning approaches for non-semantic tasks such as language recognition have either explored supervised embedding extraction using a classifier model or self-supervised representation learning on raw data. In this paper, we propose a novel framework that combines self-supervised representation learning with language label information during pre-training. This framework, termed label-aware speech representation learning (LASR), uses a triplet-based objective function to incorporate the language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the identification task. The language recognition experiments are performed on two public datasets, FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over state-of-the-art systems in terms of recognition performance. We also report an analysis of the robustness of the LASR approach to noisy/missing labels, as well as the application of the LASR model to downstream multilingual speech recognition tasks.
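A minimal sketch of the kind of objective described here, an SSL loss combined with a label-driven triplet term, is given below. The batch-hard mining heuristic, margin and loss weighting are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def lasr_style_loss(embeddings, language_labels, ssl_loss, margin=0.5, weight=1.0):
    """Illustrative combination of an SSL loss with a label-aware triplet term.

    embeddings: (batch, dim) utterance-level representations
    language_labels: (batch,) integer language IDs
    ssl_loss: scalar tensor from the underlying self-supervised objective
    """
    # Pairwise distances between normalized embeddings.
    emb = F.normalize(embeddings, dim=-1)
    dists = torch.cdist(emb, emb)

    same = language_labels.unsqueeze(0) == language_labels.unsqueeze(1)
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)

    # Hardest positive (same language) and hardest negative (other language)
    # per anchor; a simple batch-hard mining heuristic.
    pos = dists.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    neg = dists.masked_fill(same, float('inf')).min(dim=1).values

    triplet = F.relu(pos - neg + margin).mean()
    return ssl_loss + weight * triplet
```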
              
  
          
        
      
    
        
          
            
MASR: Multi-Label Aware Speech Representation
Anjali Raj, Shikhar Bharadwaj, Sriram Ganapathy
2023 Workshop on Automatic Speech Recognition and Understanding (ASRU)
          
          
        
        
        
          
          
          
In recent years, speech representation learning has been constructed primarily as a self-supervised learning (SSL) task, using the raw audio signal alone, while ignoring the side information that is often available for a given speech recording. Existing techniques constrain the incorporation of side information to a specific category of metadata, thereby imposing limitations, and they use such information inefficiently. In this paper, we propose MASR, a Multi-label Aware Speech Representation learning framework, which addresses these limitations. MASR enables the inclusion of external knowledge sources to enhance the utilization of metadata information. Using MASR representations, we perform evaluation on several downstream tasks such as language identification and speech recognition. In these experiments, we illustrate significant performance improvements for MASR over other established benchmarks. A key advantage of MASR is that it can be combined with any choice of SSL method. We perform a detailed analysis on the language identification task which illustrates how the proposed loss function enables the representations to separate closely related languages. We also investigate the application of the proposed approach to other non-semantic tasks such as speaker and emotion recognition.
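One way to picture the multi-label idea is to build a pairwise similarity target from several metadata fields at once, which a contrastive-style loss could then try to match. The sketch below is purely illustrative; the choice of label fields, their weights and how the similarity matrix is consumed are assumptions, not the MASR loss.

```python
import torch

def multilabel_similarity(label_matrix, weights=None):
    """Illustrative pairwise similarity from several metadata labels.

    label_matrix: (batch, num_label_types) integer IDs, e.g. columns for
    language and speaker; the choice of fields and weights is hypothetical.
    Returns a (batch, batch) matrix of weighted label agreements, which
    could serve as the target for a contrastive-style loss.
    """
    batch, num_types = label_matrix.shape
    if weights is None:
        weights = torch.ones(num_types)
    sim = torch.zeros(batch, batch)
    for t in range(num_types):
        col = label_matrix[:, t]
        sim += weights[t] * (col.unsqueeze(0) == col.unsqueeze(1)).float()
    return sim / weights.sum()  # normalized to [0, 1]
```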
              
  
          
        
      
    
        
          
            
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Sebastian Ruder, Mihir Sanjay Kale, Shruti Rijhwani, Jean-Michel Sarr, Cindy Wang, John Wieting, Christo Kirov, Dana L. Dickinson, Bidisha Samanta, Connie Tao, David Adelani, Vera Axelrod, Reeve Ingle, Dmitry Panteleev
Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 1856-1884
          
          
        
        
        
          
          
          
              Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) — languages for which NLP research is particularly far behind in meeting user needs — it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks —  tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
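Results on a benchmark spanning this many languages and tasks are typically reported as per-task averages over languages; a small, hypothetical aggregation helper is sketched below. The scoring scheme shown is an assumption, not the official XTREME-UP protocol.

```python
from collections import defaultdict

def macro_average_by_task(results):
    """results: iterable of (task, language, score) tuples.

    Returns {task: mean score over languages}. Purely illustrative; the
    official XTREME-UP scoring may weight tasks and languages differently.
    """
    per_task = defaultdict(list)
    for task, _language, score in results:
        per_task[task].append(score)
    return {task: sum(scores) / len(scores) for task, scores in per_task.items()}

# Example (made-up numbers):
# macro_average_by_task([("asr", "am", 32.1), ("asr", "yo", 28.4)])
```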
              
  
          
        
      
    
        
          
            
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Clara Rivera, Ankur Bapna
IEEE Spoken Language Technology Workshop (SLT) (2022)
          
          
        
        
        
          
          
          
              We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.
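For the ASR task mentioned above, per-language Word Error Rate is the usual headline metric. A minimal scoring sketch follows, assuming hypothesis/reference pairs are already available; the jiwer package is just one convenient WER implementation, not something the paper prescribes.

```python
import jiwer  # pip install jiwer; any standard WER implementation would do

def asr_wer_per_language(examples):
    """examples: iterable of (language, reference_text, hypothesis_text)."""
    refs, hyps = {}, {}
    for lang, ref, hyp in examples:
        refs.setdefault(lang, []).append(ref)
        hyps.setdefault(lang, []).append(hyp)
    # jiwer.wer accepts lists of reference and hypothesis strings.
    return {lang: jiwer.wer(refs[lang], hyps[lang]) for lang in refs}
```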
              
  
          
        
      
    
        
          
            
XTREME-S: Evaluating Cross-lingual Speech Representations
Ankur Bapna, Clara E. Rivera, Mihir Sanjay Kale, Sandy Ritchie, Sebastian Ruder, Simran Khanuja, Ye Jia, Yu Zhang
Proceedings of Interspeech 2022
          
          
        
        
        
          
          
          
We introduce XTREME-S, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, retrieval and speech-to-text translation. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in "universal" speech representation learning. This paper describes the new benchmark and establishes the first speech-only and speech-text baselines using XLS-R and mSLAM on all downstream tasks. We motivate the design choices and detail how to use the benchmark. The code and pre-processing scripts will be made publicly available (https://huggingface.co/datasets/google/xtreme_s).
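The Hugging Face dataset linked above can be loaded with the datasets library; a minimal sketch follows. The configuration name used here is an assumed example of one XTREME-S subset, and the exact column names may differ by subset.

```python
from datasets import load_dataset

# Dataset repo taken from the URL in the abstract; the config name is an
# assumed example of one XTREME-S subset (FLEURS, Afrikaans).
xtreme_s = load_dataset("google/xtreme_s", "fleurs.af_za", split="test")

for example in xtreme_s.select(range(2)):
    # Typical fields include the audio array and its transcription;
    # inspect the keys rather than assuming a fixed schema.
    print(example.keys())
```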
              
  
          
        
      
    
        
          
            
Improving Streaming ASR with Non-streaming Model Distillation on Unsupervised Data
Chung-Cheng Chiu, Liangliang Cao, Ruoming Pang, Thibault Doutre, Wei Han, Yu Zhang, Zhiyun Lu
ICASSP 2021
          
          
        
        
        
          
          
          
Streaming end-to-end Automatic Speech Recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, unlike their non-streaming counterparts, and they almost always perform worse than non-streaming models.
We propose a novel and effective learning method that leverages a non-streaming ASR model as a teacher, generating transcripts on an arbitrarily large data set, to better distill knowledge into streaming ASR models. This way, we are able to scale the training of streaming models to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the Word Error Rate (WER) of RNN-T models in four languages trained from YouTube data.
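The recipe described above, a non-streaming teacher transcribing unlabeled audio to produce targets for a streaming student, can be sketched as a simple pseudo-labeling loop. The model interfaces below are hypothetical placeholders, not the actual training setup.

```python
def distill_from_nonstreaming_teacher(teacher, student, unlabeled_audio_batches,
                                      optimizer):
    """Illustrative pseudo-labeling loop (interfaces are hypothetical).

    teacher: non-streaming ASR model exposing a transcribe(audio) method
    student: streaming ASR model returning a training loss for
             (audio, transcript) pairs, e.g. an RNN-T loss
    """
    for audio_batch in unlabeled_audio_batches:
        # 1. Teacher generates transcripts on unlabeled audio (no gradients).
        pseudo_transcripts = [teacher.transcribe(audio) for audio in audio_batch]

        # 2. Student is trained on the pseudo-labeled pairs.
        loss = student.loss(audio_batch, pseudo_transcripts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```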
              
  
          
        
      
    
        
          
            
Transliteration based approaches to improve code-switched speech recognition performance
Jesse Emond, Bhuvana Ramabhadran, Pedro Moreno
IEEE Spoken Language Technology Workshop (SLT) (2018), pp. 448-455
          
          
        
        
        
          
          
          
Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automatic Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER), that smooths out many of these irregularities by mapping all text to one writing system, and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic, with significant gains in ASR performance of up to 10% relative over the state-of-the-art baseline.
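The toWER idea, mapping both the reference and the hypothesis into a single writing system before scoring, can be sketched as follows. The transliterate callable is a hypothetical placeholder for whatever transliteration model is used, and jiwer stands in for any standard WER implementation.

```python
import jiwer  # pip install jiwer; any standard WER implementation would do

def transliteration_optimized_wer(references, hypotheses, transliterate):
    """Illustrative toWER-style scoring (not the paper's exact pipeline).

    references, hypotheses: lists of strings, possibly in mixed scripts
    transliterate: callable mapping a string into one common writing system,
                   e.g. a hypothetical transliterate_to_latin(text) function
    """
    normalized_refs = [transliterate(ref) for ref in references]
    normalized_hyps = [transliterate(hyp) for hyp in hypotheses]
    return jiwer.wer(normalized_refs, normalized_hyps)
```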
              
  
          
        
      
    