Ask2Mask: Guided Data Selection for Masked Speech Modeling

Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran; Pedro Jose Moreno Mengibar; Yu Zhang

IEEE Journal of Selected Topics in Signal Processing (2022)

Abstract

Masked speech modeling (MSM) pre-training methods such as wav2vec2 or w2v-BERT randomly mask speech frames in an utterance and compute losses on the masked instances. While these methods improve the performance of Automatic Speech Recognition (ASR) systems, they have one major limitation: they generally perform best under matched conditions, i.e., when the data used for pre-training is matched to the data used for fine-tuning. Using out-of-domain (OOD) pre-training data with limited in-domain fine-tuning data from the target domain results in reduced gains. The relative value of in-domain data within an MSM pre-training corpus has not been well explored in the literature. In this work, we address precisely this limitation. We propose ask2mask (ATM), a novel approach that focuses on samples relevant to the target domain (in-domain) during pre-training with OOD or any available data. To perform this fine-grained data selection, ATM applies masking only to input frames with high confidence scores obtained from an external classification model. This allows the model to learn meaningful in-domain representations while discarding low-confidence frames, which could lead to learning erroneous representations. The ATM approach is further extended to focus on utterances with high confidence by scaling the final MSM loss computed for each masked input frame with the utterance-level confidence score. We conduct experiments on two well-benchmarked corpora: a read speech corpus (Librispeech) and a conversational speech corpus (AMI). The results substantiate the efficacy of ATM in significantly improving target-domain performance under mismatched conditions while still yielding modest improvements under matched conditions.
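
The two ideas in the abstract, masking only high-confidence frames and scaling the masked-frame loss by an utterance-level confidence, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `mask_fraction` parameter, the choice of the mean frame confidence as the utterance-level score, and the averaging of per-frame losses are all assumptions made for illustration. Span masking, as used by methods such as wav2vec 2.0, is omitted for brevity.

```python
import numpy as np


def select_mask_indices(confidence, mask_fraction=0.4):
    """Pick the frames to mask: the top `mask_fraction` by confidence.

    `confidence` holds one score per frame from an external classifier.
    `mask_fraction` is a hypothetical parameter, not taken from the paper.
    """
    confidence = np.asarray(confidence, dtype=float)
    num_masked = max(1, int(round(mask_fraction * confidence.size)))
    # Sort by descending confidence and keep the highest-scoring frames.
    top = np.argsort(-confidence, kind="stable")[:num_masked]
    return np.sort(top)


def utterance_confidence(confidence):
    """Assumed utterance-level score: the mean of the frame confidences."""
    return float(np.mean(np.asarray(confidence, dtype=float)))


def scaled_msm_loss(frame_losses, confidence, mask_fraction=0.4):
    """Average the loss over masked frames, then scale by utterance confidence.

    `frame_losses` stands in for the per-frame MSM losses a real model
    would produce; only the masked frames contribute to the result.
    """
    frame_losses = np.asarray(frame_losses, dtype=float)
    masked = select_mask_indices(confidence, mask_fraction)
    return utterance_confidence(confidence) * float(frame_losses[masked].mean())


if __name__ == "__main__":
    confidence = [0.9, 0.2, 0.8, 0.1, 0.7]
    frame_losses = [1.0, 5.0, 3.0, 5.0, 5.0]
    print(select_mask_indices(confidence))
    print(scaled_msm_loss(frame_losses, confidence))
```

Under these assumptions, low-confidence frames never enter the loss, and an utterance whose frames are mostly low-confidence contributes a smaller loss than one the classifier is confident about.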
