Ask2Mask: Guided Data Selection for Masked Speech Modeling
Abstract
Masked speech modeling (MSM) pre-training methods such as wav2vec2 or w2v-BERT randomly mask speech frames in an utterance and compute losses on the masked instances. While these methods improve the performance of Automatic Speech Recognition (ASR) systems, they have one major limitation: they generally perform best under matched conditions, i.e., when the data used for pre-training matches the data used for fine-tuning. Using out-of-domain (OOD) pre-training data with limited in-domain fine-tuning data from the target domain results in reduced gains, and the relative value of in-domain data within an MSM pre-training corpus has not been well explored in the literature. In this work, we address precisely this limitation. We propose ask2mask (ATM), a novel approach that focuses pre-training on samples relevant to the target domain (in-domain) when pre-training with OOD or any available data. To perform this fine-grained data selection, ATM applies masking only to input frames with high confidence scores obtained from an external classification model. This allows the model to learn meaningful in-domain representations while discarding low-confidence frames that could lead to erroneous representations. We further extend ATM to focus on utterances with high confidence by scaling the final MSM loss computed for each masked input frame with the utterance-level confidence score. We conduct experiments on two well-benchmarked corpora: read speech (Librispeech) and conversational speech (AMI). The results substantiate the efficacy of ATM in significantly improving target-domain performance under mismatched conditions, while still yielding modest improvements under matched conditions.
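To make the mechanism concrete, the following is a minimal sketch of confidence-guided masking and utterance-level loss scaling. It assumes frame-level confidence scores are already available from an external scorer; the function names, the top-k selection rule, and the use of NumPy are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def atm_select_mask(frame_confidences, mask_ratio=0.3):
    # Choose which frames to mask, preferring high-confidence frames.
    # frame_confidences: array of shape (T,) with per-frame scores from
    # an external classification model (hypothetical interface).
    T = len(frame_confidences)
    n_mask = max(1, int(mask_ratio * T))
    # Rank frames by confidence and mask only the top-scoring ones, so
    # low-confidence frames never contribute masked targets to the loss.
    top_idx = np.argsort(frame_confidences)[-n_mask:]
    mask = np.zeros(T, dtype=bool)
    mask[top_idx] = True
    return mask

def atm_loss(per_frame_msm_loss, mask, utt_confidence=1.0):
    # Average the MSM loss over masked frames only, then scale it by the
    # utterance-level confidence (the extension described above).
    return utt_confidence * per_frame_msm_loss[mask].mean()

# Illustrative usage with random stand-ins for real model outputs:
conf = np.random.rand(120)                    # per-frame confidences
mask = atm_select_mask(conf, mask_ratio=0.4)  # frames chosen for masking
loss = atm_loss(np.random.rand(120), mask, utt_confidence=conf.mean())

The design choice to mask only high-confidence frames means the MSM loss is computed where the external scorer is most reliable, which is how the sketch realizes the in-domain data selection described in the abstract.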