Personalizing ASR for Dysarthric and Accented Speech with Limited Data


Abstract

Automatic speech recognition (ASR) systems have dramatically
improved over the last few years. However, ASR systems are most often trained on ‘typical’ speech, which means that underrepresented groups do not experience the same level of improvement.
In this paper, we present and evaluate fine-tuning techniques to
improve ASR for users with non-standard speech. We focus
on two types of non-standard speech: speech from people with
amyotrophic lateral sclerosis (ALS) and accented speech. We
train personalized models that achieve 62% and 35% relative
word error rate (WER) improvements on these two groups, bringing the absolute
WER for ALS speakers, on a test set of message bank phrases,
to 10% for mild dysarthria and 20% for more serious dysarthria.
We show that 76% of the improvement comes from only 5 minutes
of training data. Fine-tuning a particular subset of layers (with
many fewer parameters) often gives better results than fine-tuning the entire model. This is the first step towards building state-of-the-art ASR models for dysarthric speech.

Index Terms: speech recognition, personalization, accessibility
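
To make the layer-subset fine-tuning idea concrete, the sketch below shows one common way to freeze most of a pretrained network and update only a chosen subset of layers. This is an illustrative PyTorch example, not the paper's actual implementation: the toy model, the layer names, and the choice of which layers to unfreeze are all assumptions made for the example.

# Minimal sketch of layer-subset fine-tuning (illustrative only; the
# model and layer choices below are assumptions, not the paper's setup).
import torch
import torch.nn as nn

class ToyASREncoder(nn.Module):
    """Stand-in for a pretrained ASR encoder: a stack of LSTM layers
    followed by a linear projection to output tokens."""
    def __init__(self, feat_dim=80, hidden=256, n_layers=5, n_tokens=32):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(n_layers)
        )
        self.output = nn.Linear(hidden, n_tokens)

    def forward(self, x):
        for lstm in self.layers:
            x, _ = lstm(x)
        return self.output(x)

model = ToyASREncoder()

# 1) Freeze every parameter in the (nominally pretrained) model.
for p in model.parameters():
    p.requires_grad = False

# 2) Unfreeze only a chosen subset of layers. Which subset works best
#    is an empirical question; here we arbitrarily pick the top two
#    LSTM layers plus the output projection.
for layer in list(model.layers)[-2:]:
    for p in layer.parameters():
        p.requires_grad = True
for p in model.output.parameters():
    p.requires_grad = True

# 3) Hand the optimizer only the trainable parameters, so the few
#    minutes of personalization data update a small fraction of weights.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

n_total = sum(p.numel() for p in model.parameters())
n_train = sum(p.numel() for p in trainable)
print(f"fine-tuning {n_train}/{n_total} parameters "
      f"({100.0 * n_train / n_total:.1f}%)")

Restricting the optimizer to the unfrozen parameters keeps the per-speaker update small, which is consistent with the abstract's observation that fine-tuning a subset of layers with many fewer parameters can outperform fine-tuning the entire model.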