We developed dysarthric speech intelligibility classifiers on 551,176 disordered speech samples contributed by a diverse set of 468 speakers who had a range of self-reported speaking disorders and were rated for their overall intelligibility on a five-point scale. We trained three models following different deep learning approaches and evaluated them on ∼94K utterances from 100 speakers. We further found that the models generalize well, without further training, to the TORGO database (100% accuracy), UASpeech (0.93 correlation), and ALS-TDI PMP (0.81 AUC) datasets, as well as to a dataset of realistic unprompted speech that we gathered (106 dysarthric and 76 control speakers, ∼2300 samples).
This study investigates the performance of personalized automatic speech recognition (ASR) for recognizing disordered speech using small amounts of per-speaker adaptation data. We trained personalized models for 195 individuals with different types and severities of speech impairment, with training sets ranging in size from <1 minute to 18-20 minutes of speech data. Word error rate (WER) thresholds were selected to determine success rates (the percentage of personalized models reaching the target WER) in different application scenarios. For the home automation scenario, 79% of speakers reached the target WER with 18-20 minutes of speech, and even with only 3-4 minutes of speech, 63% of speakers reached the target WER. Further evaluation found similar improvement on test sets with out-of-domain, unprompted phrases. Our results demonstrate that with only a few minutes of recordings, individuals with disordered speech could benefit from personalized ASR.
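The success-rate measure described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function name `success_rate`, the per-speaker WER values, and the 0.15 target are all hypothetical, and the sketch assumes that "reaching the target" means a WER at or below the threshold.

```python
def success_rate(speaker_wers, target_wer):
    """Return the percentage of speakers whose WER is at or below target_wer.

    Assumption: "reaching the target WER" means WER <= target_wer.
    """
    if not speaker_wers:
        raise ValueError("speaker_wers must not be empty")
    reached = sum(1 for wer in speaker_wers if wer <= target_wer)
    return 100.0 * reached / len(speaker_wers)


# Hypothetical per-speaker WERs (fractions), one per personalized model.
example_wers = [0.05, 0.12, 0.20, 0.08, 0.31]

# Hypothetical target for one application scenario.
rate = success_rate(example_wers, target_wer=0.15)
print(f"{rate:.0f}% of speakers reached the target WER")
```

Under these assumptions, each application scenario would simply use its own `target_wer` with the same per-speaker WER list.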
Proc. 1st Workshop on Speech for Social Good (S4SG) (2022), pp. 26-30 (to appear)
Word Error Rate (WER) is the primary metric used to assess automatic speech recognition (ASR) model quality. It has been shown that ASR models tend to have much higher WER on speakers with speech impairments than on typical English speakers. It is hard to determine whether models can be useful at such high error rates. This study investigates the use of BERTScore, an evaluation metric for text generation, to provide a more informative measure of ASR model quality and usefulness. Both BERTScore and WER were compared to prediction errors manually annotated by Speech Language Pathologists for error type and assessment. BERTScore was found to correlate more strongly than WER with these human annotations of error type and assessment. In particular, BERTScore was more robust to orthographic changes (contraction and normalization errors) where meaning was preserved. Furthermore, BERTScore was a better fit of error assessment than WER, as measured using an ordinal logistic regression and the Akaike Information Criterion (AIC). Overall, our findings suggest that BERTScore can complement WER when assessing ASR model performance from a practical perspective, especially for accessibility applications where models are useful even at lower accuracy than for typical speech.
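To make the point about orthographic changes concrete, here is a minimal sketch of WER computed as word-level edit distance divided by the reference length. It is a generic textbook formulation, not the paper's evaluation code; the function name `word_error_rate` and the example sentences are hypothetical, and no text normalization is applied. It shows how a meaning-preserving contraction is still counted as two word errors.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Uses word-level Levenshtein distance on whitespace-split tokens.
    Assumption: no normalization (casing, contractions) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the contraction preserves meaning but costs two edits.
print(word_error_rate("i do not want coffee", "i don't want coffee"))
```

A semantic metric such as BERTScore would be expected to penalize this pair far less than the WER computed here, which is the kind of difference the abstract describes.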
Speech samples from over 1000 individuals with impaired speech have been submitted for Project Euphonia, aimed at improving automated speech recognition for atypical speech. We provide an update on the contents of the corpus, which recently passed 1 million utterances, and review key lessons learned from this project.
The reasoning behind decisions such as phrase set composition, prompted vs extemporaneous speech, metadata, and data quality efforts is explained based on findings from both technical and user-facing research.
Objective. This study aimed to (1) evaluate the performance of personalized Automatic Speech Recognition (ASR) models on disordered speech samples representing a wide range of etiologies and speech severities, and (2) compare the accuracy of these models to that of speaker-independent ASR models developed on and for typical speech, and to that of expert human listeners. Methods. 432 individuals with self-reported disordered speech recorded at least 300 short phrases using a web-based application. Word error rates (WER) were computed using three different ASR models and expert human transcribers. Metadata were collected to evaluate the potential impact of participant, atypical speech, and technical factors on recognition accuracy. Results. The accuracy of personalized models for recognizing disordered speech was high (WER: 4.6%), and significantly better than speaker-independent models (WER: 31%). Personalized models also outperformed human transcribers (WER gain: 9%) with relative gains in accuracy as high as 80%. The most significant gain in recognition performance was for the most severely affected speakers. Low SNR and fewer training utterances adversely affected recognition even for speakers with mild speech impairments. Conclusions. Personalized ASR models have significant potential for improving communication for persons with impaired speech.
Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of a speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases. We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases, which were rated by speech-language pathologists for their overall intelligibility using a five-point Likert scale. We then evaluated classifiers developed using three approaches: (1) a convolutional neural network (CNN) trained for the task, (2) classifiers trained on non-semantic speech representations from CNNs that used an unsupervised objective, and (3) classifiers trained on the acoustic (encoder) embeddings from an ASR system trained on typical speech. We find that the ASR encoder's embeddings considerably outperform the other two on detecting and classifying disordered speech. Further analysis shows that the ASR embeddings cluster speech by the spoken phrase, while the non-semantic embeddings cluster speech by speaker. Also, longer phrases are more indicative of intelligibility deficits than single words.