Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia

Bob MacDonald
Rus Heywood
Richard Cave
Katie Seaver
Marilyn Ladewig
Jordan R. Green
Interspeech (2021) (to appear)

Abstract

Speech samples from over 1000 individuals with impaired speech have been submitted to Project Euphonia, which aims to improve automated speech recognition for atypical speech. We provide an update on the contents of the corpus, which recently passed 1 million utterances, and review key lessons learned from this project.
The reasoning behind decisions such as phrase set composition, prompted vs. extemporaneous speech, metadata, and data quality efforts is explained based on findings from both technical and user-facing research.