The existing LibriSpeech dataset contains audio recordings of speakers reading snippets from Project Gutenberg e-books. UserLibri reorganizes LibriSpeech into 107 individual "user" datasets. Each user has, on average, 50 personalized audio examples with paired transcripts, plus additional personalized text-only data from the same domain (an average of 7,000 sentences per user).
The LibriSpeech audio is grouped logically into users so that each user contains audio from the same speaker reading from the same book, ensuring a consistent domain across examples, with similar word choice and style. The parts of each book that lack audio recordings are processed into sentences and used as additional text-only data for the corresponding user. The UserLibri data is available as TensorFlow Datasets for both the audio and the text of each user.
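The grouping described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the dataset's actual build code: it assumes utterance records carry standard LibriSpeech-style IDs of the form `speaker-book-utterance`, and the `utterances` list, `group_by_user` function, and example IDs are all hypothetical.

```python
from collections import defaultdict

# Hypothetical utterance records. LibriSpeech-style IDs encode the speaker
# and the book/chapter being read, e.g. "1995-1836-0001" means speaker 1995
# reading from book 1836, utterance 0001. (Illustrative IDs, not real data.)
utterances = [
    {"id": "1995-1836-0001", "transcript": "..."},
    {"id": "1995-1836-0002", "transcript": "..."},
    {"id": "2300-131-0001", "transcript": "..."},
]

def group_by_user(utterances):
    """Group utterances into per-user datasets keyed by (speaker, book).

    Every example in a resulting group shares both speaker and book, which
    is what keeps the domain, word choice, and style consistent per user.
    """
    users = defaultdict(list)
    for utt in utterances:
        speaker, book, _ = utt["id"].split("-")
        users[f"{speaker}-{book}"].append(utt)
    return dict(users)

users = group_by_user(utterances)
# Two distinct (speaker, book) pairs above yield two user datasets.
```

The key design point is that the user key combines speaker and book: the same speaker reading two different books would form two separate users.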
Initial speech personalization experiments train a personalized language model for each user and combine it with a global speech model via shallow fusion, demonstrating a performance improvement at the per-user level. See our Interspeech 2022 paper for more details.
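Shallow fusion combines the two models at decoding time by interpolating their log-probabilities when scoring hypotheses. The sketch below shows the core scoring rule only; the function names, the toy hypotheses, and the interpolation weight are illustrative assumptions, not values from the paper.

```python
def shallow_fusion_score(asr_logprob, lm_logprob, lm_weight=0.3):
    """Fused score for one hypothesis under shallow fusion:

        score(y) = log P_asr(y | x) + lambda * log P_lm(y)

    where P_asr comes from the global speech model and P_lm from the
    user's personalized language model. lm_weight (lambda) is a tunable
    hyperparameter; 0.3 here is an arbitrary illustrative choice.
    """
    return asr_logprob + lm_weight * lm_logprob

def pick_best(hypotheses, lm_weight=0.3):
    """Rerank candidate transcripts by their fused score."""
    return max(
        hypotheses,
        key=lambda h: shallow_fusion_score(h["asr"], h["lm"], lm_weight),
    )

# Toy example: the personalized LM, trained on the user's book, prefers
# in-domain wording and can flip the decision toward the correct transcript.
hyps = [
    {"text": "read the book", "asr": -1.2, "lm": -0.5},
    {"text": "red the book", "asr": -1.1, "lm": -3.0},
]
best = pick_best(hyps)  # "read the book" wins: -1.35 vs -2.0
```

In a real decoder this fusion happens per step inside beam search rather than as a final rerank, but the scoring rule is the same.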