Google Research

MD3: Multi-dialect dataset of dialogues


Natural language processing (NLP) systems are often described by which languages they serve: English, Japanese, Arabic, etc. However, languages are composed of distinct dialects, which sometimes differ significantly from each other. Benchmark datasets generally lack dialect information and often focus on just a single dialect of a language. This makes it impossible to determine whether existing NLP systems perform well across dialects, raising concerns for speakers of "non-standard" dialects that are unlikely to be covered by existing resources. As a step towards addressing these issues, we are building the Multi-Dialect Dataset of Dialogues, or MD3. Our first release focuses on three varieties of global English: Indian English (en-in), Nigerian English (en-ng), and U.S. English (en-us).

Because many dialect features are inhibited in written form, the MD3 dataset is based on spoken dialogues. Our goal was to elicit informal conversational speech from information-sharing activities. To this end, the MD3 conversations are organized around guessing games, in which one speaker (the "describer") must communicate a piece of information to the other (the "guesser"). There are two types of games: a word-guessing game, in which the describer must communicate a word or phrase while avoiding a list of banned words, and an image-guessing game, in which the describer must describe an image well enough for the guesser to select it from a set of twelve similar images.

Our initial release contains roughly 20 hours of audio from the three locales, along with orthographic transcripts totaling approximately 200,000 words across 3,600 games. We also release metadata about the guessing games that prompted each dialogue. We hope that this dataset will serve as a benchmark for dialect-robust natural language processing and as a resource for the study of global English.
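
To give a rough sense of scale, the figures above can be turned into per-game averages. The sketch below is only back-of-the-envelope arithmetic on the approximate totals quoted in this description. It assumes that the 20 hours, 200,000 words, and 3,600 games are all totals across the three locales combined, and the variable names are illustrative rather than fields of the released metadata.

```python
# Approximate totals quoted in the dataset description (assumed to be
# combined across en-in, en-ng, and en-us; not exact release statistics).
HOURS_OF_AUDIO = 20
TOTAL_WORDS = 200_000
TOTAL_GAMES = 3_600

# Average length of one guessing game, in seconds of audio.
seconds_per_game = HOURS_OF_AUDIO * 3600 / TOTAL_GAMES

# Average number of transcribed words per game.
words_per_game = TOTAL_WORDS / TOTAL_GAMES

# Average density of transcribed words per minute of audio.
words_per_minute = TOTAL_WORDS / (HOURS_OF_AUDIO * 60)

print(f"about {seconds_per_game:.0f} seconds of audio per game")
print(f"about {words_per_game:.0f} words per game")
print(f"about {words_per_minute:.0f} words per minute of audio")
```

Under these assumptions, a typical game is a short exchange of a few dozen words, which is consistent with the quick, informal guessing-game format described above. Because the input totals are themselves rounded, the averages are indicative only.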