TyDi QA: A Multilingual Question Answering Benchmark
February 6, 2020
Posted by Jonathan Clark, Research Scientist, Google Research
Question answering technologies help people on a daily basis: when faced with a question, such as “Is squid ink safe to eat?”, users can ask a voice assistant or type a search and expect to receive an answer. Last year, we released the English-language Natural Questions dataset to the research community to provide a challenge that reflects the needs of real users. However, there are thousands of different languages, and many of those use very different approaches to construct meaning. For example, while English changes words to indicate one object (“book”) versus many (“books”), Arabic has a third, dual form to indicate that there are exactly two of something ("كتابان", kitaban, “two books”), beyond the singular ("كتاب", kitab, “book”) and plural ("كتب", kutub, “books”). In addition, some languages, such as Japanese, do not use spaces between words. Creating machine learning systems that can understand the many ways languages express meaning is challenging, and training such systems requires examples from the diverse languages to which they will be applied.
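To make one of these challenges concrete, consider tokenization: splitting text on whitespace, often taken for granted in English pipelines, recovers nothing for Japanese, which does not delimit words with spaces. A quick illustration in Python (the Japanese rendering of the squid ink question is our own):

```python
# Whitespace tokenization separates English words...
print("Is squid ink safe to eat?".split())
# ['Is', 'squid', 'ink', 'safe', 'to', 'eat?']

# ...but leaves a Japanese sentence as one undivided string,
# since Japanese does not use spaces between words.
print("イカ墨は食べても安全ですか？".split())
# ['イカ墨は食べても安全ですか？']
```

A multilingual system therefore cannot assume that “words” come for free; it must learn, or be given, a segmentation strategy appropriate to each language.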
To encourage research on multilingual question answering, today we are releasing TyDi QA, a question answering corpus covering 11 Typologically Diverse languages. Described in our paper, “TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages”, our corpus is inspired by typological diversity, the notion that different languages express meaning in structurally different ways. Because we selected a set of languages that are typologically distant from each other for this corpus, we expect models performing well on this dataset to generalize across a large number of the languages in the world.
A Typologically Diverse Collection of Languages
TyDi QA includes over 200,000 question-answer pairs from 11 languages representing a diverse range of linguistic phenomena and data challenges. Many of these languages use non-Latin alphabets, such as Arabic, Bengali, Korean, Russian, Telugu, and Thai. Others form words in complex ways, including Arabic, Finnish, Indonesian, Kiswahili, and Russian. Japanese uses four scripts (all four appear in “24時間でのサーキット周回数”, “number of circuit laps in 24 hours”), while the Korean alphabet itself is highly compositional. These languages also range from having abundant data on the web (English and Arabic) to very little (Bengali and Kiswahili). We expect that systems that can address these challenges will be successful for a very large number of languages.
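For readers who want to explore the data, the primary-task files are distributed as gzipped JSON Lines, one question per line paired with a Wikipedia article. As a minimal sketch, tallying examples per language might look like the following (the file name and the "language" field reflect the v1.0 release; check the field names against the download you actually have):

```python
import gzip
import json
from collections import Counter

def count_examples_by_language(path):
    """Tally TyDi QA examples per language.

    Assumes the primary-task JSON Lines format, in which each line
    is a JSON object carrying (among other fields) a "language" key.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            counts[example["language"]] += 1
    return counts

# Hypothetical usage with the v1.0 training file:
# print(count_examples_by_language("tydiqa-v1.0-train.jsonl.gz"))
```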
Creating Realistic Data
Many early QA datasets used by the research community were created by first showing paragraphs to people and then asking them to write questions based on what could be answered from reading the paragraph. However, since people could see the answer while writing each question, this approach yielded questions that often contained the same words as the answer. As a result, machine learning algorithms trained on such data would favor word matching, oblivious to the more nuanced answers required to satisfy users’ needs.
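To see why this matters, consider a deliberately naive baseline (purely illustrative, not part of TyDi QA's tooling) that ranks candidate sentences by how many question words they contain. On datasets whose questions were written while looking at the answer, this kind of word matching can look deceptively strong; TyDi QA's collection procedure, described next, is designed to deny systems that shortcut.

```python
def lexical_overlap(question: str, sentence: str) -> float:
    """Fraction of question tokens that also appear in the sentence."""
    q_tokens = set(question.lower().split())
    s_tokens = set(sentence.lower().split())
    return len(q_tokens & s_tokens) / len(q_tokens) if q_tokens else 0.0

question = "is squid ink safe to eat"
# A sentence that copies the question's wording scores perfectly...
print(lexical_overlap(question, "squid ink is safe to eat in moderation"))   # 1.0
# ...while a paraphrase carrying the same answer scores near zero.
print(lexical_overlap(question, "cephalopod ink poses no known health risk")) # ~0.17
```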
To construct a more natural dataset, we collected questions from people who wanted an answer, but did not know the answer yet. To inspire questions, we showed people an interesting passage from Wikipedia written in their native language. We then had them ask a question, any question, as long as it was not answered by the passage and they actually wanted to know the answer. This is similar to how your own curiosity might spawn questions about interesting things you see while walking down the street. We encouraged our question writers to let their imaginations run. Does a passage about ice make you think about popsicles in summer? Great! Ask who invented popsicles. Importantly, questions were written directly in each language, not translated, so many questions are unlike those seen in an English-first corpus. One question in Bengali asks, “সফেদা ফল খেতে কেমন?” (What does sapodilla taste like?) Never heard of it? That’s probably because it’s grown much more commonly in India than the U.S.
For each of these questions, we performed a Google Search for the best-matching Wikipedia article in the appropriate language and asked a person to find and highlight the answer within that article. We expected some interesting divergences between questions and answers, since the question writers never saw the answers; combined with the astonishing breadth of linguistic phenomena in the world’s languages, this made the situation even more complex than we had anticipated.
For example, in Finnish, there are fascinating examples in which the words “day” and “week” are represented very differently in the question and answer. To select the correct answer sentence out of an entire Wikipedia article, a system needs to recognize the relationship among the Finnish words viikonpäivät (“days of the week”), seitsenpäiväinen (“seven-day”), and viikko (“week”).
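One way to see the difficulty: these related Finnish words share material only below the word level, so whole-token matching finds nothing while subword comparison does. The character n-gram sketch below is purely illustrative and is not the method from the paper:

```python
def char_ngrams(word: str, n: int = 4) -> set:
    """Character n-grams as a crude proxy for subword structure."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

question_word = "viikonpäivät"    # "days of the week"
answer_word = "seitsenpäiväinen"  # "seven-day"

# Whole-token matching sees two unrelated strings...
print(question_word == answer_word)                           # False
# ...but shared n-grams around "päivä" ("day") reveal the link.
print(char_ngrams(question_word) & char_ngrams(answer_word))
# {'npäi', 'päiv', 'äivä'}
```

Models with subword vocabularies or character-aware encoders can pick up exactly this kind of signal, which is one reason morphologically rich languages like Finnish are valuable in the benchmark.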
Making Progress Together as a Research Community
It is our hope that this dataset will push the research community to innovate in ways that will create more helpful question-answering systems for users around the world. To track the community’s progress, we have established a leaderboard where participants can evaluate the quality of their machine learning systems and are also open-sourcing a question answering system that uses the data. Please visit the challenge website to view the leaderboard and learn more.
Acknowledgements
This dataset is the result of a team of many Googlers including (alphabetically) Dan Garrette, Eunsol Choi, Jennimaria Palomaki, Michael Collins, Tom Kwiatkowski, and Vitaly Nikolaev. The Finnish gloss above is by Jennimaria Palomaki.