Byte-level Machine Reading across Morphologically Varied Languages

Tom Kenter; Llion Jones; Daniel Hewlett

Association for the Advancement of Artificial Intelligence (www.aaai.org) (2018)

Abstract

The machine reading task, where a computer reads a document and answers questions about it, is important in artificial intelligence research. Recently, many models have been proposed to address it. Word-level models, which have words as units of input and output, have proven to yield state-of-the-art results when evaluated on English datasets. However, in morphologically richer languages, many more unique words exist than in English due to highly productive prefix and suffix mechanisms. This may set back word-level models, since they may have to employ vocabularies too large to allow for efficient computation. Multiple alternative input granularities have been proposed to avoid large input vocabularies, such as morphemes, character n-grams, and bytes. Bytes are advantageous as they provide a universal encoding format across languages and allow for a small vocabulary size, which, moreover, is identical for every input language. In this work, we investigate whether bytes are suitable as input units across morphologically varied languages. To test this, we introduce two large-scale machine reading datasets in morphologically rich languages, Turkish and Russian. We implement four byte-level models, representing the major types of machine reading models, and introduce a new seq2seq variant, called encoder-transformer-decoder. We show that, for all languages considered, there are models reading bytes that outperform the current state-of-the-art word-level baseline. Moreover, the newly introduced encoder-transformer-decoder performs best on the morphologically most involved dataset, Turkish. The large-scale Turkish and Russian machine reading datasets are released to the public.
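The point about bytes giving a small vocabulary that is identical for every language can be illustrated with a minimal sketch. It assumes UTF-8 as the byte encoding, which the abstract does not specify; the helper names `to_byte_ids` and `from_byte_ids` and the sample words are hypothetical and are not taken from the paper or its datasets.

```python
# Minimal sketch: treating text as a sequence of byte IDs.
# Assumption: UTF-8 encoding (the abstract does not name the encoding).

def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Invert to_byte_ids: turn a list of byte IDs back into text."""
    return bytes(ids).decode("utf-8")


# Illustrative words only, chosen to show ASCII and non-ASCII characters.
samples = {
    "English": "houses",
    "Turkish": "gözlüklerimiz",
    "Russian": "домами",
}

for language, word in samples.items():
    ids = to_byte_ids(word)
    # Every ID falls in the same fixed range, whatever the language.
    assert all(0 <= i < 256 for i in ids)
    # The mapping is lossless: the original text can be recovered.
    assert from_byte_ids(ids) == word
    print(f"{language}: {len(word)} characters -> {len(ids)} bytes")
```

Under this assumption the input vocabulary never exceeds 256 symbols, however many distinct word forms a language produces. The trade-off is longer input sequences, since characters outside ASCII occupy more than one byte each.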

Research Areas

Natural language processing
