Bridging the Gap for Token-Free Language Models

DK Choe; Rami Al-Rfou; Mandy Guo; Heeyoung Lee; Noah Constant

(2019)

Abstract

Purely character-based language models have lagged in quality on large-scale datasets, and state-of-the-art language models currently rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the language model is essential to achieving competitive results.
In this paper, we show that, contrary to this conventional wisdom, tokenizer-free language models with sufficient capacity can achieve competitive performance on a large-scale dataset. We train a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieve new state-of-the-art results for tokenizer-free language models, bringing these models on par with their word-based counterparts.
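
To make the contrast between word-tokenized and tokenizer-free input concrete, the following is a minimal illustrative sketch, not the authors' pipeline. The abstract does not specify how the input is encoded, so this assumes each character is mapped to its Unicode code point (a byte-level encoding would be another option). The helper names `word_ids` and `char_ids` and the toy vocabulary are hypothetical.

```python
def word_ids(text, vocab):
    """Word-level input: needs a tokenizer (here, a whitespace split)
    and a fixed vocabulary with an unknown-word fallback."""
    unk = vocab["<unk>"]
    return [vocab.get(word, unk) for word in text.split()]


def char_ids(text):
    """Tokenizer-free input (assumed encoding): each character maps
    directly to its Unicode code point, so no vocabulary is needed."""
    return [ord(ch) for ch in text]


# Toy vocabulary, purely for illustration.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
sentence = "the cat sat"

print(word_ids(sentence, vocab))       # [1, 2, 3]
print(word_ids("the dog sat", vocab))  # [1, 0, 3]  ("dog" is out of vocabulary)
print(len(char_ids(sentence)))         # 11
```

The character-level sequence is several times longer than the word-level one, but it never falls back to an unknown-word symbol and requires no tokenizer to be built ahead of training.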
