CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Understanding

Jonathan H. Clark; Dan Garrette; Iulia Turc; John Wieting

Transactions of the Association for Computational Linguistics (2022)

Abstract

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy with soft inductive biases in place of hard token boundaries. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by ≥1 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
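
The two ideas in the abstract, vocabulary-free character input and downsampling to shorten the sequence, can be illustrated with a minimal sketch. This is not the CANINE implementation: the function names `to_codepoints` and `downsample`, the use of window averaging, and the rate of 4 are all assumptions chosen for illustration. A real encoder would downsample learned character embeddings with trained layers, not raw code point values.

```python
from typing import List


def to_codepoints(text: str) -> List[int]:
    """Map each character to its Unicode code point; no vocabulary lookup is needed."""
    return [ord(ch) for ch in text]


def downsample(values: List[float], rate: int) -> List[float]:
    """Average non-overlapping windows of `rate` positions.

    Illustrative stand-in (an assumption) for a learned downsampling layer:
    the only point is that the output is roughly `rate` times shorter.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled = []
    for start in range(0, len(values), rate):
        window = values[start:start + rate]  # last window may be shorter
        pooled.append(sum(window) / len(window))
    return pooled


text = "Tokenization-free"
codepoints = to_codepoints(text)
pooled = downsample([float(c) for c in codepoints], rate=4)

# Character-level input is longer than a subword sequence would be;
# downsampling shortens it before the expensive deep stack runs.
print(len(codepoints), len(pooled))
```

Because a transformer's self-attention cost grows quadratically with sequence length, shortening the sequence before the deep stack is what makes character-level input affordable in this kind of design.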
