ETC: Encoding Long and Structured Inputs in Transformers
Abstract
Transformer models have advanced the state of the art in many NLP tasks. In this paper, we present a new Transformer architecture, Extended Transformer Construction (ETC), that addresses two key limitations of existing architectures: difficulty scaling to long inputs and difficulty ingesting structured inputs. The main innovation is a new global-local attention mechanism between a global memory and the input tokens, which allows attention to scale to longer inputs. We show that combining global-local attention with relative position encodings and a Contrastive Predictive Coding (CPC) pre-training task allows ETC to naturally handle structured data. We achieve new state-of-the-art results on two natural language datasets requiring long and/or structured inputs.
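
To make the global-local attention pattern concrete, the following is a minimal sketch of an attention mask, not the implementation used in ETC. It assumes that global memory tokens attend to and are attended by every token, and that input tokens attend to each other only within a fixed local radius. The function name `global_local_mask` and the parameters `n_global`, `n_long`, and `radius` are hypothetical and chosen for illustration only.

```python
import numpy as np


def global_local_mask(n_global: int, n_long: int, radius: int) -> np.ndarray:
    """Return a boolean mask of shape (n, n), n = n_global + n_long.

    mask[i, j] is True when token i may attend to token j.
    Global tokens occupy indices [0, n_global); input tokens follow.
    Assumption (illustrative): input-to-input attention is limited to
    a window of `radius` positions on each side.
    """
    n = n_global + n_long
    mask = np.zeros((n, n), dtype=bool)

    # Global tokens attend to every token (global and input).
    mask[:n_global, :] = True
    # Every token attends to the global tokens.
    mask[:, :n_global] = True

    # Input tokens attend to other input tokens only within the local window.
    idx = np.arange(n_long)
    local = np.abs(idx[:, None] - idx[None, :]) <= radius
    mask[n_global:, n_global:] = local
    return mask


mask = global_local_mask(n_global=2, n_long=6, radius=1)
allowed_pairs = int(mask.sum())
```

Under these assumptions the number of allowed attention pairs grows roughly linearly in the input length for a fixed radius and global memory size, rather than quadratically. A dense mask like this one only illustrates the pattern; an efficient implementation would avoid materializing the full matrix.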