Making Transformers Solve Compositional Tasks
Abstract
Several studies have reported the inability of Transformer models to generalize compositionally. In this paper, we explore the design space of Transformer models, showing that several design decisions, such as the position encodings, decoder type, model architecture, and encoding of the target task imbue Transformers with different inductive biases, leading to better or worse compositional generalization. In particular we show that Transformers can generalize compositionally significantly better than previously reported in the literature if configured appropriately.