Statistical Separations: When do Transformers outperform feedforward and recurrent networks?
Abstract
While prior work has focused on computational separations between Transformers and fully connected networks, here we focus on a purely statistical separation and ask: which function classes can Transformers learn with fewer samples than fully connected networks, even with unbounded compute? We define a toy task---q-sparse token retrieval---and prove that the sample complexity of Transformers that solve this task is smaller than that of feedforward and recurrent neural networks.