OmniNet: Omnidirectional Representations from Transformers

Abstract

This paper proposes Omnidirectional Representations from Transformers (\textsc{OmniNet}). In OmniNet, instead of maintaing a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirection attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computationally expensive costs of full receptive field attention, we leverage efficient self-attention models such as kernel-based \cite{choromanski2020rethinking}, low-rank attention \cite{wang2020linformer} and/or Big Bird \cite{zaheer2020big} as the meta-learner. We conduct extensive experiments on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA) and Image Recognition, showing that OmniNet not only achieves considerable improvements when equipped with both sequence-based (1D) Transformers but also on image recognition (finetuning and few shot learning) tasks. OmniNet also achieves state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr and Long Range Arena.