Multiview Transformers for Video Recognition

Shen Yan
Xuehan Xiong
Anurag Arnab
Zhichao Lu
Mi Zhang
The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (2022)


Video understanding often requires reasoning at multiple spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes, and can effectively leverage different transformer encoder architectures. We present thorough ablation studies of our model and achieve state-of-the-art results on five standard datasets. We will release code and pretrained checkpoints to facilitate further research.
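
To make the idea of "views" at different spatiotemporal resolutions concrete, the sketch below counts how many tokens each view would produce. It rests on an assumption the abstract does not state: that a view is obtained by cutting the video into non-overlapping spatiotemporal blocks ("tubelets") of a fixed size, so larger tubelets give a coarser view with fewer tokens. The function name `tokens_per_view`, the view names, and all sizes are hypothetical and chosen only for illustration; they are not taken from the released code.

```python
from typing import Dict, Tuple

Shape3D = Tuple[int, int, int]  # (frames, height, width)


def tokens_per_view(video_shape: Shape3D,
                    tubelet_shapes: Dict[str, Shape3D]) -> Dict[str, int]:
    """Count tokens per view, assuming non-overlapping tubelets.

    Each view tiles the video with blocks of size (t, h, w); every block
    becomes one token. This tokenization is an illustrative assumption.
    """
    frames, height, width = video_shape
    counts = {}
    for name, (t, h, w) in tubelet_shapes.items():
        if frames % t or height % h or width % w:
            raise ValueError(f"view {name!r}: tubelet does not tile the video")
        counts[name] = (frames // t) * (height // h) * (width // w)
    return counts


# Hypothetical input: 32 frames at 224x224 pixels.
video = (32, 224, 224)

# Hypothetical views: same spatial patch size, different temporal extent.
views = {
    "fine": (2, 16, 16),
    "medium": (4, 16, 16),
    "coarse": (8, 16, 16),
}

counts = tokens_per_view(video, views)
for name, n in counts.items():
    print(f"{name}: {n} tokens")
```

Under these assumptions, the coarse view sees a quarter as many tokens as the fine view. Because self-attention cost grows with the number of tokens, a design with one encoder per view could plausibly give coarser views a larger encoder and finer views a smaller one; how the encoders are actually sized, and how the lateral connections fuse tokens across views, is not specified in the abstract.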