Multiview Transformers for Video Recognition

Shen Yan; Xuehan Xiong; Anurag Arnab; Zhichao Lu; Mi Zhang; Chen Sun; Cordelia Schmid

The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) (2022)

Abstract

Video understanding often requires reasoning at multiple spatiotemporal resolutions.
To this end, we present Multiview Transformers for Video Recognition (MTV).
Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes, and can effectively leverage different transformer encoder architectures.
We present thorough ablation studies of our model and achieve state-of-the-art results on five standard datasets.
We will release code and pretrained checkpoints to facilitate further research.
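
The multiview idea in the abstract can be illustrated with a small sketch. It is not the released MTV code. It assumes that a "view" is the video tokenized with a particular tubelet size (frames x height x width), and that a lateral connection lets the tokens of one view attend to the tokens of another. The names `num_tokens` and `cross_view_fusion`, the tubelet sizes, and the weight-free attention are all hypothetical simplifications chosen for illustration; a real model would use learned projections and full transformer encoders per view.

```python
import math

def num_tokens(video_shape, tubelet):
    """Count tokens when a (T, H, W) video is cut into (t, h, w) tubelets."""
    T, H, W = video_shape
    t, h, w = tubelet
    if T % t or H % h or W % w:
        raise ValueError("video shape must be divisible by the tubelet size")
    return (T // t) * (H // h) * (W // w)

def cross_view_fusion(queries, context):
    """Toy lateral connection: each query token attends over another view.

    Assumption: both views share the same hidden size and no learned
    weights are used. Returns the query tokens plus the attended context
    (a residual update), so the output has the same shape as `queries`.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Scaled dot-product scores against every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        attended = [
            sum(w * k[i] for w, k in zip(weights, context)) / total
            for i in range(d)
        ]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused

# Three hypothetical views of a 32-frame, 224x224 clip: finer temporal
# tubelets yield more tokens, coarser ones yield fewer.
video = (32, 224, 224)
views = {"fine": (2, 16, 16), "medium": (4, 16, 16), "coarse": (8, 16, 16)}
token_counts = {name: num_tokens(video, tub) for name, tub in views.items()}
print(token_counts)  # {'fine': 3136, 'medium': 1568, 'coarse': 784}

# Fuse a tiny "coarse" view with information from a tiny "fine" view.
coarse_tokens = [[1.0, 0.0]]
fine_tokens = [[0.5, 0.5], [0.5, 0.5]]
print(cross_view_fusion(coarse_tokens, fine_tokens))  # [[1.5, 0.5]]
```

The token counts show why per-view encoders can differ in cost: the view with the smallest tubelets produces four times as many tokens as the coarsest one in this example.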
