Google Research

Multiview Transformers for Video Recognition

The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) (2022)

Abstract

Video understanding often requires reasoning at multiple spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes, and can effectively leverage different transformer encoder architectures. We present thorough ablation studies of our model and achieve state-of-the-art results on five standard datasets. We will release code and pretrained checkpoints to facilitate further research.
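
The multiview idea in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration, not something the abstract states: views are formed by cutting the video into tubelets of different temporal lengths, a random linear projection stands in for the learned token embedding, the transformer layers of each encoder are omitted, and mean pooling plus a linear projection stands in for the lateral connection. The names `tubelet_tokens`, `make_view`, and `lateral_fuse` are hypothetical.

```python
import numpy as np

def tubelet_tokens(video, t, h, w):
    """Split a (T, H, W, C) video into non-overlapping t x h x w tubelets,
    each flattened to one vector."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    x = video.reshape(T // t, t, H // h, h, W // w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # (nt, nh, nw, t, h, w, C)
    return x.reshape(-1, t * h * w * C)

def make_view(video, t, dim, rng):
    """Tokenize one view and embed it with a random projection
    (a stand-in for a learned embedding; no transformer layers here)."""
    tokens = tubelet_tokens(video, t, 16, 16)
    proj = rng.standard_normal((tokens.shape[1], dim)) * 0.02
    return tokens @ proj

def lateral_fuse(src, dst, rng):
    """Assumed fusion: add a projected mean-pooled summary of the
    source view to every token of the destination view."""
    w = rng.standard_normal((src.shape[1], dst.shape[1])) * 0.02
    summary = src.mean(axis=0, keepdims=True) @ w  # shape (1, dst_dim)
    return dst + summary

rng = np.random.default_rng(0)
video = rng.standard_normal((32, 64, 64, 3))        # (T, H, W, C)

fine = make_view(video, t=2, dim=64, rng=rng)       # short tubelets, many tokens
coarse = make_view(video, t=8, dim=128, rng=rng)    # long tubelets, few tokens
fused = lateral_fuse(fine, coarse, rng)             # coarse view informed by fine view

print(fine.shape, coarse.shape, fused.shape)
```

In this sketch the fine view yields four times as many tokens as the coarse view, which is why it is paired with a narrower embedding. That pairing is a design choice of the sketch, not a claim about the paper's configuration.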
