- Shen Yan
- Xuehan Xiong
- Anurag Arnab
- Zhichao Lu
- Mi Zhang
- Chen Sun
- Cordelia Schmid
Abstract
Video understanding often requires reasoning at multiple spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes, and can effectively leverage different transformer encoder architectures. We present thorough ablation studies of our model and achieve state-of-the-art results on five standard datasets. We will release code and pretrained checkpoints to facilitate further research.
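The architecture the abstract describes, with separate encoders over views of the video tokenized at different temporal resolutions and lateral connections fusing information across views, can be sketched roughly as follows. This is a toy NumPy illustration under stated assumptions, not the paper's implementation: the `encode` function is a linear-plus-ReLU stand-in for a full transformer encoder, and `tokenize`, `lateral_fuse`, and all weight shapes are hypothetical names chosen for the sketch.

```python
import numpy as np

def tokenize(video, tubelet_t):
    """Split a video (T, H, W, C) into non-overlapping temporal tubelets of
    length tubelet_t; each tubelet becomes one token of dimension H*W*C.
    Smaller tubelet_t yields a finer (longer) token sequence."""
    t, h, w, c = video.shape
    n = t // tubelet_t
    return video[: n * tubelet_t].reshape(n, tubelet_t, h * w * c).mean(axis=1)

def encode(tokens, weights):
    """Stand-in for a transformer encoder over one view: a single
    linear projection with a ReLU nonlinearity."""
    return np.maximum(tokens @ weights, 0.0)

def lateral_fuse(fine, coarse, w_lat):
    """Lateral connection: pool the fine view's tokens down to the coarse
    view's sequence length, project them, and add them to the coarse tokens."""
    ratio = fine.shape[0] // coarse.shape[0]
    pooled = fine[: coarse.shape[0] * ratio].reshape(
        coarse.shape[0], ratio, -1).mean(axis=1)
    return coarse + pooled @ w_lat

# Demo: two views of one clip at different temporal resolutions.
rng = np.random.default_rng(0)
video = rng.random((8, 4, 4, 3))          # 8 frames of 4x4 RGB
d = 4 * 4 * 3                             # token dimension
w_fine, w_coarse, w_lat = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

fine = encode(tokenize(video, tubelet_t=1), w_fine)     # 8 fine tokens
coarse = encode(tokenize(video, tubelet_t=4), w_coarse)  # 2 coarse tokens
fused = lateral_fuse(fine, coarse, w_lat)                # fused coarse view
clip_repr = fused.mean(axis=0)            # pooled clip representation
```

Fusing the fine-grained view into the coarser one mirrors the lateral-connection idea: each view keeps its own encoder, and cross-view information flows through a small projection rather than through a single shared token sequence.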