Multiview Transformers for Video Recognition
Abstract
Video understanding often requires reasoning at multiple spatiotemporal resolutions.
To this end, we present Multiview Transformers for Video Recognition (MTV).
Our model consists of separate encoders that represent different views of the input video, with lateral connections that fuse information across views. Here, a view is a representation of the video obtained by tokenizing it at a different spatiotemporal resolution.
MTV consistently outperforms single-view counterparts in both accuracy and computational cost across a range of model sizes, and can effectively leverage different transformer encoder architectures.
We present thorough ablation studies of our model and achieve state-of-the-art results on five standard datasets.
We will release code and pretrained checkpoints to facilitate further research.