Appearance-and-Relation Networks for Video Classification

Limin Wang; Wei Li; Wen Li; Luc Van Gool

arXiv (2017)

Abstract

Spatiotemporal feature learning in videos is a fundamental and difficult problem in computer vision. This paper presents a new architecture, termed Appearance-and-Relation Networks (ARTNets), to learn video representations in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called SMART blocks, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner. Specifically, the SMART block decouples spatiotemporal feature learning into an appearance branch for spatial modeling and a relation branch for temporal modeling. The appearance branch is based on linear combinations of pixels or filter responses in each frame, while the relation branch is based on multiplicative interactions between pixels or filter responses across multiple frames. We perform experiments on three action recognition benchmarks, Kinetics, UCF101, and HMDB51, demonstrating that the SMART block obtains a clear improvement over 3D convolution for spatiotemporal feature learning. Under the same training setting, ARTNets achieve performance superior to existing state-of-the-art methods on all three datasets. The code is available at https://github.com/wanglimin/ARTNet
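
The contrast between the two branches can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked in the abstract for that). It assumes frames are flattened into short lists of pixel values, and it assumes the simplest possible multiplicative interaction, namely the product of two linear filter responses taken from two different frames. The function names `appearance_response` and `relation_response` and all weights are hypothetical and chosen only for illustration.

```python
# Toy sketch only: hypothetical names and weights, not the ARTNet code.

def appearance_response(frame, weights):
    """Linear combination of the pixels of a single frame."""
    return sum(w * p for w, p in zip(weights, frame))


def relation_response(frame_a, frame_b, weights_a, weights_b):
    """Assumed multiplicative interaction across two frames:
    the product of one linear filter response from each frame."""
    return (appearance_response(frame_a, weights_a)
            * appearance_response(frame_b, weights_b))


# Two consecutive "frames", each flattened to four pixel values.
frame_t = [1.0, 2.0, 3.0, 4.0]
frame_t1 = [2.0, 3.0, 4.0, 5.0]

# Appearance: an averaging filter applied to each frame independently.
mean_filter = [0.25, 0.25, 0.25, 0.25]
app_t = appearance_response(frame_t, mean_filter)
app_t1 = appearance_response(frame_t1, mean_filter)

# Relation: pixel 0 of frame t multiplied by pixel 1 of frame t+1.
pick_first = [1.0, 0.0, 0.0, 0.0]
pick_second = [0.0, 1.0, 0.0, 0.0]
rel = relation_response(frame_t, frame_t1, pick_first, pick_second)
```

In this sketch the appearance response depends on one frame at a time, whereas the relation response is zero whenever either frame's filter response is zero, so it only fires on a joint pattern across frames.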

Research Areas

Machine perception
