Temporal Reasoning in Videos using Convolutional Gated Recurrent Units
Abstract
Recently, deep-learning-based models have pushed the state-of-the-art performance for the task of action recognition in videos. Yet, for many large-scale datasets like Kinetics and UCF101, the correct temporal order of frames does not appear to be essential for solving the task. We find that temporal order matters more for the recently introduced 20BN Something-Something dataset, where the task of fine-grained action recognition requires the model to perform temporal reasoning. We show that when temporal order matters, recurrent models can significantly outperform non-recurrent models. This also gives us an opportunity to inspect the recurrent units qualitatively and gain insight into what they encode about actions in videos.
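To make the recurrent unit named in the title concrete, here is a minimal sketch of a convolutional GRU cell in PyTorch. It is illustrative only, not the authors' implementation: the channel counts, kernel size, and single-cell setup are assumptions, and `ConvGRUCell` is a hypothetical name. The key idea it shows is that every gate is computed with a 2-D convolution, so the hidden state remains a spatial feature map rather than a flat vector.

```python
# Illustrative sketch of a convolutional GRU cell (ConvGRU); hyperparameters
# are placeholders, not the paper's exact configuration.
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    """GRU cell whose gates use 2-D convolutions, keeping the hidden
    state as a spatial feature map."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # preserve spatial dimensions
        # One convolution jointly computes the update gate z and reset gate r.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               2 * hidden_channels, kernel_size,
                               padding=padding)
        # Convolution producing the candidate hidden state h_tilde.
        self.candidate = nn.Conv2d(in_channels + hidden_channels,
                                   hidden_channels, kernel_size,
                                   padding=padding)
        self.hidden_channels = hidden_channels

    def forward(self, x, h):
        # x: (batch, in_channels, H, W) frame features at time t
        # h: (batch, hidden_channels, H, W) previous hidden state
        if h is None:
            h = torch.zeros(x.size(0), self.hidden_channels,
                            x.size(2), x.size(3), device=x.device)
        z, r = torch.sigmoid(
            self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # blend of old and candidate state


# Usage: unroll the cell over per-frame CNN features of a short clip.
cell = ConvGRUCell(in_channels=64, hidden_channels=64)
frames = torch.randn(8, 2, 64, 14, 14)  # (batch, time, C, H, W) dummy features
h = None
for t in range(frames.size(1)):
    h = cell(frames[:, t], h)
print(h.shape)  # torch.Size([8, 64, 14, 14])
```

Because the state stays spatial, the unit can integrate information over time at each location of the feature map, which is what lets the qualitative inspection mentioned above localize what the recurrent units encode.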