The need to understand periodic videos is pervasive. Videos of biological processes, manufacturing processes, people exercising, and objects being manipulated are only a few examples in which the respective fields would benefit greatly from processing periodic videos automatically.
We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in using temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen videos in the wild. We train this model on a synthetic dataset generated from a large unlabeled video dataset by sampling short clips of varying lengths and repeating them with different periods. However, simply training powerful video classification models on this synthetic dataset does not transfer to real videos. We therefore constrain the period prediction model to use the self-similarity of temporal representations, which ensures that the model generalizes to real videos with repeated actions. This combination of synthetic data and a powerful yet constrained model allows us to predict periods in a class-agnostic fashion.
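To make the two ingredients above concrete, the following is a minimal sketch, not the model itself. It builds a synthetic periodic sequence by repeating a short clip and then computes a temporal self-similarity matrix over per-frame embeddings. Several things here are assumptions made only for illustration: random vectors stand in for learned frame embeddings, similarity is taken to be the negative squared Euclidean distance followed by a row-wise softmax, and the helper names `repeat_clip`, `self_similarity`, and `estimate_period` are hypothetical. The hand-written lag readout in `estimate_period` merely shows that the period is visible in the matrix; it does not stand in for a trained period predictor.

```python
import numpy as np


def repeat_clip(clip, count):
    """Tile a clip of shape (T, D) `count` times along time; the period is T frames."""
    return np.concatenate([clip] * count, axis=0)


def self_similarity(embeddings):
    """Row-wise softmax over negative squared distances between frame embeddings.

    Assumption: this similarity function is one plausible choice for illustration.
    """
    sq = np.sum(embeddings ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (embeddings @ embeddings.T)
    dist = np.maximum(dist, 0.0)                 # guard against tiny negative values
    logits = -dist
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def estimate_period(tsm):
    """Toy readout: smallest lag whose off-diagonal mean is close to the maximum."""
    n = tsm.shape[0]
    lags = np.arange(1, n // 2 + 1)
    scores = np.array([np.mean(np.diagonal(tsm, offset=int(k))) for k in lags])
    candidates = np.flatnonzero(scores >= 0.99 * scores.max())
    return int(lags[candidates[0]])


rng = np.random.default_rng(0)
clip = rng.normal(size=(8, 16))   # 8 "frames" with 16-dim stand-in embeddings
video = repeat_clip(clip, 4)      # 32 frames, true period of 8 frames
tsm = self_similarity(video)      # (32, 32) temporal self-similarity matrix
period = estimate_period(tsm)
```

Because the only information passed downstream is the frame-to-frame similarity structure, any predictor built on `tsm` is agnostic to what the frames depict, which is the intuition behind using self-similarity as a bottleneck.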
Our repetition counting model substantially exceeds state-of-the-art performance on existing periodicity benchmarks. We also collect a new, challenging dataset called Countix, which is more difficult than the existing datasets and captures the difficulties of counting repetitions in real-world videos. We present extensive experiments on this dataset and hope that it encourages more research on this important problem.