We present a new method to learn video representations from large-scale unlabeled video data. We formulate our unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are also shared across different modalities via distillation. Our formulation allows for the distillation of audio, optical flow and temporal information into a single, RGB-based convolutional neural network. We also compare the effects of using additional unlabeled video data and evaluate our representation learning on standard public video datasets.
We introduce the use of an evolutionary algorithm to obtain a better multi-modal, multi-task loss function for training the network. AutoML has been successfully applied to architecture search and data augmentation. Here we extend AutoML to unsupervised representation learning by automatically finding the optimal weighting of tasks for representation learning.
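To make the idea of evolving task weights concrete, the following is a minimal sketch, not our actual implementation. It assumes the overall loss is a weighted sum of per-task losses and that a scalar `fitness` function scores a candidate weight vector. The names `combined_loss`, `evolve_loss_weights`, and `fitness`, as well as the tournament selection and Gaussian mutation scheme, are illustrative assumptions. In practice, evaluating fitness would involve training a network with the candidate weights and scoring the resulting representation; here a toy function stands in for that step.

```python
import random


def combined_loss(weights, task_losses):
    """Weighted sum of per-task loss values (hypothetical form of the loss)."""
    return sum(w * l for w, l in zip(weights, task_losses))


def evolve_loss_weights(fitness, num_tasks, population_size=8,
                        rounds=20, sigma=0.1, seed=0):
    """Toy evolutionary search over loss weights in [0, 1] (illustrative only)."""
    rng = random.Random(seed)
    # Each individual holds one weight per task loss, initialised uniformly.
    population = [[rng.random() for _ in range(num_tasks)]
                  for _ in range(population_size)]
    for _ in range(rounds):
        # Tournament selection: sample three individuals, keep the fittest.
        parent = max(rng.sample(population, 3), key=fitness)
        # Mutation: perturb one weight with Gaussian noise, clipped to [0, 1].
        child = list(parent)
        i = rng.randrange(num_tasks)
        child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, sigma)))
        # Replace the currently weakest individual with the child.
        worst = min(range(population_size),
                    key=lambda k: fitness(population[k]))
        population[worst] = child
    return max(population, key=fitness)


def toy_fitness(weights, target=(0.2, 0.7, 0.5)):
    """Stand-in fitness: higher when weights are closer to an arbitrary target."""
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


best_weights = evolve_loss_weights(toy_fitness, num_tasks=3)
```

Because every fitness evaluation in a real setting requires training a network, the population size and number of rounds would be chosen with that cost in mind.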