- Ruben Villegas
- Arkanath Pathak
- Harini Kannan
- Dumitru Erhan
- Quoc Le
- Honglak Lee
Abstract
Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex network architectures and highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question whether such handcrafted architectures are necessary and instead propose a different approach: maximizing the capacity of a standard convolutional neural network. We perform the first large-scale empirical study of the effect of capacity on video prediction models. We evaluate our approach on three datasets: one for modeling object interactions, one for modeling human motion, and one for modeling first-person car driving.
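To make the "scale up a standard network" idea concrete, below is a minimal sketch, not the authors' actual model: a plain convolutional frame predictor whose capacity is controlled by a single width multiplier, so the same architecture can be instantiated at several sizes for a capacity study. The layer layout, the `width_mult` parameter, and the use of PyTorch are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's exact model):
# a standard convolutional frame predictor whose capacity is scaled
# by a single width multiplier.
import torch
import torch.nn as nn

class ConvFramePredictor(nn.Module):
    def __init__(self, in_channels=3, base_width=32, width_mult=1):
        super().__init__()
        w = base_width * width_mult  # capacity knob: widen every layer
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, w, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(w, 2 * w, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * w, w, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(w, in_channels, 4, stride=2, padding=1),
        )

    def forward(self, frame):
        # Predict the next frame directly from the current frame.
        return self.decoder(self.encoder(frame))

# Capacity study: the same architecture instantiated at increasing widths.
models = {k: ConvFramePredictor(width_mult=k) for k in (1, 2, 4)}
x = torch.randn(1, 3, 64, 64)   # one 64x64 RGB frame
next_frame = models[4](x)       # predicted next frame, same shape as input
```

The only design choice here is that capacity grows by widening every layer uniformly; depth, resolution, or recurrent state size could be scaled in the same spirit.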