Hierarchical Long-term Video Prediction without Supervision

Nevan Wichers; Ruben Villegas; Dumitru Erhan; Honglak Lee

ICML (2018)

Abstract

Much recent research has been devoted to video
prediction and generation, yet most previous
work has demonstrated only limited success,
and only on short-term horizons. The
hierarchical video prediction method by Villegas
et al. (2017b) is an example of a state-of-the-art
method for long-term video prediction, but their
method is limited because it requires ground truth
annotation of high-level structures (e.g., human
joint landmarks) at training time. Our network
encodes the input frame, predicts a high-level encoding
into the future, and then a decoder with
access to the first frame produces the predicted
image from the predicted encoding. The decoder
also produces a mask that outlines the predicted
foreground object (e.g., person) as a by-product.
Unlike Villegas et al. (2017b), we develop a novel
training method that jointly trains the encoder, the
predictor, and the decoder together without high-level
supervision; we further improve upon this
by using an adversarial loss in the feature space to
train the predictor. Our method can predict about
20 seconds into the future and provides better results
compared to Denton and Fergus (2018) and
Finn et al. (2016) on the Human 3.6M dataset.
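
The data flow described in the abstract (encode the input frame, roll a high-level encoding forward in time, then decode each predicted encoding together with the first frame into an image and a mask) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `HierarchicalPredictor`, the sizes `FRAME_SIZE` and `CODE_SIZE`, the random linear layers, and the simple recurrent update are all assumptions made for this example, and nothing here is trained.

```python
import math
import random

# Illustrative sizes only; neither value comes from the paper.
FRAME_SIZE = 8   # a "frame" is a flat list of 8 pixel values in [0, 1]
CODE_SIZE = 4    # length of the high-level encoding


def make_matrix(rows, cols, rng):
    """Random weights standing in for a trained network layer."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class HierarchicalPredictor:
    """Hypothetical stand-in for the encoder / predictor / decoder trio."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.enc_w = make_matrix(CODE_SIZE, FRAME_SIZE, rng)
        self.pred_w = make_matrix(CODE_SIZE, CODE_SIZE, rng)
        # The decoder sees the predicted encoding AND the first frame.
        self.dec_img_w = make_matrix(FRAME_SIZE, CODE_SIZE + FRAME_SIZE, rng)
        self.dec_mask_w = make_matrix(FRAME_SIZE, CODE_SIZE + FRAME_SIZE, rng)

    def encode(self, frame):
        """Map an input frame to a high-level encoding."""
        return [math.tanh(v) for v in matvec(self.enc_w, frame)]

    def predict_next(self, code):
        """Advance the encoding one step into the future (assumed update)."""
        return [math.tanh(v) for v in matvec(self.pred_w, code)]

    def decode(self, code, first_frame):
        """Produce a predicted image and a foreground mask, both in (0, 1)."""
        inputs = code + first_frame
        image = [sigmoid(v) for v in matvec(self.dec_img_w, inputs)]
        mask = [sigmoid(v) for v in matvec(self.dec_mask_w, inputs)]
        return image, mask

    def rollout(self, first_frame, steps):
        """Predict `steps` future frames from a single input frame."""
        code = self.encode(first_frame)
        outputs = []
        for _ in range(steps):
            code = self.predict_next(code)  # prediction stays in encoding space
            outputs.append(self.decode(code, first_frame))
        return outputs


model = HierarchicalPredictor(seed=0)
predictions = model.rollout([0.5] * FRAME_SIZE, steps=3)
print(len(predictions), "predicted (image, mask) pairs")
```

The point of the sketch is that the recurrence runs entirely on the encoding, and pixels are produced only at decode time, with the first frame supplied to the decoder at every step.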

Research Areas

Machine intelligence
