Hierarchical Long-term Video Prediction without Supervision
Abstract
Much recent research has been devoted to video prediction and generation, yet most previous work has demonstrated only limited success, generating videos only over short time horizons. The
hierarchical video prediction method by Villegas
et al. (2017b) is an example of a state-of-the-art
method for long-term video prediction, but their
method is limited because it requires ground truth
annotation of high-level structures (e.g., human
joint landmarks) at training time. Our network
encodes the input frame, predicts a high-level encoding
into the future, and then a decoder with
access to the first frame produces the predicted
image from the predicted encoding. The decoder
also produces a mask that outlines the predicted
foreground object (e.g., person) as a by-product.
Unlike Villegas et al. (2017b), we develop a novel
training method that jointly trains the encoder, the
predictor, and the decoder together without high-level
supervision; we further improve upon this
by using an adversarial loss in the feature space to
train the predictor. Our method can predict about 20 seconds into the future and obtains better results than Denton and Fergus (2018) and Finn et al. (2016) on the Human 3.6M dataset.
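To make the encoder-predictor-decoder pipeline concrete, the following is a minimal sketch in PyTorch (which the paper does not use); the module names, layer sizes, and 64x64 frame resolution are illustrative assumptions, not the authors' architecture. The encoder maps a frame to a high-level feature, an LSTM predictor rolls that feature forward in time, and a decoder with access to the first frame renders each predicted feature back to an image together with a foreground mask.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a 64x64 RGB frame to a high-level feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, feat_dim),
        )

    def forward(self, frame):               # frame: (B, 3, 64, 64)
        return self.net(frame)              # -> (B, feat_dim)

class Predictor(nn.Module):
    """Rolls the high-level feature forward in time with an LSTM."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, feat, steps):
        h = feat.new_zeros(feat.size(0), self.cell.hidden_size)
        c = torch.zeros_like(h)
        preds = []
        for _ in range(steps):
            h, c = self.cell(feat, (h, c))
            feat = self.out(h)               # next high-level feature
            preds.append(feat)
        return torch.stack(preds, dim=1)     # (B, steps, feat_dim)

class Decoder(nn.Module):
    """Renders a predicted feature back to pixels, conditioned on the
    first frame; also emits the foreground mask the abstract mentions,
    compositing generated foreground over the first-frame background."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128 + 3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1),  # 3 RGB + 1 mask
        )

    def forward(self, feat, first_frame):
        x = self.fc(feat).view(-1, 128, 8, 8)
        ctx = nn.functional.interpolate(first_frame, size=(8, 8))
        out = self.deconv(torch.cat([x, ctx], dim=1))
        rgb = torch.sigmoid(out[:, :3])
        mask = torch.sigmoid(out[:, 3:])
        # Predicted foreground where the mask fires, first-frame
        # background elsewhere.
        return mask * rgb + (1 - mask) * first_frame, mask

# Rollout: encode the first frame, predict 10 future features, decode the last.
enc, pred, dec = Encoder(), Predictor(), Decoder()
frame0 = torch.rand(2, 3, 64, 64)
feats = pred(enc(frame0), steps=10)
frame_t, mask_t = dec(feats[:, -1], frame0)
```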
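The feature-space adversarial loss can likewise be sketched, again as an illustrative assumption rather than the paper's exact formulation: a small discriminator scores sequences of high-level features, it is trained to separate encoder features of real future frames from the predictor's rollouts, and the predictor is trained to fool it.

```python
import torch
import torch.nn as nn

class FeatureDiscriminator(nn.Module):
    """Scores a sequence of high-level features as real (from the encoder)
    or fake (from the predictor's rollout)."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feat_seq):             # feat_seq: (B, T, feat_dim)
        h, _ = self.rnn(feat_seq)
        return self.score(h[:, -1])          # one logit per sequence

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(disc, real_feats, fake_feats):
    # Push encoder features toward "real" and detached rollouts toward "fake".
    real = disc(real_feats)
    fake = disc(fake_feats.detach())
    return bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))

def predictor_loss(disc, fake_feats):
    # Train the predictor so its rollouts are scored as real, keeping the
    # predicted features on the encoder's feature distribution.
    fake = disc(fake_feats)
    return bce(fake, torch.ones_like(fake))
```

Because this loss lives in feature space rather than pixel space, the predictor is penalized for drifting off the encoder's feature distribution instead of for per-pixel error, which is the usual motivation for adversarial training in a learned feature space.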