Ruben Villegas
My main research focuses on generative modeling, self-supervised learning, and multimodal learning in the video domain. I am interested in effectively incorporating the time dimension to learn more general representations, towards the goal of compositional generalization. Visit my personal website for more details and publications.
Authored Publications
Phenaki: Variable length video generation from open domain textual descriptions
Mohammad Babaeizadeh
Han Zhang
Mohammad Taghi Saffar
Santiago Castro
Julius Kunze
ICLR (2023)
Abstract
We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data, and variable length of videos. To address these issues, we introduce a new causal model for learning video representations which compresses the video to a small representation of discrete tokens. This tokenizer is auto-regressive in time, which allows it to work with variable-length videos. To generate video tokens from text, we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of prompts (i.e., time-variable text, or a story, in open domain). To the best of our knowledge, this is the first time a paper studies generating videos from time-variable prompts.
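To make the two-stage design above concrete, here is a minimal PyTorch sketch of a bidirectional masked transformer that predicts discrete video tokens conditioned on pre-computed text tokens. The module names, sizes, and masking schedule are illustrative assumptions, not the Phenaki implementation; the causal video tokenizer that produces the discrete ids is assumed to exist separately.

```python
# Minimal sketch of a MaskGIT-style masked transformer over video tokens,
# conditioned on text features. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedVideoTokenTransformer(nn.Module):
    def __init__(self, video_vocab=8192, text_dim=512, d_model=512,
                 n_layers=8, n_heads=8, max_video_tokens=1024):
        super().__init__()
        self.mask_id = video_vocab                     # extra id reserved for [MASK]
        self.tok_emb = nn.Embedding(video_vocab + 1, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_video_tokens, d_model))
        self.text_proj = nn.Linear(text_dim, d_model)  # pre-computed text tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # bidirectional attention
        self.head = nn.Linear(d_model, video_vocab)

    def forward(self, video_tokens, text_feats):
        # video_tokens: (B, N) discrete ids from the causal video tokenizer
        # text_feats:   (B, T, text_dim) pre-computed text token features
        x = self.tok_emb(video_tokens) + self.pos_emb[:, :video_tokens.size(1)]
        ctx = self.text_proj(text_feats)
        h = self.encoder(torch.cat([ctx, x], dim=1))   # no causal mask
        return self.head(h[:, ctx.size(1):])           # logits only at video positions

def masked_training_step(model, video_tokens, text_feats, mask_ratio=0.5):
    """Randomly mask video tokens and train the model to reconstruct them."""
    masked = video_tokens.clone()
    mask = torch.rand_like(video_tokens, dtype=torch.float) < mask_ratio
    masked[mask] = model.mask_id
    logits = model(masked, text_feats)
    return nn.functional.cross_entropy(logits[mask], video_tokens[mask])
```

Because the transformer attends bidirectionally over video positions, masked positions can be filled in over a few parallel refinement steps at inference time rather than strictly one token at a time.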
Abstract
Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.
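A minimal sketch of the central idea, modeling stochastic dynamics directly in keypoint-coordinate space rather than pixel space, might look as follows. The keypoint detector and the frame decoder that reconstructs images from keypoints and a reference frame are omitted, and the Gaussian-LSTM dynamics model is an illustrative assumption rather than the paper's exact architecture.

```python
# Sketch: a stochastic dynamics model over keypoint coordinates (assumed setup).
import torch
import torch.nn as nn

class KeypointDynamics(nn.Module):
    """Predicts a distribution over next-step keypoints from past keypoints."""
    def __init__(self, num_keypoints=16, hidden=128):
        super().__init__()
        self.inp = num_keypoints * 2                   # (x, y) per keypoint
        self.rnn = nn.LSTM(self.inp, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, self.inp)
        self.logvar = nn.Linear(hidden, self.inp)

    def forward(self, keypoints):
        # keypoints: (B, T, K, 2) coordinates produced by a keypoint detector
        B, T, K, _ = keypoints.shape
        h, _ = self.rnn(keypoints.reshape(B, T, K * 2))
        return self.mu(h), self.logvar(h)              # Gaussian over next keypoints

def nll_loss(model, keypoints):
    """Train by maximizing likelihood of the next keypoint set (one-step-ahead)."""
    mu, logvar = model(keypoints[:, :-1])              # predict step t+1 from steps <= t
    target = keypoints[:, 1:].flatten(2)
    dist = torch.distributions.Normal(mu, (0.5 * logvar).exp())
    return -dist.log_prob(target).mean()
```

Because errors accumulate in the low-dimensional keypoint space rather than in pixels, long rollouts stay better behaved, which is the motivation stated in the abstract.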
Abstract
Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex network architectures and highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question if such handcrafted architectures are necessary and instead propose a different approach: maximizing the capacity of a standard convolutional neural network. We perform the first large-scale empirical study of the effect of capacity on video prediction models. In our experiments, we demonstrate our results on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling first-person car driving.
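As a rough illustration of the "maximize the capacity of a standard network" recipe, the sketch below is a plain convolutional encoder, a simple recurrent latent update, and a decoder, whose size is controlled by a single width multiplier, with no optical flow, masks, or foreground/background separation. The architecture and sizes are assumptions for illustration, not the models studied in the paper.

```python
# Sketch: a standard conv encoder / recurrent update / decoder video predictor
# whose capacity is scaled by one width multiplier (illustrative assumption).
import torch
import torch.nn as nn

class PlainVideoPredictor(nn.Module):
    def __init__(self, width_mult=1, base=32):
        super().__init__()
        c = base * width_mult                          # capacity knob: widen every layer
        self.encoder = nn.Sequential(
            nn.Conv2d(3, c, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c, 2 * c, 4, stride=2, padding=1), nn.ReLU())
        # simple recurrent convolutional update over the latent feature map
        self.dynamics = nn.Conv2d(4 * c, 2 * c, 3, padding=1)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * c, c, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(c, 3, 4, stride=2, padding=1))

    def forward(self, frame, state=None):
        # frame: (B, 3, H, W); returns the predicted next frame and recurrent state
        z = self.encoder(frame)
        state = torch.zeros_like(z) if state is None else state
        state = torch.tanh(self.dynamics(torch.cat([z, state], dim=1)))
        return self.decoder(state), state
```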
Learning Latent Dynamics for Planning from Pixels
Danijar Hafner
Timothy Lillicrap
David Ha
Honglak Lee
James Davidson
International Conference on Machine Learning (2019)
Abstract
Planning has been very successful for control tasks with known environment dynamics. To leverage planning in unknown environments, the agent needs to learn the dynamics from interactions with the world. However, learning dynamics models that are accurate enough for planning has been a long-standing challenge, especially in image-based domains. We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space. To achieve high performance, the dynamics model must accurately predict the rewards ahead for multiple time steps. We approach this using a latent dynamics model with both deterministic and stochastic transition components. Moreover, we propose a multi-step variational inference objective that we name latent overshooting. Using only pixel observations, our agent solves continuous control tasks with contact dynamics, partial observability, and sparse rewards, which exceed the difficulty of tasks that were previously solved by planning with learned models. PlaNet uses substantially fewer episodes and reaches final performance close to and sometimes higher than strong model-free algorithms.
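The latent transition at the heart of this kind of agent can be sketched as a deterministic recurrent path combined with a stochastic state sampled from a learned prior. The sizes and names below are illustrative assumptions rather than the PlaNet code; the image encoder, decoder, reward head, and latent-overshooting objective are omitted.

```python
# Sketch: one step of a latent dynamics model with deterministic and stochastic
# transition components (assumed sizes, not the published implementation).
import torch
import torch.nn as nn

class LatentTransition(nn.Module):
    def __init__(self, stoch=30, deter=200, action_dim=4, hidden=200):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(stoch + action_dim, hidden), nn.ELU())
        self.rnn = nn.GRUCell(hidden, deter)            # deterministic path
        self.prior = nn.Linear(deter, 2 * stoch)        # stochastic path

    def forward(self, prev_stoch, prev_deter, action):
        # One latent step: h_t = f(h_{t-1}, s_{t-1}, a_{t-1}), then s_t ~ p(s_t | h_t)
        x = self.pre(torch.cat([prev_stoch, action], dim=-1))
        deter = self.rnn(x, prev_deter)
        mean, logstd = self.prior(deter).chunk(2, dim=-1)
        stoch = mean + logstd.exp() * torch.randn_like(mean)   # reparameterized sample
        return stoch, deter

# Rolling this model forward in latent space lets a planner (e.g. CEM) score
# candidate action sequences with a learned reward head, without decoding pixels.
```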
Abstract
Much of recent research has been devoted to video prediction and generation, yet most of the previous works have demonstrated only limited success in generating videos on short-term horizons. The hierarchical video prediction method by Villegas et al. (2017b) is an example of a state-of-the-art method for long-term video prediction, but their method is limited because it requires ground truth annotation of high-level structures (e.g., human joint landmarks) at training time. Our network encodes the input frame, predicts a high-level encoding into the future, and then a decoder with access to the first frame produces the predicted image from the predicted encoding. The decoder also produces a mask that outlines the predicted foreground object (e.g., person) as a by-product. Unlike Villegas et al. (2017b), we develop a novel training method that jointly trains the encoder, the predictor, and the decoder together without high-level supervision; we further improve upon this by using an adversarial loss in the feature space to train the predictor. Our method can predict about 20 seconds into the future and provides better results compared to Denton and Fergus (2018) and Finn et al. (2016) on the Human 3.6M dataset.
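A compact sketch of the key training idea, predicting in a learned feature space and pushing the predictor with an adversarial loss on those features, is given below. The encoder and critic are stand-ins, the first-frame decoder and the critic's own real-vs-predicted objective are omitted, and all names and sizes are assumptions for illustration.

```python
# Sketch: feature-space prediction with a regression term and an adversarial
# term from a feature-space critic (illustrative stand-ins, not the paper's code).
import torch
import torch.nn as nn

feat_dim = 128

encoder = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                        nn.Conv2d(32, feat_dim, 4, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
predictor = nn.LSTM(feat_dim, feat_dim, batch_first=True)     # predicts future encodings
feature_critic = nn.Sequential(nn.Linear(feat_dim, 128), nn.LeakyReLU(),
                               nn.Linear(128, 1))             # adversary in feature space

def predictor_losses(frames):
    # frames: (B, T, 3, 64, 64). Encode every frame, predict encodings one step ahead.
    B, T = frames.shape[:2]
    feats = encoder(frames.flatten(0, 1)).view(B, T, feat_dim)
    pred, _ = predictor(feats[:, :-1])
    target = feats[:, 1:]
    regression = (pred - target).pow(2).mean()
    # Adversarial term: predicted encodings should be indistinguishable from real ones.
    adversarial = -feature_critic(pred.reshape(-1, feat_dim)).mean()
    return regression, adversarial
```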
Abstract
We propose a hierarchical approach for making long-term predictions of future frames. To avoid the inherent compounding errors in recursive pixel-level prediction, we propose to first estimate high-level structure in the input frames, then predict how that structure evolves in the future, and finally, by observing a single frame from the past and the predicted high-level structure, construct the future frames without having to observe any of the pixel-level predictions. Long-term video prediction is difficult to perform by recurrently observing the predicted frames because the small errors in pixel space exponentially amplify as predictions are made deeper into the future. Our approach prevents pixel-level error propagation by removing the need to observe the predicted frames. Our model is built with a combination of LSTM and analogy-based encoder-decoder convolutional neural networks, which independently predict the video structure and generate the future frames, respectively. In experiments, our model is evaluated on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions, and demonstrates significantly better results than the state of the art.
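The pipeline can be sketched as follows: an LSTM rolls the high-level structure (e.g., pose landmarks) forward recursively in structure space, and each future frame is rendered from one observed frame plus the predicted structure by an analogy-style encoder-decoder, so pixel-level predictions never feed back into the model. All modules and sizes below are illustrative assumptions.

```python
# Sketch: predict structure with an LSTM, then render frames by feature-space
# analogy from one observed frame (assumed modules, not the paper's architecture).
import torch
import torch.nn as nn

pose_dim, feat = 32, 256

pose_lstm = nn.LSTM(pose_dim, pose_dim, batch_first=True)       # structure predictor
image_enc = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                          nn.Conv2d(64, feat, 4, 2, 1), nn.ReLU())
pose_enc = nn.Sequential(nn.Linear(pose_dim, feat), nn.ReLU())
decoder = nn.Sequential(nn.ConvTranspose2d(feat, 64, 4, 2, 1), nn.ReLU(),
                        nn.ConvTranspose2d(64, 3, 4, 2, 1))

def rollout_structure(observed_poses, steps):
    # Predict future high-level structure recursively in pose space only.
    poses, state = [], None
    _, state = pose_lstm(observed_poses, state)
    last = observed_poses[:, -1:]
    for _ in range(steps):
        last, state = pose_lstm(last, state)
        poses.append(last)
    return torch.cat(poses, dim=1)

def predict_future_frame(first_frame, first_pose, future_pose):
    # Analogy in feature space: image features + (future pose feat - first pose feat).
    img = image_enc(first_frame)                                 # (B, feat, H/4, W/4)
    delta = (pose_enc(future_pose) - pose_enc(first_pose))[..., None, None]
    return decoder(img + delta)
```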
Abstract
We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle the complex evolution of pixels in videos, we propose to decompose motion and content, the two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos from the KTH, Weizmann action, and UCF-101 datasets, and show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatio-temporal dynamics for pixel-level future prediction in natural videos.
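A minimal sketch of the motion/content split might look like the following: one encoder sees the most recent frame (content), another sees frame differences (motion) through a simple recurrent convolutional update, and the decoder combines both streams into the next frame. The recurrent update and all sizes are illustrative assumptions, not the exact ConvLSTM architecture used in the paper.

```python
# Sketch: separate motion and content encoders whose features are fused by a
# decoder to predict the next frame (assumed layer sizes and recurrence).
import torch
import torch.nn as nn

class MotionContentPredictor(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.content_enc = nn.Sequential(nn.Conv2d(3, c, 4, 2, 1), nn.ReLU(),
                                         nn.Conv2d(c, c, 4, 2, 1), nn.ReLU())
        self.motion_enc = nn.Sequential(nn.Conv2d(3, c, 4, 2, 1), nn.ReLU(),
                                        nn.Conv2d(c, c, 4, 2, 1), nn.ReLU())
        self.motion_rnn = nn.Conv2d(2 * c, c, 3, padding=1)   # simple recurrent conv update
        self.decoder = nn.Sequential(nn.ConvTranspose2d(2 * c, c, 4, 2, 1), nn.ReLU(),
                                     nn.ConvTranspose2d(c, 3, 4, 2, 1))

    def forward(self, frames):
        # frames: (B, T, 3, H, W). Motion path sees differences, content path the last frame.
        diffs = frames[:, 1:] - frames[:, :-1]
        state = None
        for t in range(diffs.size(1)):
            m = self.motion_enc(diffs[:, t])
            state = torch.zeros_like(m) if state is None else state
            state = torch.tanh(self.motion_rnn(torch.cat([m, state], dim=1)))
        content = self.content_enc(frames[:, -1])
        return self.decoder(torch.cat([content, state], dim=1))  # predicted next frame
```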