Dumitru Erhan
PhD in deep learning from the University of Montreal. Part of Google Brain's efforts to solve visual understanding with deep learning. Previously a scientist at Yahoo Labs.
Authored Publications
Phenaki: Variable length video generation from open domain textual descriptions
Mohammad Babaeizadeh
Han Zhang
Mohammad Taghi Saffar
Santiago Castro
Julius Kunze
ICLR (2023)
Preview abstract
We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited quantity of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new causal model for learning video representations, which compresses the video to a small representation of discrete tokens. This tokenizer is auto-regressive in time, which allows it to work with variable-length videos. To generate video tokens from text, we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of prompts (i.e., time-variable text, or a story, in an open domain). To the best of our knowledge, this is the first paper to study generating videos from time-variable prompts.
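To make the two-stage design above concrete, here is a minimal sketch of the masked-token generation step: a bidirectional transformer over video tokens, conditioned on text, is queried repeatedly, and the most confident predictions are unmasked at each step. This is an illustration only, not the Phenaki implementation; the vocabulary size, sequence length, unmasking schedule, and module names are placeholder assumptions.

```python
# Illustrative sketch of masked video-token generation (not the Phenaki code).
# A bidirectional transformer predicts all masked tokens in parallel, and a
# fixed number of refinement steps progressively "unmasks" the sequence.
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ_LEN, DIM = 8192, 8192, 256, 512  # placeholder sizes

class MaskedTokenTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)            # +1 for the [MASK] token
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_logits = nn.Linear(DIM, VOCAB)

    def forward(self, video_tokens, text_embedding):
        h = self.embed(video_tokens) + text_embedding          # condition on text
        return self.to_logits(self.encoder(h))                 # (B, SEQ_LEN, VOCAB)

@torch.no_grad()
def generate(model, text_embedding, steps=8):
    tokens = torch.full((1, SEQ_LEN), MASK_ID)                 # start fully masked
    for step in range(steps):
        logits = model(tokens, text_embedding)
        confident = logits.softmax(-1).max(-1)
        still_masked = tokens.eq(MASK_ID)
        # Unmask the most confident predictions first (schedule simplified here).
        k = max(1, int(SEQ_LEN * (step + 1) / steps)) - (~still_masked).sum().item()
        if k <= 0:
            continue
        scores = confident.values.masked_fill(~still_masked, -1.0)
        idx = scores.topk(k, dim=-1).indices
        tokens[0, idx[0]] = confident.indices[0, idx[0]]
    return tokens  # in the full model, a video tokenizer's decoder maps these to pixels

model = MaskedTokenTransformer()
text_emb = torch.zeros(1, SEQ_LEN, DIM)                        # stand-in for precomputed text tokens
video_tokens = generate(model, text_emb)
```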
Model-Based Reinforcement Learning for Atari
Blazej Osinski
Chelsea Finn
Henryk Michalewski
Konrad Czechowski
Lukasz Mieczyslaw Kaiser
Mohammad Babaeizadeh
Piotr Kozakowski
Piotr Milos
Roy H Campbell
Afroz Mohiuddin
Ryan Sepassi
Sergey Levine
NIPS'18 (2020)
Preview abstract
Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with orders of magnitude fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models, and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games and achieve competitive results with only 100K interactions between the agent and the environment (400K frames), which corresponds to about two hours of real-time play.
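The structure of such a model-based loop can be summarized in a few lines: alternate between collecting a small amount of real experience, fitting the video-prediction world model, and improving the policy entirely inside the learned model. The skeleton below uses stub functions and made-up iteration counts; it only mirrors the shape of the algorithm, not the paper's implementation or hyperparameters.

```python
# Schematic outline of a model-based RL loop in the SimPLe style
# (placeholder stubs, not the paper's implementation or hyperparameters).
import random

def collect_real_experience(policy, steps=100):
    """Roll out the current policy in the real environment (stubbed here)."""
    return [(random.random(), policy(None)) for _ in range(steps)]

def train_world_model(world_model, experience):
    """Fit the action-conditioned video-prediction model on real frames."""
    world_model["trained_on"] = world_model.get("trained_on", 0) + len(experience)
    return world_model

def train_policy_in_model(policy, world_model, rollouts=50):
    """Improve the policy purely inside the learned simulator (e.g. with a policy-gradient method)."""
    return policy  # placeholder: a real implementation would update parameters

policy = lambda obs: random.choice([0, 1, 2, 3])    # stand-in Atari policy
world_model = {}
for iteration in range(15):                          # alternate the three phases
    experience = collect_real_experience(policy)
    world_model = train_world_model(world_model, experience)
    policy = train_policy_in_model(policy, world_model)
print("real interactions used:", world_model["trained_on"])
```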
VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation
Mohammad Babaeizadeh
Chelsea Finn
Sergey Levine
Laurent Dinh
Diederik P. Kingma
ICLR (2020)
Preview abstract
Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modeling of video.
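The appeal of normalizing flows here is that the change-of-variables formula gives an exact log-likelihood, log p(x) = log p(z) + log |det dz/dx|, which can be optimized directly. The snippet below sketches a single affine coupling layer to illustrate that computation; VideoFlow's actual multi-scale architecture and latent dynamics model are far richer, and the layer sizes here are arbitrary.

```python
# Minimal affine-coupling sketch of the change-of-variables idea behind
# flow-based models (illustrative only).
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 128), nn.ReLU(),
                                 nn.Linear(128, dim))        # outputs scale and shift

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                             # keep the scale well-behaved
        z = torch.cat([x1, x2 * log_s.exp() + t], dim=-1)     # invertible transform
        log_det = log_s.sum(dim=-1)                           # log |det dz/dx|
        return z, log_det

def log_likelihood(flow, x):
    z, log_det = flow(x)
    log_prior = -0.5 * (z ** 2 + torch.log(torch.tensor(2 * torch.pi))).sum(dim=-1)
    return log_prior + log_det                                # exact log p(x)

flow = AffineCoupling(dim=64)
frames = torch.randn(8, 64)           # stand-in for flattened frame features
nll = -log_likelihood(flow, frames).mean()
nll.backward()                        # direct maximum-likelihood training signal
```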
Preview abstract
Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex network architectures and highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question whether such handcrafted architectures are necessary and instead propose a different approach: maximizing the capacity of a standard convolutional neural network. We perform the first large-scale empirical study of the effect of capacity on video prediction models. In our experiments, we demonstrate our results on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling first-person car driving.
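One way to picture the "capacity" knob studied above is as a simple width multiplier on an otherwise standard convolutional predictor. The toy sketch below only illustrates that idea; it is not the architecture or the scaling protocol used in the study.

```python
# Toy illustration of scaling a standard convolutional predictor's capacity
# with a single width multiplier (not the study's actual architecture).
import torch
import torch.nn as nn

def make_frame_predictor(width_multiplier=1, base_channels=32, layers=4):
    c = int(base_channels * width_multiplier)
    blocks = [nn.Conv2d(3, c, 3, padding=1), nn.ReLU()]
    for _ in range(layers - 1):
        blocks += [nn.Conv2d(c, c, 3, padding=1), nn.ReLU()]
    blocks += [nn.Conv2d(c, 3, 3, padding=1)]        # predict the next RGB frame
    return nn.Sequential(*blocks)

for k in (1, 2, 4):                                   # sweep capacity
    model = make_frame_predictor(width_multiplier=k)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"width x{k}: {n_params:,} parameters")
```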
Learning how to explain neural networks: PatternNet and PatternAttribution
Kristof T. Schütt
Maximilian Alber
Klaus-Robert Müller
Sven Dähne
ICLR (2018)
Preview abstract
DeConvNet, Guided BackProp, and LRP were invented to better understand deep neural networks. We show that these methods do not produce the theoretically correct explanation for a linear model. Yet they are used on multi-layer networks with millions of parameters. This is a cause for concern, since linear models are simple neural networks. We argue that explanation methods for neural nets should work reliably in the limit of simplicity, the linear model. Based on our analysis of linear models, we propose a generalization that yields two explanation techniques (PatternNet and PatternAttribution) that are theoretically sound for linear models and produce improved explanations for deep networks.
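The linear-model argument can be reproduced in a few lines: when the input mixes a signal with a distractor, the weight vector that gradient-style saliency returns must cancel the distractor and therefore need not point at the signal, whereas a covariance-based pattern does recover it. The numpy example below uses made-up two-dimensional data to illustrate this; it is a simplification of the paper's analysis, not its estimator.

```python
# Tiny numpy demonstration: the weight vector w (what a gradient/saliency map
# shows for a linear model) need not point at the signal direction, whereas the
# covariance-based "pattern" cov(x, y) does.  Made-up 2-D example.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
signal = rng.normal(size=n)                        # the quantity the model extracts
distractor = rng.normal(size=n)

a_s = np.array([1.0, 0.0])                         # signal direction in input space
a_d = np.array([1.0, 1.0])                         # distractor direction
x = np.outer(signal, a_s) + np.outer(distractor, a_d)

w = np.array([1.0, -1.0])                          # w cancels the distractor: w @ a_d = 0
y = x @ w                                          # linear "model" output equals the signal

pattern = np.cov(x.T, y)[:2, 2]                    # cov(x, y), the direction of the signal
print("weight vector w     :", w)                  # what saliency-style methods return
print("estimated pattern a :", pattern / pattern[0])   # ~ a_s = [1, 0]
```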
Preview abstract
Much of recent research has been devoted to video prediction and generation, yet most of the previous works have demonstrated only limited success in generating videos on short-term horizons. The hierarchical video prediction method by Villegas et al. (2017b) is an example of a state-of-the-art method for long-term video prediction, but their method is limited because it requires ground truth annotation of high-level structures (e.g., human joint landmarks) at training time. Our network encodes the input frame, predicts a high-level encoding into the future, and then a decoder with access to the first frame produces the predicted image from the predicted encoding. The decoder also produces a mask that outlines the predicted foreground object (e.g., person) as a by-product. Unlike Villegas et al. (2017b), we develop a novel training method that jointly trains the encoder, the predictor, and the decoder together without high-level supervision; we further improve upon this by using an adversarial loss in the feature space to train the predictor. Our method can predict about 20 seconds into the future and provides better results compared to Denton and Fergus (2018) and Finn et al. (2016) on the Human 3.6M dataset.
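A bare-bones version of the encoder / latent-predictor / first-frame-conditioned decoder pipeline is sketched below, just to show how the pieces connect. Layer sizes are arbitrary, and the mask head and the feature-space adversarial loss mentioned above are omitted; this is not the paper's network.

```python
# Structural sketch of an encoder / latent-predictor / decoder pipeline for
# long-horizon prediction (illustrative placeholders only).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.Flatten(), nn.Linear(64 * 16 * 16, 128))

    def forward(self, frame):
        return self.net(frame)                       # high-level encoding of one frame

class Predictor(nn.Module):                          # rolls the encoding forward in time
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTMCell(128, 128)

    def forward(self, z, state=None):
        h, c = self.rnn(z, state)
        return h, (h, c)

class Decoder(nn.Module):                            # first frame + predicted encoding -> image
    def __init__(self):
        super().__init__()
        self.first_frame_enc = Encoder()
        self.net = nn.Sequential(nn.Linear(128 + 128, 64 * 64 * 3), nn.Sigmoid())

    def forward(self, first_frame, z_future):
        context = self.first_frame_enc(first_frame)
        out = self.net(torch.cat([context, z_future], dim=-1))
        return out.view(-1, 3, 64, 64)

enc, pred, dec = Encoder(), Predictor(), Decoder()
frames = torch.rand(2, 3, 64, 64)                    # a batch of 64x64 input frames
z = enc(frames)
z_next, state = pred(z)                              # predict the next high-level encoding
next_frame = dec(frames, z_next)                     # decode it back to pixels
```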
Preview abstract
Predicting the future in real-world settings, particularly from raw sensory observations such as images, is exceptionally challenging. Real-world events can be stochastic and unpredictable, and the high dimensionality and complexity of natural images require the predictive model to build an intricate understanding of the natural world. Many existing methods tackle this problem by making simplifying assumptions about the environment. One common assumption is that the outcome is deterministic and there is only one plausible future, which leads to low-quality predictions in real-world settings with stochastic dynamics. In contrast, we developed a variational stochastic method for video prediction that predicts a different possible future for each sample of its latent random variables. To the best of our knowledge, our model is the first to provide effective stochastic multi-frame prediction for real-world video. We demonstrate the capability of the proposed method in predicting detailed future frames of videos on multiple real-world datasets, both action-free and action-conditioned. We find that our proposed method produces substantially improved video predictions when compared to the same model without stochasticity, and to other stochastic video prediction methods. The TensorFlow-based implementation of our method will be open sourced upon publication.
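The variational recipe itself is compact: an inference network proposes a latent from the observed clip, the predictor is conditioned on a sample of that latent, and training combines reconstruction with a KL term toward the prior. The toy code below shows only that recipe on flattened frame features, with made-up shapes and a placeholder KL weight; it is not the paper's model.

```python
# Minimal sketch of the variational idea: sample a latent z per video, predict
# frames conditioned on z, and train with reconstruction + KL terms.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyStochasticPredictor(nn.Module):
    def __init__(self, frame_dim=256, z_dim=8):
        super().__init__()
        self.infer = nn.Linear(frame_dim * 2, z_dim * 2)        # q(z | past, future)
        self.predict = nn.Linear(frame_dim + z_dim, frame_dim)

    def forward(self, past_frame, future_frame):
        stats = self.infer(torch.cat([past_frame, future_frame], dim=-1))
        mu, log_var = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterization
        pred = self.predict(torch.cat([past_frame, z], dim=-1))
        kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(dim=-1)
        return pred, kl

model = TinyStochasticPredictor()
past, future = torch.randn(4, 256), torch.randn(4, 256)         # flattened frame features
pred, kl = model(past, future)
loss = F.mse_loss(pred, future) + 1e-3 * kl.mean()               # reconstruction + beta * KL
loss.backward()
```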
Preview abstract
Estimating the influence of a given feature on a model prediction is challenging. We introduce ROAR, RemOve And Retrain, a benchmark to evaluate the accuracy of interpretability methods that estimate input feature importance in deep neural networks. We remove a fraction of input features deemed to be most important according to each estimator and measure the change to the model accuracy upon retraining. The most accurate estimator will identify inputs as important whose removal causes the most damage to model performance relative to all other estimators. This evaluation produces thought-provoking results -- we find that several estimators are less accurate than a random assignment of feature importance. However, averaging a set of squared noisy estimators (a variant of a technique proposed by Smilkov et al. (2017)) leads to significant gains in accuracy for each method considered and far outperforms such a random guess.
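The benchmark's core loop is easy to state: rank features with each estimator, replace the top-ranked fraction with an uninformative value, retrain from scratch, and compare the resulting test accuracy against the random-ranking baseline. The skeleton below stubs out training and uses placeholder estimators purely to show that control flow; it is not the paper's evaluation code.

```python
# Skeleton of a remove-and-retrain evaluation: for each importance estimator
# and each removal fraction, blank out the top-ranked input features, retrain
# the model from scratch, and record the test accuracy.  Stubs throughout.
import numpy as np

def remove_top_features(x, importance, fraction):
    """Replace the most-important features (per example) with the per-example mean."""
    k = int(x.shape[1] * fraction)
    idx = np.argsort(-importance, axis=1)[:, :k]         # indices of top-k features
    x_out = x.copy()
    rows = np.arange(x.shape[0])[:, None]
    x_out[rows, idx] = x.mean(axis=1, keepdims=True)
    return x_out

def train_and_evaluate(x_train, y_train, x_test, y_test):
    return np.random.uniform(0.5, 0.9)                   # stub: retrain and report accuracy

rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(512, 784)), rng.integers(0, 10, 512)
x_test, y_test = rng.normal(size=(128, 784)), rng.integers(0, 10, 128)

estimators = {
    "random": lambda x: rng.random(x.shape),              # the baseline to beat
    "saliency_stub": lambda x: np.abs(x),                 # placeholder importance estimate
}
for name, estimator in estimators.items():
    for fraction in (0.1, 0.3, 0.5, 0.7, 0.9):
        x_tr = remove_top_features(x_train, estimator(x_train), fraction)
        x_te = remove_top_features(x_test, estimator(x_test), fraction)
        acc = train_and_evaluate(x_tr, y_train, x_te, y_test)
        print(f"{name:>14} removed {fraction:.0%}: accuracy {acc:.3f}")
```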
The (Un)reliability of Saliency methods
Sara Hooker
Julius Adebayo
Maximilian Alber
Kristof T. Schütt
Sven Dähne
NIPS Workshop (2017)
Preview abstract
Saliency methods aim to explain the predictions of deep neural networks. These methods lack reliability when the explanation is sensitive to factors that do not contribute to the model prediction. We use a simple and common pre-processing step (adding a constant shift to the input data) to show that a transformation with no effect on the model can cause numerous methods to attribute incorrectly. In order to guarantee reliability, we posit that methods should fulfill input invariance, the requirement that a saliency method mirror the sensitivity of the model with respect to transformations of the input. We show, through several examples, that saliency methods that do not satisfy input invariance result in misleading attribution.
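The mean-shift test is simple enough to verify by hand for a linear model: shift the inputs by a constant and absorb the shift into the bias so every prediction is unchanged, then ask whether the attribution changes. In the toy numpy check below (made-up weights, purely illustrative), the plain gradient is input-invariant while gradient times input is not, which is the kind of failure the paper documents.

```python
# Small numpy check of "input invariance": shift the data by a constant and
# compensate in the model's bias so predictions are unchanged, then compare
# attributions for a linear model.
import numpy as np

w, b = np.array([2.0, -1.0, 0.5]), 0.0
x = np.array([1.0, 3.0, -2.0])
shift = np.array([10.0, 10.0, 10.0])                  # constant shift of the input

x2 = x + shift
b2 = b - w @ shift                                     # adjust the bias: predictions match
assert np.isclose(w @ x + b, w @ x2 + b2)

gradient = w                                           # gradient attribution: unchanged
grad_times_input = w * x                               # before the shift
grad_times_input_shifted = w * x2                      # after the shift: a different answer
print("gradient (both models):", gradient)
print("grad*input, original  :", grad_times_input)
print("grad*input, shifted   :", grad_times_input_shifted)
```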
Preview abstract
Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images often fail to generalize to real images. To address this shortcoming, prior work introduced unsupervised domain adaptation algorithms that attempt to map representations between the two domains or learn to extract features that are domain-invariant. In this work, we present a new approach that learns, in an unsupervised manner, a transformation in the pixel space from one domain to the other. Our generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain. Our approach not only produces plausible samples, but also outperforms the state-of-the-art on a number of unsupervised domain adaptation scenarios by large margins. Finally, we demonstrate that the adaptation process generalizes to object classes unseen during training.
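At a high level, the training signal combines three pieces: a discriminator that separates real target images from adapted source images, an adversarial term that pushes the generator to fool it, and a task classifier trained on the adapted images using the source labels that come for free. The snippet below wires those losses together with toy networks; the paper's architectures, additional loss terms, and hyperparameters are replaced with placeholders.

```python
# Loss wiring for a pixel-space domain-adaptation GAN, sketched at a high level
# (toy networks and weights; not the paper's architecture or hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
discriminator = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                              nn.Flatten(), nn.LazyLinear(1))
classifier = nn.Sequential(nn.Flatten(), nn.LazyLinear(10))

source, labels = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))   # rendered images
target = torch.rand(4, 3, 32, 32)                                        # unlabeled real images

adapted = generator(source)                          # source image restyled toward the target

# Discriminator: real target images vs. adapted source images.
d_loss = (F.binary_cross_entropy_with_logits(discriminator(target), torch.ones(4, 1)) +
          F.binary_cross_entropy_with_logits(discriminator(adapted.detach()), torch.zeros(4, 1)))

# Generator: fool the discriminator while keeping the image useful for the task.
g_adv = F.binary_cross_entropy_with_logits(discriminator(adapted), torch.ones(4, 1))
task_loss = F.cross_entropy(classifier(adapted), labels)                 # source labels reused
g_loss = g_adv + task_loss
print(float(d_loss), float(g_loss))
```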