Kevin P. Murphy
Spae: Semantic pyramid autoencoder for multimodal generation with frozen LLMs
Lijun Yu
Zhiruo Wang
Yonatan Bisk
Alex Hauptmann
Lu Jiang
NeurIPS (2023)
Preview abstract
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
Plex: Towards Reliability using Pretrained Large Model Extensions
Du Phan
Mark Patrick Collier
Zi Wang
Zelda Mariet
Clara Huiyi Hu
Neil Band
Tim G. J. Rudner
Joost van Amersfoort
Andreas Christian Kirsch
Rodolphe Jenatton
Honglin Yuan
Kelly Buchanan
Yarin Gal
ICML 2022 Pre-training Workshop (2022)
Preview abstract
A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 11 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large-model extensions (henceforth abbreviated as plex) for vision and language modalities. Plex greatly improves the state-of-the-art across tasks, and as a pretrained model Plex unifies the traditional protocol of designing and tuning one model for each reliability task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on new tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding.
COVID-19 Open-Data: a global-scale spatially granular meta-dataset for coronavirus disease
Oscar Wahltinez
Aurora Cheung
Ruth Alcantara
Donny Cheung
Mayank Daswani
Anthony Erlinger
Matt Lee
Pranali Yawalkar
Paula Lê
Ofir Picazo Navarro
Scientific Data (2022)
Preview abstract
This paper introduces the COVID-19 Open Dataset (COD), available at goo.gle/covid-19-open-data. A static copy of the dataset is also available at https://doi.org/10.6084/m9.figshare.c.5399355. This is a very large “meta-dataset” of COVID-related data, containing epidemiological information, from 22,579 unique locations within 232 different countries and independent territories. For 62 of these countries we have state-level data, and for 23 of these countries we have county-level data. For 15 countries, COD includes cases and deaths stratified by age or sex. COD also contains information on hospitalizations, vaccinations, and other relevant factors such as mobility, non-pharmaceutical interventions and static demographic attributes. Each location is tagged with a unique identifier so that these different types of information can be easily combined. The data is automatically extracted from 121 different authoritative sources, using scalable open source software. This paper describes the format and construction of the dataset, and includes a preliminary statistical analysis of its content, revealing some interesting patterns.
Machine Learning on Graphs: A Model and Comprehensive Taxonomy
Ines Chami
Sami Abu-El-Haija
Chris Ré
Journal of Machine Learning Research, vol. 23 (2022), pp. 1-64
Preview abstract
There has been a surge of recent interest in graph representation learning (GRL). GRL methods have generally fallen into three main categories, based on the availability of labeled data. The first, network embedding, focuses on learning unsupervised representations of relational structure. The second, graph regularized neural networks, leverages graphs to augment neural network losses with a regularization objective for semi-supervised learning. The third, graph neural networks, aims to learn differentiable functions over discrete topologies with arbitrary structure. However, despite the popularity of these areas there has been surprisingly little work on unifying the three paradigms. Here, we aim to bridge the gap between network embedding, graph regularization and graph neural networks. We propose a comprehensive taxonomy of GRL methods, aiming to unify several disparate bodies of work. Specifically, we propose the GraphEDM framework, which generalizes popular algorithms for semi-supervised learning (e.g. GraphSage, GCN, GAT), and unsupervised learning (e.g. DeepWalk, node2vec) of graph representations into a single consistent approach. To illustrate the generality of GraphEDM, we fit over thirty existing methods into this framework. We believe that this unifying view both provides a solid foundation for understanding the intuition behind these methods, and enables future research in the area.
Preview abstract
Digital contact tracing apps for COVID, such as the one developed by Google and Apple, need to estimate the risk that a user was infected during a particular exposure, in order to decide whether to notify the user to take precautions, such as entering into quarantine, or requesting a test. Such risk score models contain numerous parameters that must be set by the public health authority. In this paper, we show how to automatically learn these parameters from data.
Our method needs access to exposure and outcome data. Although this data is already being collected (in an aggregated, privacy-preserving way) by several health authorities, in this paper we limit ourselves to simulated data, so that we can systematically study the different factors that affect the feasibility of the approach. In particular, we show that the parameters become harder to estimate when there is more missing data (e.g., due to infections which were not recorded by the app), and when there is model misspecification. Nevertheless, the learning approach outperforms a strong manually designed baseline. Furthermore, the learning approach can adapt even when the risk factors of the disease change, e.g., due to the evolution of new variants, or the adoption of vaccines.
Preview abstract
This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes. We make two main contributions. The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real world trajectory data, and then extrapolated by human annotators to achieve different latent goals. This provides the first benchmark for quantitative evaluation of the models to predict multi-future trajectories. The second contribution is a new model to generate multiple plausible future trajectories, which contains novel designs of using multi-scale location encodings and convolutional RNNs over graphs. We refer to our model as Multiverse. We show that our model achieves the best results on our dataset, as well as on the real-world VIRAT/ActEV dataset (which just contains one possible future).
Population Based Optimization for Biological Sequence Design
Zelda Mariet
David Martin Dohan
ICML 2020 (2020)
Preview abstract
The use of black-box optimization for the design of new biological sequences is an emerging research area with potentially revolutionary impact. The cost and latency of wet-lab experiments require methods that find good sequences in few experimental rounds of large batches of sequences, a setting that off-the-shelf black-box optimization methods are ill-equipped to handle. We find that the performance of existing methods varies drastically across optimization tasks, posing a significant obstacle to real-world applications. To improve robustness, we propose population-based optimization (PBO), which generates batches of sequences by sampling from an ensemble of methods. The number of sequences sampled from any method is proportional to the quality of sequences it previously proposed, allowing PBO to combine the strengths of individual methods while hedging against their innate brittleness. Adapting the population of methods online using evolutionary optimization further improves performance. Through extensive experiments on in-silico optimization tasks, we show that PBO outperforms any single method in its population, proposing both higher quality single sequences as well as more diverse batches. By its robustness and ability to design diverse, high-quality sequences, PBO is shown to be a new state-of-the-art approach to the batched black-box optimization of biological sequences.
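The proportional allocation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `allocate_batch`, the method names, and the use of a largest-remainder rounding rule are all assumptions made for illustration, and the quality scores are assumed to be non-negative.

```python
def allocate_batch(quality, batch_size):
    """Split a batch across methods in proportion to their past quality.

    `quality` maps a (hypothetical) method name to a non-negative score
    summarizing the sequences it proposed in earlier rounds.
    """
    total = sum(quality.values())
    if total <= 0:
        # No method has a track record yet: fall back to an even split.
        shares = {name: batch_size / len(quality) for name in quality}
    else:
        shares = {name: batch_size * q / total for name, q in quality.items()}

    # Round down, then hand the leftover slots to the largest remainders
    # so that the counts always sum to batch_size.
    counts = {name: int(share) for name, share in shares.items()}
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(
        quality, key=lambda name: shares[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical scores for three hypothetical optimizers.
    print(allocate_batch({"evolution": 3.0, "model_based": 1.0, "random": 0.0}, 8))
```

In this sketch a method whose past proposals scored zero receives no slots in the next batch; a practical system would likely keep a small exploration share for every method, which is a design choice the abstract does not specify.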
Model-Based Reinforcement Learning for Biological Sequence Design
David Dohan
Ramya Deshpande
ICLR 2020 (2020)
Preview abstract
Being able to design biological sequences like DNA or proteins to have desired properties would have considerable impact in medical and industrial applications. However, doing so presents a challenging black-box optimization problem that requires multiple rounds of expensive, time-consuming experiments. In response, we propose using reinforcement learning (RL) for biological sequence design. RL is a flexible framework that allows us to optimize generative sequence policies to achieve a variety of criteria, including diversity among high-quality sequences discovered. We use model-based RL to improve sample efficiency, where at each round the policy is trained offline using a simulator fit on functional measurements from prior rounds. To accommodate the growing number of observations across rounds, the simulator model is automatically selected at each round from a pool of diverse models of varying capacity. On the tasks of designing DNA transcription factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structures, we find that model-based RL is an attractive alternative to existing methods.
Preview abstract
Resampling is a key component of sample-based recursive state estimation in particle filters. Recent work explores differentiable particle filters for end-to-end learning. However, resampling remains a challenge in these works, as it is inherently non-differentiable. We address this challenge by replacing traditional resampling with a learned neural network resampler. We present a novel network architecture, the particle transformer, and train it for particle resampling using a likelihood-based loss function over sets of particles. Incorporated into a differentiable particle filter, our model can be end-to-end optimized jointly with the other particle filter components via gradient descent. Our results show that our learned resampler outperforms traditional resampling techniques on synthetic data and in a simulated robot localization task.
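For context, the traditional resampling step that this abstract proposes to replace can be sketched as follows. This is a generic multinomial resampler, not the learned particle transformer; the function name and the list-based particle representation are assumptions for illustration.

```python
import random


def multinomial_resample(particles, weights, rng):
    """Draw len(particles) particles with replacement, proportional to weight.

    The discrete draw below is the step that blocks gradients: the output
    is a selection of existing particles, not a smooth function of the
    weights.
    """
    n = len(particles)
    resampled = rng.choices(particles, weights=weights, k=n)
    # After resampling, every surviving particle carries equal weight.
    uniform_weights = [1.0 / n] * n
    return resampled, uniform_weights


if __name__ == "__main__":
    rng = random.Random(0)
    particles = [0.1, 0.5, 0.9]   # hypothetical 1-D state hypotheses
    weights = [0.05, 0.9, 0.05]   # normalized importance weights
    print(multinomial_resample(particles, weights, rng))
```

Because the selected indices change discontinuously with the weights, this step cannot be trained by gradient descent as written, which is the difficulty the abstract refers to.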
Preview abstract
Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.
Preview abstract
In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image. For example, we may present an image of the Eiffel tower, and ask the system to find images which are visually similar but are modified in small ways, such as being taken at nighttime instead of during the day. To tackle this task, we learn a similarity metric between a target image and a source image plus source text, using an embedding and composing function such that the target image feature is close to the source image plus text composition feature. We propose a new way to combine image and text using such a function, designed for the retrieval task. We show this outperforms existing approaches on 3 different datasets, namely Fashion-200k, MIT-States and a new synthetic dataset we create based on CLEVR. We also show that our approach can be used to classify input queries, in addition to image retrieval.
Floors are flat: Leveraging Semantics for Reliable and Real-Time Surface Normal Prediction
Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
Preview abstract
We propose 4 insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the “ground truth” surface normals in the training set to ensure consistency with the semantic labels; (2) concurrently train on a mix of real and synthetic data, instead of pretraining on synthetic and finetuning on real; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved state of the art results on several datasets, using a model that runs at 12 fps on a standard mobile phone.
Relational Action Forecasting
Abhinav Shrivastava
Carl Martin Vondrick
CVPR 2019
Preview abstract
This paper focuses on multi-person action forecasting in videos. More precisely, given a history of H previous frames, the goal is to detect actors and to predict their future actions for the next T frames. Our approach jointly models temporal and spatial interactions among different actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes. Our method learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling us to tackle challenging visual data. We refer to our model as Discriminative Relational Recurrent Network (DRRN). Evaluation of action prediction on AVA demonstrates the effectiveness of our proposed method compared to simpler baselines. Furthermore, we significantly improve performance on the task of early action classification on J-HMDB, from the previous SOTA of 48% to 60%.
Biological Sequences Design using Batched Bayesian Optimization
Zelda Mariet
Ramya Deshpande
David Dohan
Olivier Chapelle
NeurIPS workshop on Bayesian Deep Learning (2019)
Preview abstract
Being able to effectively design biological sequences like DNA and proteins would have transformative impact on medicine. Currently, the most popular method in the life sciences for performing design is directed evolution, which explores sequence space by making small mutations to existing sequences. Alternatively, Bayesian optimization (BO) provides an attractive framework for model-based black-box optimization, and has achieved many recent successes in life sciences applications. However, within the ML community, most large-scale BO efforts have focused on hyper-parameter tuning. These methods often do not translate to biological sequence design, where the search space is over a discrete alphabet, wet-lab experiments are run with considerable parallelism (1K-100K sequences per batch), and experiments are sufficiently slow and expensive that only a few rounds of experiments are feasible. This paper discusses the particularities of batched BO on a large discrete space, and investigates the design choices that must be made in order to obtain robust, scalable, and experimentally successful models within this unique context.
Modeling Uncertainty with Hedged Instance Embedding
Seong Joon Oh
Jiyan Pan
ICLR 2019 (2019)
Preview abstract
Instance embeddings are an efficient and versatile image representation that facilitates applications like recognition, verification, retrieval, and clustering. Many metric learning methods represent the input as a single point in the embedding space. Often the distance between points is used as a proxy for match confidence. However, this can fail to represent uncertainty which can arise when the input is ambiguous, e.g., due to occlusion or blurriness. This work addresses this issue and explicitly models the uncertainty by “hedging” the location of each input in the embedding space. We introduce the hedged instance embedding (HIB) in which embeddings are modeled as random variables and the model is trained under the variational information bottleneck principle (Alemi et al., 2016; Achille & Soatto, 2018). Empirical results on our new N-digit MNIST dataset show that our method leads to the desired behavior of “hedging its bets” across the embedding space upon encountering ambiguous inputs. This results in improved performance for image matching and classification tasks, more structure in the learned embedding space, and an ability to compute a per-exemplar uncertainty measure which is correlated with downstream performance.
Preview abstract
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use VideoBERT in numerous tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.
Preview abstract
We present a method that learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents. Our method is based on a graph-structured variational recurrent neural network (Graph-VRNN), which is trained end-to-end to infer the current state of the (partially observed) world, as well as to forecast future states. We show that our method outperforms various baselines on two sports datasets, one based on real basketball trajectories, and one generated by a soccer game engine.
Preview abstract
Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future. In this paper, we propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos. Our Parts, Structure, and Dynamics (PSD) model learns to, first, recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.
Fixing a Broken ELBO
Ben Poole
Josh Dillon
Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholmsmässan, Stockholm Sweden (2018), pp. 159-168
Preview abstract
Recent work in unsupervised representation learning has focused on learning deep directed latent variable models. Fitting these models by maximizing the marginal likelihood or evidence is typically intractable, thus a common approximation is to maximize the evidence lower bound (ELBO) instead. However, maximum likelihood training (whether exact or approximate) does not necessarily result in a good latent representation, as we demonstrate both theoretically and empirically. In particular, we derive variational lower and upper bounds on the mutual information between the input and the latent variable, and use these bounds to derive a rate-distortion curve that characterizes the tradeoff between compression and reconstruction accuracy. Using this framework, we demonstrate that there is a family of models with identical ELBO, but different quantitative and qualitative characteristics. Our framework also suggests a simple new method to ensure that latent variable models with powerful stochastic decoders do not ignore their latent code.
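The quantities named in this abstract can be written out as a short worked summary. The notation is an assumption chosen for illustration: $e(z \mid x)$ is the encoder, $d(x \mid z)$ the decoder, $m(z)$ the marginal (prior) over latents, $p^*(x)$ the data distribution, and $H$ the data entropy.

```latex
\begin{align*}
D &= -\,\mathbb{E}_{p^*(x)}\,\mathbb{E}_{e(z \mid x)}\big[\log d(x \mid z)\big]
   && \text{(distortion: reconstruction cost)} \\
R &= \mathbb{E}_{p^*(x)}\,\mathrm{KL}\big[e(z \mid x)\,\|\,m(z)\big]
   && \text{(rate: cost of encoding } z\text{)} \\
H - D &\le I_e(X; Z) \le R
   && \text{(variational bounds on mutual information)} \\
\mathrm{ELBO} &= -(D + R)
\end{align*}
```

Under this reading, every point on a line $D + R = c$ has the same ELBO, so two models can match on ELBO while one sits at low rate (the latent code is nearly ignored) and the other at low distortion (the latent code carries most of the information).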
Actor-Centric Relation Network
Abhinav Shrivastava
Carl Martin Vondrick
ECCV 2018
Preview abstract
Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.
Preview abstract
Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist, including spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfitting. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
PersonLab: Person Pose Estimation and Instance Segmentation with a Part-Based Geometric Embedding Model
George Papandreou
Liang-chieh Chen
Spyros Gidaris
ECCV (2018)
Preview abstract
We present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model. The proposed PersonLab model tackles both semantic-level reasoning and object-part associations using part-based modeling. Our model employs a convolutional network which learns to detect individual keypoints and predict their relative displacements, allowing us to group keypoints into person pose instances. Further, we propose a part-induced geometric embedding descriptor which allows us to associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations. Our system is based on a fully-convolutional architecture and allows for efficient inference, with runtime essentially independent of the number of people present in the scene. Trained on COCO data alone, our system achieves COCO test-dev keypoint average precision of 0.665 using single-scale inference and 0.687 using multi-scale inference, significantly outperforming all previous bottom-up pose estimation systems. We are also the first bottom-up method to report competitive results for the person class in the COCO instance segmentation task, achieving a person category average precision of 0.417.
Preview abstract
We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision. We leverage the natural temporal coherency of color to create a model that learns to colorize gray-scale videos by copying colors from a reference frame. Quantitative and qualitative experiments suggest that this task causes the model to automatically learn to track visual regions. Although the model is trained without any ground-truth labels, our method learns to track well enough to outperform optical flow based methods. Finally, our results suggest that failures to track are correlated with failures to colorize, indicating that advancing video colorization may further improve self-supervised visual tracking.
Progressive Neural Architecture Search
Chenxi Liu
Barret Zoph
Maxim Neumann
Jonathon Shlens
Wei Hua
Jia Li
Fei-Fei Li
Alan Yuille
ECCV (2018)
Preview abstract
We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Direct comparison under the same search space shows that our method is up to 5 times more efficient than the RL method of Zoph et al. (2018) in terms of number of models evaluated, and 8 times faster in terms of total compute. The structures we discover in this way achieve state of the art classification accuracies on CIFAR-10 and ImageNet.
PixColor: Pixel Recursive Colorization
Ryan Dahl
Mohammad Norouzi
Jonathon Shlens
Proceedings of the 28th British Machine Vision Conference (BMVC) (2017)
Preview abstract
We propose a novel approach to automatically produce multiple colorized versions of a grayscale image. Our method results from the observation that the task of automated colorization is relatively easy given a low-resolution version of the color image. We first train a conditional PixelCNN to generate a low resolution color for a given grayscale image. Then, given the generated low-resolution color image and the original grayscale image as inputs, we train a second CNN to generate a high-resolution colorization of an image. We demonstrate that our approach produces more diverse and plausible colorizations than existing methods, as judged by human raters in a "Visual Turing Test".
Preview abstract
Learning the representation and the similarity metric in an end-to-end fashion with deep networks has demonstrated outstanding results for clustering and retrieval. However, these recent approaches still suffer from performance degradation stemming from the local metric training procedure, which is unaware of the global structure of the embedding space.
We propose a global metric learning scheme for optimizing the deep metric embedding with the learnable clustering function and the clustering metric (NMI) in a novel structured prediction framework.
Our experiments on CUB200-2011, Cars196, and Stanford online products datasets show state of the art performance both on the clustering and retrieval tasks measured in the NMI and Recall@K evaluation metrics.
XGAN: Unsupervised Image-to-Image Translation for many-to-many Mappings
Amelie Royer
Stephan Gouws
Fred Bertsch
ICML Workshop (2017)
Preview abstract
Style transfer usually refers to the task of applying color and texture information from a specific style image to a given content image while preserving the structure of the latter. Here we tackle the more generic problem of semantic style transfer: given two unpaired collections of images, we aim to learn a mapping between the corpus-level style of each collection, while preserving semantic content shared across the two domains. We introduce XGAN ("Cross-GAN"), a dual adversarial autoencoder, which captures a shared representation of the common domain semantic content in an unsupervised way, while jointly learning the domain-to-domain image translations in both directions. We exploit ideas from the domain adaptation literature and define a semantic consistency loss which encourages the model to preserve semantics in the learned embedding space. We report promising qualitative results for the task of face-to-cartoon translation. The cartoon dataset we collected for this purpose is in the process of being released as a new benchmark for semantic style transfer.
Towards Accurate Multi-person Pose Estimation in the Wild
George Papandreou
Nori Kanazawa
Alexander Toshev
CVPR (2017)
Preview abstract
We propose a method for multi-person detection and 2-D keypoint localization (human pose estimation) that achieves state-of-the-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages.
In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector with an Inception-ResNet architecture. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring.
Our final system achieves average precision of 0.636 on the COCO test-dev set and 0.628 on the test-standard set, outperforming the CMU-Pose winner of the 2016 COCO keypoints challenge. Further, by using additional labeled data we obtain an even higher average precision of 0.668 on the test-dev set and 0.658 on the test-standard set, thus achieving a roughly 10% improvement over the previous best performing method on the same challenge.
View details
Attention-based Extraction of Structured Information from Street View Imagery
Zbigniew Wojna
Alex Gorban
Dar-Shyang Lee
Qian Yu
Julian Ibarz
ICDAR (2017), pp. 8
Preview abstract
We present a neural network model, based on CNNs, RNNs and attention mechanisms, which achieves 84.04% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly outperforming the previous state of the art (Smith’16), which achieved 72.46%. Furthermore, our new method is much simpler and more general than the previous approach. To demonstrate the generality of our model, we also apply it to two datasets, derived from Google Street View, in which the goal is to extract business names from store fronts, and extract structured date/time information from parking signs. Finally, we study the speed/accuracy tradeoff that results from cutting pretrained inception CNNs at different depths and using them as feature extractors for the attention mechanism. The resulting model is not only accurate but efficient, allowing it to be used at scale on a variety of challenging real-world text extraction problems.
View details
Preview abstract
We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method "Deep Variational Information Bottleneck", or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.
View details
Preview abstract
We describe a model to induce discriminative image captions based only on generative ground-truth training data. For example, given images and descriptions of “zebras” and “horses”, our system can generate discriminative language that describes the zebra images while capturing the differences with the “horse” images. Producing discriminative language is a foundational problem in the study of pragmatic behavior: Humans can effortlessly repurpose language for being persuasive and effective in communication. We first propose a novel inference procedure based on a reflex speaker and an introspector to induce discrimination between concepts. Intuitively, the reflex speaker models a good utterance for some concept (“zebra”), while the introspector models how discriminative the sentence is between the concepts (“zebra” and “horse”). Unlike previous approaches, the form of our listener has the attractive property of being amenable to joint approximate inference to select utterances that satisfy both the speaker and the introspector, yielding an introspective speaker. We apply our introspective speaker to the CUB-Text dataset to describe why an image contains a particular bird category as opposed to some other closely related bird category and to the MS COCO dataset to generate language that points to one out of two semantically similar images. Evaluations with discriminative ground truth collected on CUB and with humans on MS COCO reveal that our approach outperforms baseline approaches for discrimination. We then draw qualitative insights from our model outputs which suggest that in some cases one may interpret the introspective speaker outputs to be lies in service of the higher goal of discrimination.
View details
Speed and accuracy trade-offs for modern convolutional object detectors
Anoop Korattikara
Menglong Zhu
Vivek Rathod
Zbigniew Wojna
CVPR 2017, Honolulu, Hawaii (2017)
Preview abstract
The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN, R-FCN and SSD systems, which we view as "meta-architectures" and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that runs at over 50 frames per second and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
View details
Detecting Events and Key Actors in Multi-Person Videos
Vignesh Ramanathan
Alexander Gorban
Li Fei-Fei
Computer Vision and Pattern Recognition (CVPR) (2016)
Preview abstract
Multi-person event recognition is a challenging task, often with many people active in the scene but only a small subset contributing to an actual event. In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event. Our model does not use explicit annotations regarding who or where those people are during training and testing. In particular, we track people in videos and use a recurrent neural network (RNN) to represent the track features. We learn time-varying attention weights to combine these features at each time-instant. The attended features are then processed using another RNN for event detection/classification. Since most video datasets with multiple people are restricted to a small number of videos, we also collected a new basketball dataset comprising 257 basketball games with 14K event annotations corresponding to 11 event classes. Our model outperforms state-of-the-art methods for both event classification and detection on this new dataset. Additionally, we show that the attention mechanism is able to consistently localize the relevant players.
View details
G-RMI Object Detection
Anoop Korattikara
Menglong Zhu
Vivek Rathod
Zbigniew Wojna
2nd ImageNet and COCO Visual Recognition Challenges Joint Workshop, Amsterdam (2016)
Preview abstract
We present our submission to the COCO 2016 Object Detection challenge.
View details
Generation and Comprehension of Unambiguous Object Descriptions
Junhua Mao
Alexander Toshev
Oana Camburu
Computer Vision and Pattern Recognition (2016)
Preview abstract
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MSCOCO. We have released the dataset and a toolbox for visualization and evaluation; see https://github.com/mjhucla/Google_Refexp_toolbox.
View details
Bayesian Dark Knowledge
Anoop Korattikara
Vivek Rathod
Max Welling
Advances in Neural Information Processing Systems (2015)
Preview abstract
We consider the problem of Bayesian parameter estimation for deep neural networks, which is important in problem settings where we may have little data, and/or where we need accurate posterior predictive densities, e.g., for applications involving bandits or active learning. One simple approach to this is to use online Monte Carlo methods, such as SGLD (stochastic gradient Langevin dynamics). Unfortunately, such a method needs to store many copies of the parameters (which wastes memory), and needs to make predictions using many versions of the model (which wastes time).
We describe a method for "distilling" a Monte Carlo approximation to the posterior predictive density into a more compact form, namely a single deep neural network. We compare to two very recent approaches to Bayesian neural networks, namely an approach based on expectation propagation [Hernandez-Lobato and Adams, 2015] and an approach based on variational Bayes [Blundell et al., 2015]. Our method performs better than both of these, is much simpler to implement, and uses less computation at test time.
View details
Im2Calories: towards an automated mobile vision food diary
Austin Myers
Vivek Rathod
Anoop Korattikara
Alex Gorban
Nathan Silberman
George Papandreou
ICCV (2015)
Preview abstract
We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories. The simplest version assumes that the user is eating at a restaurant for which we know the menu. In this case, we can collect images offline to train a multi-label classifier. At run time, we apply the classifier (running on your phone) to predict which foods are present in your meal, and we look up the corresponding nutritional facts. We apply this method to a new dataset of images from 23 different restaurants, using a CNN-based classifier, significantly outperforming previous work. The more challenging setting works outside of restaurants. In this case, we need to estimate the size of the foods, as well as their labels. This requires solving segmentation and depth / volume estimation from a single image. We present CNN-based approaches to these problems, with promising preliminary results.
View details
What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision
Jonathan Malmaud
Vivek Rathod
Andrew Rabinovich
North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015) (to appear)
Preview abstract
We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
View details
Large-Scale Object Classification Using Label Relation Graphs
Jia Deng
Yangqing Jia
Andrea Frome
Samy Bengio
Yuan Li
European Conference on Computer Vision (2014)
Preview abstract
In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.
View details
Preview abstract
Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. For example, over 70% of people included in Freebase have no known place of birth, and 99% have no known ethnicity. In this paper, we propose a way to leverage existing Web-search-based question-answering technology to fill in the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, we learn the best set of queries to ask, such that the answer snippets returned by the search engine are most likely to contain the correct value for that attribute. For example, if we want to find Frank Zappa's mother, we could ask the query "who is the mother of Frank Zappa". However, this is likely to return "The Mothers of Invention", which was the name of his band. Our system learns that it should (in this case) add disambiguating terms, such as Zappa's place of birth, in order to make it more likely that the search results contain snippets mentioning his mother. Our system also learns how many different queries to ask for each attribute, since in some cases, asking too many can hurt accuracy (by introducing false positives). We discuss how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values for each attribute. Finally, we evaluate our system and show that it is able to extract a large number of facts with high confidence.
View details
Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion
Xin Luna Dong
Evgeniy Gabrilovich
Geremy Heitz
Wilko Horn
Ni Lao
Thomas Strohmann
Shaohua Sun
Wei Zhang
The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA - August 24 - 27, 2014, pp. 601-610
Preview abstract
Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
View details
Machine learning: a probabilistic perspective
MIT Press, Cambridge, MA (2012)
Preview abstract
Today’s Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, using a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package, PMTK (probabilistic modeling toolkit), that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
View details
Bayesian structure learning using dynamic programming and MCMC
A Stick-Breaking Likelihood for Categorical Data Analysis with Latent Gaussian Models
Mohammad Emtiyaz Khan
Shakir Mohamed
Benjamin M. Marlin
Journal of Machine Learning Research - Proceedings Track, vol. 22 (2012), pp. 610-618
Group Sparse Priors for Covariance Estimation
Identifying players in broadcast sports videos using conditional random fields
Multiscale Conditional Random Fields for Semi-supervised Labeling and Classification
Piecewise Bounds for Estimating Bernoulli-Logistic Latent Gaussian Models
Variational bounds for mixed-data factor analysis
Mohammad Emtiyaz Khan
Benjamin M. Marlin
Guillaume Bouchard
NIPS (2010), pp. 1108-1116
Time-Bounded Sequential Parameter Optimization
Convex Structure Learning in Log-Linear Models: Beyond Pairwise Potentials
Mark W. Schmidt
Journal of Machine Learning Research - Proceedings Track, vol. 9 (2010), pp. 709-716
Review of "Probabilistic graphical models" by Koller and Friedman
Artif. Intell., vol. 174 (2010), pp. 145-146
Using the forest to see the trees: exploiting context for visual object detection and localization
Causal learning without DAGs
David K. Duvenaud
Daniel Eaton
Mark W. Schmidt
Journal of Machine Learning Research - Proceedings Track, vol. 6 (2010), pp. 177-190
SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors
Rodrigo Goya
Mark G. F. Sun
Ryan D. Morin
Gillian Leung
Gavin Ha
Kimberley C. Wiegand
Janine Senz
Anamaria Crisan
Marco A. Marra
Martin Hirst
David G. Huntsman
Sam Aparicio
Sohrab P. Shah
Bioinformatics, vol. 26 (2010), pp. 730-736
Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm
Mark W. Schmidt
Ewout van den Berg
Michael P. Friedlander
Journal of Machine Learning Research - Proceedings Track, vol. 5 (2009), pp. 456-463
Model-based clustering of array CGH data
Sohrab P. Shah
K-John Cheung Jr.
Nathalie A. Johnson
Guillaume Alain
Randy D. Gascoyne
Douglas E. Horsman
Raymond T. Ng
Bioinformatics, vol. 25 (2009)
Modeling Discrete Interventional Data using Directed Cyclic Graphical Models
An experimental investigation of model-based parameter optimisation: SPO and beyond
Group Sparse Priors for Covariance Estimation
Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
Baback Moghaddam
Benjamin M. Marlin
Mohammad Emtiyaz Khan
NIPS (2009), pp. 1285-1293
A Hybrid Conditional Random Field for Estimating the Underlying Ground Surface From Airborne LiDAR Data
Wei-Lwun Lu
James J. Little
Alla Sheffer
Hongbo Fu
IEEE T. Geoscience and Remote Sensing, vol. 47 (2009), pp. 2913-2922
Sparse Gaussian graphical models with unknown block structure
Structure learning in random fields for heart motion abnormality detection
LabelMe: A Database and Web-Based Tool for Image Annotation
Bryan C. Russell
Antonio Torralba
William T. Freeman
International Journal of Computer Vision, vol. 77 (2008), pp. 157-173
Modeling changing dependency structure in multivariate time series
Bayesian structure learning using dynamic programming and MCMC
Learning Graphical Model Structure Using L1-Regularization Paths
A non-myopic approach to visual search
Figure-ground segmentation using a hierarchical conditional random field
Efficient parameter estimation for RNA secondary structure prediction
Mirela Andronescu
Anne Condon
Holger H. Hoos
David H. Mathews
ISMB/ECCB (Supplement of Bioinformatics) (2007), pp. 19-28
Sharing Visual Features for Multiclass and Multiview Object Detection
Antonio Torralba
William T. Freeman
IEEE Trans. Pattern Anal. Mach. Intell., vol. 29 (2007), pp. 854-869
Exact Bayesian structure learning from uncertain interventions
Daniel Eaton
Journal of Machine Learning Research - Proceedings Track, vol. 2 (2007), pp. 107-114
Modeling recurrent DNA copy number alterations in array CGH data
Sohrab P. Shah
Wan L. Lam
Raymond T. Ng
ISMB/ECCB (Supplement of Bioinformatics) (2007), pp. 450-458
Integrating copy number polymorphisms into array CGH analysis using a robust HMM
Sohrab P. Shah
Xiang Xuan
Ronald J. deLeeuw
Mehrnoush Khojasteh
Wan L. Lam
Raymond T. Ng
ISMB (Supplement of Bioinformatics) (2006), pp. 431-439
Object Detection and Localization Using Local and Global Features
Antonio Torralba
Daniel Eaton
William T. Freeman
Toward Category-Level Object Recognition (2006), pp. 382-400
Accelerated training of conditional random fields with stochastic gradient methods
Shared Features for Multiclass Object Detection
Antonio Torralba
William T. Freeman
Toward Category-Level Object Recognition (2006), pp. 345-361
Sharing Features: Efficient Boosting Procedures for Multiclass Object Detection
Contextual Models for Object Detection Using Boosted Random Fields
Representing Hierarchical POMDPs as DBNs for Multi-scale Robot Localization
Context-based vision system for place and object recognition
Graphical Model For Recognizing Scenes and Objects
A coupled HMM for audio-visual speech recognition
Ara V. Nefian
Luhong Liang
Xiaobo Pi
Xiaoxiang Liu
Crusoe Mao
ICASSP (2002), pp. 2013-2016
Dynamic Bayesian Networks for Audio-Visual Speech Recognition
Ara V. Nefian
Luhong Liang
Xiaobo Pi
Xiaoxing Liu
EURASIP J. Adv. Sig. Proc., vol. 2002 (2002), pp. 1274-1288
The Factored Frontier Algorithm for Approximate Inference in DBNs
Linear-time inference in Hierarchical HMMs
Rao-Blackwellised Particle Filtering for Dynamic Bayesian Networks
Loopy Belief Propagation for Approximate Inference: An Empirical Study
A Variational Approximation for Bayesian Networks with Discrete and Continuous Latent Variables
UAI (1999), pp. 457-466
A Dynamic Bayesian Network Approach to Figure Tracking using Learned Dynamic Models
Vision-Based Speaker Detection Using Bayesian Networks
Bayesian Map Learning in Dynamic Environments
NIPS (1999), pp. 1015-1021
Learning the Structure of Dynamic Probabilistic Networks
Space-Efficient Inference in Dynamic Probabilistic Networks
Automata-Theoretic Models of Mutation and Alignment