Jonathan Huang

I am a research scientist at Google working on machine learning, computer vision, and NLP projects. Most recently, I led the team that won 1st place in the COCO object detection challenge. Prior to Google, I was a postdoctoral fellow in the Computer Science Department at Stanford University, supported by an NSF/CRA CI (Computing Innovations) fellowship. At Stanford I was a member of the Geometric Computation Group, which is headed by Leonidas Guibas. I was also part of the Lytics Lab, a multidisciplinary group focused on Learning Analytics. A more complete publication list can be found at my personal webpage.
Authored Publications
    VideoPoet: A Large Language Model for Zero-Shot Video Generation
    Lijun Yu
    Xiuye Gu
    Rachel Hornung
    Hassan Akbari
    Ming-Chang Chiu
    Josh Dillon
    Agrim Gupta
    Meera Hahn
    Anja Hauth
    David Hendon
    Alonso Martinez
    Grant Schindler
    Huisheng Wang
    Jimmy Yan
    Xuan Yang
    Lu Jiang
    arXiv preprint (2023) (to appear)
    We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
    The Auto-Arborist Dataset: A Large-Scale Benchmark for Generalizable, Multimodal Urban Forest Monitoring
    Sara Meghan Beery
    Guanhang Wu
    Trevor Edwards
    Filip Pavetić
    Bo Majewski
    Stan Chan
    John Morgan
    Vivek Mansing Rathod
    CVPR 2022 (2022)
    Urban forests provide significant benefits to urban societies (e.g., cleaner air and water, carbon sequestration, and energy savings among others). However, planning and maintaining these forests is expensive. One particularly costly aspect of urban forest management is monitoring the existing trees in a city, i.e., tracking tree locations, species, and health. Monitoring efforts are currently based on tree censuses built by human experts, collected at a rate of once every five years or less and costing cities millions of dollars. In this paper we explore the use of computer vision to automatically find, label, and monitor individual trees at a large scale using a combination of street level and aerial imagery. Previous investigations into automating this process focused on small datasets from single cities, covering only common species (Branson et al., 2018; Sumbul et al., 2017). These fail to capture the complexity of the problem, which is both fine-grained and significantly long-tailed, and result in methods that are not applicable to new cities. To address this shortcoming, we introduce a new large scale dataset that joins public tree inventories (maintained by cities) with a large collection of street level and aerial imagery. Our Auto-Arborist dataset contains over 2.5 million trees covering >340 genus level categories from North America and is currently at least two orders of magnitude larger than the closest comparable dataset in the literature. Uniquely, we cover multiple cities (to our knowledge, prior works have restricted their focus to single-city datasets), which allows for analysis of generalization with respect to geographic distribution shifts that was not previously possible. We propose a set of metrics to evaluate performance, especially with respect to these geographic distribution shifts, and show the strengths and weaknesses of typical deep learning models when applied to the Auto-Arborist dataset. We hope our dataset can be an important and exciting new scientific benchmark that will spur progress on the application of computer vision to urban ecology and sustainability.
    PERF-Net: Pose Empowered RGB-Flow Net
    Zhichao Lu
    Xuehan Xiong
    IEEE Winter Conference on Applications of Computer Vision (2021)
    In recent years, many works in the video action recognition literature have shown that two-stream models (combining spatial and temporal input streams) are necessary for achieving state-of-the-art performance. In this paper we show the benefits of including yet another stream based on human pose estimated from each frame, specifically by rendering pose on input RGB frames. At first blush, this additional stream may seem redundant given that human pose is fully determined by RGB pixel values; however, we show (perhaps surprisingly) that this simple and flexible addition can provide complementary gains. Using this insight, we propose a new model, which we dub PERF-Net (short for Pose Empowered RGB-Flow Net), that combines this new pose stream with the standard RGB and flow based input streams via distillation techniques. We show that our model outperforms the state of the art by a large margin on a number of human action recognition datasets while not requiring flow or pose to be explicitly computed at inference time. The proposed pose stream is also part of the winning solution of the ActivityNet Kinetics Challenge 2020.
    Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist: spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfitting. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
    This paper presents a weakly-supervised approach to object instance segmentation. Starting with known or predicted object bounding boxes, we learn object masks by playing a game of cut-and-paste in an adversarial learning setup. A mask generator takes a detection box and Faster R-CNN features, and constructs a segmentation mask that is used to cut-and-paste the object into a new image location. The discriminator tries to distinguish between real objects, and those cut and pasted via the generator, giving a learning signal that leads to improved object masks. We verify our method experimentally using Cityscapes, COCO, and aerial image datasets, learning to segment objects without ever having seen a mask in training. Our method exceeds the performance of existing weakly supervised methods, without requiring hand-tuned segment proposals, and reaches 90% of supervised performance.
    Progressive Neural Architecture Search
    Chenxi Liu
    Barret Zoph
    Maxim Neumann
    Jonathan Shlens
    Wei Hua
    Jia Li
    Fei-Fei Li
    Alan Yuille
    ECCV (2018)
    We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Direct comparison under the same search space shows that our method is up to 5 times more efficient than the RL method of Zoph et al. (2018) in terms of number of models evaluated, and 8 times faster in terms of total compute. The structures we discover in this way achieve state of the art classification accuracies on CIFAR-10 and ImageNet.
    Spatially Adaptive Computation Time for Residual Networks
    Dmitry P. Vetrov
    Maxwell Collins
    Michael Figurnov
    Ruslan Salakhutdinov
    Yukun Zhu
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
    This paper proposes a deep learning architecture based on Residual Network that dynamically adjusts the number of executed layers for the regions of the image. This architecture is end-to-end trainable, deterministic and problem-agnostic. It is therefore applicable without any modifications to a wide range of computer vision problems such as image classification, object detection and image segmentation. We present experimental results showing that this model improves the computational efficiency of ResNet on the challenging ImageNet classification and COCO object detection datasets. Additionally, we evaluate the computation time maps on the image saliency dataset cat2000 and find that they correlate surprisingly well with human eye fixation positions.
    The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN (Ren et al., 2015), R-FCN (Dai et al., 2016) and SSD (Liu et al., 2015) systems, which we view as "meta-architectures", and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that runs at over 50 frames per second and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
    Detecting Events and Key Actors in Multi-Person Videos
    Vignesh Ramanathan
    Alexander Gorban
    Li Fei-Fei
    Computer Vision and Pattern Recognition (CVPR) (2016)
    Multi-person event recognition is a challenging task, often with many people active in the scene but only a small subset contributing to an actual event. In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event. Our model does not use explicit annotations regarding who or where those people are during training and testing. In particular, we track people in videos and use a recurrent neural network (RNN) to represent the track features. We learn time-varying attention weights to combine these features at each time-instant. The attended features are then processed using another RNN for event detection/classification. Since most video datasets with multiple people are restricted to a small number of videos, we also collected a new basketball dataset comprising 257 basketball games with 14K event annotations corresponding to 11 event classes. Our model outperforms state-of-the-art methods for both event classification and detection on this new dataset. Additionally, we show that the attention mechanism is able to consistently localize the relevant players.
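As a rough illustration of the attention step described in this abstract, the sketch below combines per-track feature vectors at a single time step using softmax weights. It is not the paper's model: the function name `attention_pool`, the raw scores, and the toy feature vectors are all hypothetical, and in the paper both the track features and the attention weights come from learned networks.

```python
import math


def attention_pool(track_features, scores):
    """Combine per-track feature vectors with softmax attention weights.

    track_features: list of equal-length feature vectors, one per tracked person.
    scores: one raw (unnormalized) attention score per track; hypothetical inputs
            standing in for what a learned model would produce.
    """
    # Softmax over the scores, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the feature vectors, one dimension at a time.
    dim = len(track_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, track_features))
        for d in range(dim)
    ]
    return weights, pooled


# Made-up features for three tracked players at one time step.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = [2.0, 0.0, 0.0]
weights, pooled = attention_pool(features, scores)
```

The track with the highest score dominates the pooled feature, which is the sense in which such a model "attends" to a subset of people.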
    Generation and Comprehension of Unambiguous Object Descriptions
    Junhua Mao
    Alexander Toshev
    Oana Camburu
    Computer Vision and Pattern Recognition (2016)
    We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MSCOCO. We have released the dataset and a toolbox for visualization and evaluation, see https://github.com/mjhucla/Google_Refexp_toolbox.
    G-RMI Object Detection
    Anoop Korattikara
    Menglong Zhu
    Vivek Rathod
    Zbigniew Wojna
    2nd ImageNet and COCO Visual Recognition Challenges Joint Workshop, Amsterdam (2016)
    We present our submission to the COCO 2016 Object Detection challenge.
    Im2Calories: towards an automated mobile vision food diary
    Austin Myers
    Vivek Rathod
    Anoop Korattikara
    Alex Gorban
    Nathan Silberman
    George Papandreou
    ICCV (2015)
    We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories. The simplest version assumes that the user is eating at a restaurant for which we know the menu. In this case, we can collect images offline to train a multi-label classifier. At run time, we apply the classifier (running on your phone) to predict which foods are present in your meal, and we look up the corresponding nutritional facts. We apply this method to a new dataset of images from 23 different restaurants, using a CNN-based classifier, significantly outperforming previous work. The more challenging setting works outside of restaurants. In this case, we need to estimate the size of the foods, as well as their labels. This requires solving segmentation and depth / volume estimation from a single image. We present CNN-based approaches to these problems, with promising preliminary results.
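The restaurant setting in this abstract (predict which menu items are present, then look up their nutritional facts) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function `estimate_meal_calories`, the 0.5 threshold, the menu table, and the classifier probabilities are all made up here and do not come from the paper.

```python
def estimate_meal_calories(probabilities, menu_calories, threshold=0.5):
    """Pick menu items whose predicted probability clears a threshold
    and sum their calorie values from a known menu table.

    probabilities: dict mapping item name -> multi-label classifier output.
    menu_calories: dict mapping item name -> calories (hypothetical values).
    """
    # Keep only items that are on the known menu and confidently predicted.
    present = sorted(
        item
        for item, p in probabilities.items()
        if p >= threshold and item in menu_calories
    )
    total = sum(menu_calories[item] for item in present)
    return present, total


# Made-up menu and classifier outputs, for illustration only.
menu_calories = {"burger": 550, "fries": 320, "salad": 150}
probabilities = {"burger": 0.92, "fries": 0.71, "salad": 0.08}
items, total = estimate_meal_calories(probabilities, menu_calories)
```

Because the menu is known in advance, no portion size estimate is needed in this setting; the harder out-of-restaurant case described in the abstract would replace the table lookup with segmentation and volume estimation.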
    What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision
    Jonathan Malmaud
    Vivek Rathod
    Andrew Rabinovich
    North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015) (to appear)
    We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
    Hilbert space embeddings of conditional distributions with applications to dynamical systems
    Le Song
    Alexander J. Smola
    Kenji Fukumizu
    ICML (2009), pp. 121