Sudheendra Vijayanarasimhan
Sudheendra received his Ph.D. in Computer Science from the University of Texas at Austin in 2011, where his research focused on active learning in computer vision, and joined Google the same year. His current work at Google Research focuses on video classification and action recognition.
His past projects include scaling up object detection and neural-network classification, and designing neural-network architectures for video classification.
Authored Publications
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Jia Deng
Yu-Wei Chao
CVPR 2018
We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion features late is important. We achieve state-of-the-art performance for both action proposal and localization on the THUMOS'14 detection benchmark, and competitive performance on the ActivityNet challenge.
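The receptive-field alignment idea in (1) can be pictured with a small sketch: each position of a strided 1-D feature map proposes anchor segments at several temporal scales, and the dilation rate of the proposal convolutions is chosen so that their receptive field roughly matches each anchor's span. The scales, stride, and dilation heuristic below are illustrative assumptions, not the TAL-Net configuration.

```python
# Minimal sketch of multi-scale temporal anchors with receptive-field-matched
# dilation rates (hypothetical scales/stride, not the TAL-Net configuration).
import numpy as np

def temporal_anchors(num_cells, stride, scales=(2, 4, 8, 16, 32)):
    """Return (start, end) anchor segments, in frames, centred on each cell
    of a 1-D feature map with the given temporal stride."""
    anchors = []
    for t in range(num_cells):
        center = (t + 0.5) * stride
        for s in scales:
            length = s * stride
            anchors.append((center - length / 2.0, center + length / 2.0))
    return np.array(anchors)

def dilation_for_scale(scale, kernel_size=3, base_receptive_field=1):
    """Pick a dilation rate so a stack of two dilated convolutions roughly
    covers an anchor spanning `scale` feature-map cells."""
    # Two kernel_size convs with dilation d span ~ 2 * d * (kernel_size - 1) + 1 cells.
    return max(1, int(np.ceil((scale - base_receptive_field) / (2 * (kernel_size - 1)))))

anchors = temporal_anchors(num_cells=64, stride=8)
print(anchors.shape)                                 # (320, 2): 64 cells x 5 scales
print([dilation_for_scale(s) for s in (2, 4, 8, 16, 32)])
```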
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Carl Martin Vondrick
Jitendra Malik
CVPR (2018)
This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly.
AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon current state-of-the-art methods and demonstrates better performance on JHMDB and the UCF101-24 categories. While our approach sets a new state of the art on existing datasets, its overall performance on AVA is low at 15.6% mAP, underscoring the need for new approaches to video understanding.
Self-Supervised Learning of Structure and Motion from Video
Aikaterini Fragkiadaki
arXiv (2017)
We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion, and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, and camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels, and back-propagates. The model can be trained with various degrees of supervision: 1) completely unsupervised, 2) supervised by ego-motion (camera motion), 3) supervised by depth (e.g., as provided by RGBD sensors), or 4) supervised by ground-truth optical flow. We show that SfM-Net successfully estimates segmentation of the objects in the scene, even though such supervision is never provided. It extracts meaningful depth estimates or in-fills the depth of RGBD sensors and successfully estimates frame-to-frame camera displacements. SfM-Net achieves state-of-the-art optical flow performance. Our work is inspired by the long history of research in geometry-aware motion estimation, Simultaneous Localization and Mapping (SLAM), and Structure from Motion (SfM). SfM-Net is an important first step towards providing a learning-based approach for such tasks. A major benefit over existing optimization approaches is that our proposed method can improve itself by processing more videos, and by learning to explicitly model moving objects in dynamic scenes.
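The "differentiably warps frames in time" step is, at its core, bilinear sampling of one frame at locations displaced by the predicted flow. The NumPy sketch below illustrates that warping operation only; in the model the same operation is implemented with differentiable ops so that a photometric loss between the warped and target frames can be back-propagated.

```python
# Sketch of warping a frame with a dense frame-to-frame motion field (optical flow).
# NumPy version for illustration; a differentiable sampler is needed in practice.
import numpy as np

def warp_with_flow(image, flow):
    """Bilinearly sample `image` (H, W, C) at locations displaced by `flow` (H, W, 2)."""
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Target sampling coordinates, clipped to stay inside the image.
    x = np.clip(xs + flow[..., 0], 0, w - 1.001)
    y = np.clip(ys + flow[..., 1], 0, h - 1.001)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    # Bilinear interpolation of the four neighbouring pixels.
    top = (1 - wx)[..., None] * image[y0, x0] + wx[..., None] * image[y0, x1]
    bot = (1 - wx)[..., None] * image[y1, x0] + wx[..., None] * image[y1, x1]
    return (1 - wy)[..., None] * top + wy[..., None] * bot

frame = np.random.rand(120, 160, 3)
flow = np.ones((120, 160, 2)) * 2.0       # shift everything by two pixels
warped = warp_with_flow(frame, flow)
print(warped.shape)                       # (120, 160, 3)
```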
The Kinetics Human Action Video Dataset
Andrew Zisserman
Joao Carreira
Karen Simonyan
Will Kay
Brian Zhang
Chloe Hillier
Fabio Viola
Tim Green
Trevor Back
Mustafa Suleyman
arXiv (2017)
We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10 seconds and is taken from a different YouTube video. The actions are human-focused and cover a broad range of classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset and how it was collected, and give baseline performance figures for neural network architectures trained and tested for human action classification on this dataset.
End-to-End Learning of Semantic Grasping
Eric Jang
Julian Ibarz
Peter Pastor Sampedro
Sergey Levine
CoRL 2017 (to appear)
We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns object detection, classification, and grasp planning in an end-to-end fashion. A ``ventral stream'' recognizes object class while a ``dorsal stream'' simultaneously interprets the geometric relationships necessary to execute successful grasps. We leverage the autonomous data collection capabilities of robots to obtain a large self-supervised dataset for training the dorsal stream, and use semi-supervised label propagation to train the ventral stream with only a modest amount of human supervision. We experimentally show that our approach exhibits an improvement in accuracy over grasping systems whose components are not learned end-to-end, including a baseline method that uses bounding box detection. Furthermore, we show that jointly training our model with auxiliary data consisting of non-semantic grasping data, as well as semantically labeled images without grasp actions, has the potential to substantially improve semantic grasping performance.
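A highly simplified sketch of the two-stream split described above, with hypothetical feature and layer sizes rather than the architecture used in the paper: a shared trunk feeds a "ventral" object-class head and a "dorsal" grasp-success head.

```python
# Minimal two-headed sketch of the ventral/dorsal split (hypothetical sizes,
# not the architecture from the paper).
import torch
import torch.nn as nn

class TwoStreamGrasping(nn.Module):
    def __init__(self, feature_dim=512, num_classes=16):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU())
        self.ventral = nn.Linear(256, num_classes)  # "what": object class logits
        self.dorsal = nn.Linear(256, 1)             # "how": grasp-success logit

    def forward(self, image_features):
        h = self.trunk(image_features)
        return self.ventral(h), self.dorsal(h)

model = TwoStreamGrasping()
class_logits, grasp_logit = model(torch.randn(4, 512))
print(class_logits.shape, grasp_logit.shape)  # torch.Size([4, 16]) torch.Size([4, 1])
```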
YouTube-8M: A Large-Scale Video Classification Benchmark
Nisarg Kothari
Balakrishnan Varadarajan
arXiv:1609.08675 (2016)
Many recent advancements in computer vision are attributed to large datasets. Open-source software packages for machine learning and inexpensive commodity hardware have reduced the barrier to entry for exploring novel approaches at scale; it is now possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no video classification datasets of comparable size.
In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video) annotated with a vocabulary of 4803 visual entities. To obtain the videos and their (multiple) labels, we used the YouTube Data APIs. We filtered the video labels (Freebase topics) using both automated and manual curation strategies, including asking Mechanical Turk workers whether the labels are visually recognizable. We then decoded each video at one frame per second and used a deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and made both the features and video-level labels available for download. The dataset contains frame-level features for over 1.9 billion video frames and 8 million videos, making it the largest public multi-label video dataset.
We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using the publicly available TensorFlow framework. We plan to release code for training a basic TensorFlow model and for computing metrics.
We show that pre-training on large data generalizes to other datasets like Sports-1M and ActivityNet. We achieve state-of-the-art results on ActivityNet, improving mAP from 53.8% to 77.8%. We hope that the unprecedented scale and diversity of YouTube-8M will lead to advances in video understanding and representation learning.
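The frame-feature compression step mentioned above can be pictured as a dimensionality reduction followed by quantization. The sketch below is illustrative only; the projection size and 8-bit quantization scheme are assumptions, not the exact released pipeline.

```python
# Sketch of compressing per-frame CNN features: PCA projection followed by
# 8-bit quantization. Sizes and the quantization scheme are illustrative.
import numpy as np

def fit_pca(features, out_dim=64):
    mean = features.mean(axis=0)
    # Eigen-decomposition of the covariance; keep the top `out_dim` directions.
    cov = np.cov(features - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    proj = eigvecs[:, np.argsort(eigvals)[::-1][:out_dim]]
    return mean, proj

def compress(features, mean, proj):
    reduced = (features - mean) @ proj
    lo, hi = reduced.min(), reduced.max()
    quantized = np.round((reduced - lo) / (hi - lo) * 255).astype(np.uint8)
    return quantized, (lo, hi)          # keep the range to de-quantize later

frames = np.random.randn(300, 512).astype(np.float32)  # stand-in for CNN activations
mean, proj = fit_pca(frames, out_dim=64)
codes, value_range = compress(frames, mean, proj)
print(codes.shape, codes.dtype)                         # (300, 64) uint8
```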
Beyond Short Snippets: Deep Networks for Video Classification
Joe Yue-Hei Ng
Matthew Hausknecht
Rajat Monga
Computer Vision and Pattern Recognition (2015)
Convolutional neural networks (CNNs) have been extensively applied to image recognition problems, giving state-of-the-art results on recognition, detection, segmentation, and retrieval. In this work we propose and evaluate several deep neural network architectures that combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first explores various convolutional temporal feature pooling architectures, examining the design choices that need to be made when adapting a CNN for this task. The second explicitly models the video as an ordered sequence of frames, employing a recurrent neural network with Long Short-Term Memory (LSTM) cells connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports-1M dataset (73.1% vs. 60.9%) and on UCF-101, both with (88.6% vs. 88.0%) and without (82.6% vs. 72.8%) additional optical flow information.
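A bare-bones sketch of the second approach, an LSTM run over precomputed per-frame CNN features; the dimensions below are illustrative rather than those used in the paper.

```python
# Sketch of classifying a video by running an LSTM over precomputed per-frame
# CNN features; dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FrameLSTMClassifier(nn.Module):
    def __init__(self, feature_dim=2048, hidden_dim=512, num_classes=487):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_features):            # (batch, time, feature_dim)
        _, (h_n, _) = self.lstm(frame_features)
        return self.classifier(h_n[-1])           # logits from the final hidden state

model = FrameLSTMClassifier()
logits = model(torch.randn(2, 30, 2048))          # 2 videos, 30 frames each
print(logits.shape)                               # torch.Size([2, 487])
```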
Deep Networks With Large Output Spaces
Jonathon Shlens
Rajat Monga
International Conference on Learning Representations (2015)
Deep neural networks have been extremely successful at a variety of image, speech, and video recognition tasks because of their ability to model deep structure within the data. However, they remain prohibitively expensive to train and apply to problems with millions of classes in the output layer. Based on the observation that the key computation common to most neural network layers is a vector/matrix product, we propose a fast locality-sensitive hashing technique that approximates the actual dot products, enabling us to scale training and inference to millions of output classes. We evaluate our technique on three diverse large-scale recognition tasks and show that our approach can train large-scale models at a faster rate (in terms of steps/total time) than baseline methods.
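A toy NumPy sketch of the idea: hash the output-layer weight vectors with random hyperplanes, use collision counts against the hashed activation as a proxy for the dot products, and compute exact dot products only for the top candidates. The band sizes and candidate counts below are assumptions, not the paper's exact scheme.

```python
# Toy sketch: approximate the output-layer dot products with random-hyperplane
# (sign) hashing, then exactly re-score only the top collision-count candidates.
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 10_000, 256
W = rng.standard_normal((num_classes, dim)).astype(np.float32)   # output-layer weights

num_bands, bits_per_band = 16, 8
planes = rng.standard_normal((num_bands, bits_per_band, dim)).astype(np.float32)
powers = 1 << np.arange(bits_per_band)

def hash_codes(vectors):
    """One integer code per band for each row of `vectors`."""
    bits = (np.einsum("bkd,nd->nbk", planes, vectors) > 0).astype(np.int64)
    return bits @ powers                                          # (n, num_bands)

class_codes = hash_codes(W)                                       # precomputed offline

def top_classes(activation, k=100):
    query = hash_codes(activation[None, :])[0]                    # (num_bands,)
    collisions = (class_codes == query).sum(axis=1)               # proxy for the dot product
    candidates = np.argpartition(collisions, -k)[-k:]             # keep the k best proxies
    exact = W[candidates] @ activation                            # exact re-scoring
    return candidates[np.argsort(exact)[::-1]]

activation = rng.standard_normal(dim).astype(np.float32)
print(top_classes(activation)[:5])                                # best candidate classes
```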
Efficient Large Scale Video Classification
Balakrishnan Varadarajan
dblp computer science bibliography, http://dblp.org (2015) (to appear)
Video classification has advanced tremendously in recent years. A large part of this improvement is due to work from the image classification community and the use of deep convolutional neural networks (CNNs), which produce results competitive with hand-crafted motion features. These networks have been adapted to consume video frames in various ways and have yielded state-of-the-art classification results. We present two methods that build on this work and scale it up to millions of videos and hundreds of thousands of classes while maintaining a low computational cost. In the context of large-scale video processing, training CNNs on video frames is extremely time consuming because of the large number of frames involved. We avoid this problem by training CNNs on either YouTube thumbnails or Flickr images and then using the outputs of these networks as features for higher-level classifiers. We discuss the challenges of this approach and propose two models, one for frame-level and one for video-level classification. The first is a highly efficient mixture of experts, while the second is based on long short-term memory (LSTM) neural networks. We present results on the Sports-1M video dataset (1 million videos, 487 classes) and on a new dataset with 12 million videos and 150,000 labels.
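A minimal sketch of a mixture-of-experts prediction for a single label over precomputed features: a softmax gate weights the sigmoid outputs of a few logistic experts. The number of experts, the dimensions, and the per-label binary formulation here are illustrative assumptions.

```python
# Toy forward pass of a mixture of (logistic) experts for one label:
# a softmax gate weights the sigmoid predictions of a few experts.
import numpy as np

rng = np.random.default_rng(0)
dim, num_experts = 1024, 4
gate_w = rng.standard_normal((dim, num_experts))
expert_w = rng.standard_normal((dim, num_experts))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_probability(features):
    """P(label | features) = sum_e gate_e(x) * sigmoid(expert_e(x))."""
    gates = softmax(features @ gate_w)              # (batch, num_experts)
    experts = sigmoid(features @ expert_w)          # (batch, num_experts)
    return (gates * experts).sum(axis=-1)           # (batch,)

video_features = rng.standard_normal((8, dim))
print(moe_probability(video_features))             # 8 probabilities in (0, 1)
```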
Fast, Accurate Detection of 100,000 Object Classes on a Single Machine: Technical Supplement
Thomas Dean
Mark Ruzon
Mark Segal
Jonathon Shlens
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA (2013)
In the companion paper published in CVPR 2013, we presented a method that can directly use deformable part models (DPMs) trained as in [Felzenszwalb et al., CVPR 2008]. After training, HOG-based part filters are hashed, and, during inference, counts of hashing collisions summed over all hash bands serve as a proxy for part-filter / sliding-window dot products, i.e., filter responses. Because these counts are only an approximation, we take the original HOG-based filters corresponding to the top hash counts and compute the exact dot products for scoring.
It is possible to train DPM models not on HOG data but on a hashed WTA [Yagnik et al ICCV 2011] version of this data. The resulting part filters are sparse, real-valued vectors the size of WTA vectors computed from sliding windows. Given the WTA hash of a window, we exactly recover dot products of the top responses using an extension of locality-sensitive hashing. In this supplement, we sketch a method for training such WTA-based models.
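For concreteness, a small NumPy sketch of the WTA hash referred to above: for each of a fixed set of random permutations, keep the first K elements of the permuted vector and record the index of the maximum. The permutation count and K below are illustrative.

```python
# Sketch of the winner-take-all (WTA) hash: for each shared random permutation,
# look at the first k elements of the permuted vector and record the argmax.
import numpy as np

rng = np.random.default_rng(0)
dim, num_permutations, k = 1984, 128, 4
# The same permutations must be reused for every vector being hashed.
perms = np.stack([rng.permutation(dim)[:k] for _ in range(num_permutations)])

def wta_hash(vector):
    """One ordinal code in [0, k) per permutation."""
    return np.argmax(vector[perms], axis=1)        # (num_permutations,)

hog_window = rng.standard_normal(dim)              # e.g. a flattened HOG window
print(wta_hash(hog_window)[:10])
```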