Apostol (Paul) Natsev
Apostol (Paul) Natsev is a software engineer and manager in the video content analysis group at Google Research. Previously, he was a research staff member and manager of the multimedia research group at IBM Research from 2001 to 2011. He received a master's degree and a Ph.D. in computer science from Duke University, Durham, NC, in 1997 and 2001, respectively. Dr. Natsev's research interests span image and video analysis and retrieval, machine perception, large-scale machine learning, and recommendation systems. He has authored more than 80 publications, and his research has been recognized with several awards.
Authored Publications
Large Scale Video Representation Learning via Relational Graph Clustering
Hyodong Lee
Joe Yue-Hei Ng
Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Abstract
Representation learning is widely applied to various tasks on multimedia data, e.g., retrieval and search. One approach to learning useful representations is to exploit the relationships or similarities between examples. In this work, we explore two promising, scalable representation learning approaches in the video domain. Using hierarchical graph clusters built upon video-to-video similarities, we propose: 1) a smart negative sampling strategy that significantly boosts training efficiency with triplet loss, and 2) a pseudo-classification approach that uses the clusters as pseudo-labels. The embeddings trained with the proposed methods are competitive on multiple video understanding tasks, including related-video retrieval and video annotation. Both proposed methods are highly scalable, as verified by experiments on large-scale datasets.
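A minimal sketch of the cluster-based negative sampling idea, in Python with NumPy; the helper names, the uniform sampling within and across clusters, and the standard triplet margin loss are illustrative assumptions rather than the exact procedure from the paper.

import numpy as np

def sample_triplets(embeddings, cluster_ids, num_triplets, rng=None):
    """Sample (anchor, positive, negative) index triplets using cluster structure.

    Positives come from the anchor's own cluster; negatives are drawn from
    other clusters, so they are known non-matches but still related enough
    to be informative. (Illustrative only.)
    """
    rng = rng or np.random.default_rng()
    clusters = {}
    for idx, c in enumerate(cluster_ids):
        clusters.setdefault(c, []).append(idx)
    cluster_keys = list(clusters.keys())
    triplets = []
    while len(triplets) < num_triplets:
        c = rng.choice(cluster_keys)
        members = clusters[c]
        if len(members) < 2:
            continue
        a, p = rng.choice(members, size=2, replace=False)
        neg_cluster = rng.choice([k for k in cluster_keys if k != c])
        n = rng.choice(clusters[neg_cluster])
        triplets.append((int(a), int(p), int(n)))
    return triplets

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on squared L2 distances."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, pos_dist - neg_dist + margin).mean()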
Large-Scale Training Framework for Video Annotation
Seong Jae Hwang
Balakrishnan Varadarajan
Ariel Gordon
Proc. of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), ACM (2019)
Abstract
Video is one of the richest sources of information available online, but extracting deep insights from video content at internet scale is still an open problem, in terms of both the depth and breadth of understanding and the sheer scale involved. Over the last few years, the field of video understanding has made great strides thanks to the availability of large-scale video datasets and core advances in image, audio, and video modeling architectures. However, the state-of-the-art architectures on small-scale datasets are frequently impractical to deploy at internet scale, both in terms of training such deep networks on hundreds of millions of videos and of deploying them for inference on billions of videos. In this paper, we present a MapReduce-based training framework that exploits both data parallelism and model parallelism to scale training of complex video models. The proposed framework uses alternating optimization and full-batch fine-tuning, and supports large Mixture-of-Experts classifiers with hundreds of thousands of mixtures. This enables a trade-off between model depth and breadth, and the ability to shift model capacity between shared (generalization) layers and per-class (specialization) layers. We demonstrate that the proposed framework reaches state-of-the-art performance on the largest public video datasets, YouTube-8M and Sports-1M, and can scale to datasets 100 times larger.
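The mixture-of-experts classifier mentioned in the abstract can be sketched roughly as follows; the single-layer experts, softmax gating, and tensor shapes are illustrative assumptions, not the paper's implementation (which trains such models with MapReduce at much larger scale).

import numpy as np

def moe_predict(features, gate_w, gate_b, expert_w, expert_b):
    """Per-class mixture-of-experts over fixed video-level features.

    features:  [batch, dim] video-level feature vectors
    gate_w:    [dim, num_classes, num_experts]  softmax gating weights
    expert_w:  [dim, num_classes, num_experts]  per-expert logistic weights
    gate_b, expert_b: [num_classes, num_experts] biases
    Returns:   [batch, num_classes] class probabilities.
    """
    gate_logits = np.einsum('bd,dce->bce', features, gate_w) + gate_b
    gate = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    gate /= gate.sum(axis=-1, keepdims=True)              # softmax over experts
    expert_logits = np.einsum('bd,dce->bce', features, expert_w) + expert_b
    expert_prob = 1.0 / (1.0 + np.exp(-expert_logits))    # sigmoid per expert
    return (gate * expert_prob).sum(axis=-1)               # mix expert opinions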
Collaborative Deep Metric Learning for Video Understanding
Balakrishnan Varadarajan
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM (2018)
Abstract
The goal of video understanding is to develop algorithms that enable machines to understand videos at the level of human experts. Researchers have tackled various domains, including video classification, search, and personalized recommendation, but there is a research gap in combining these domains in one unified learning framework. Toward that end, we propose a deep network that embeds videos, using their audio-visual content, onto a metric space that preserves video-to-video relationships. We then use the trained embedding network to tackle various domains, including video classification and recommendation, showing significant improvements over state-of-the-art baselines. The proposed approach scales readily to large video-sharing platforms such as YouTube.
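As a rough illustration of how such a metric embedding might be used for related-video recommendation, the brute-force nearest-neighbor lookup below sketches the retrieval step; the function names are hypothetical, cosine similarity is an assumed metric, and a production system would use an approximate nearest-neighbor index rather than an exhaustive scan.

import numpy as np

def related_videos(query_embedding, catalog_embeddings, catalog_ids, top_k=10):
    """Rank catalog videos by cosine similarity to a query video's embedding.

    Assumes embeddings were produced by a trained audio-visual embedding
    network; the exhaustive search here is for illustration only.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    c = catalog_embeddings / np.linalg.norm(catalog_embeddings, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]
    return [(catalog_ids[i], float(scores[i])) for i in order]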
The Kinetics Human Action Video Dataset
Andrew Zisserman
Joao Carreira
Karen Simonyan
Will Kay
Brian Zhang
Chloe Hillier
Fabio Viola
Tim Green
Trevor Back
Mustafa Suleyman
arXiv (2017)
Abstract
We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset.
YouTube-8M: A Large-Scale Video Classification Benchmark
Nisarg Kothari
Balakrishnan Varadarajan
arXiv:1609.08675 (2016)
Abstract
Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have lowered the barrier to entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no video classification datasets of comparable size.
In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos---500K hours of video---annotated with a vocabulary of 4803 visual entities. To obtain the videos and their (multiple) labels, we used the YouTube Data APIs. We filtered the video labels (Freebase topics) using both automated and manual curation strategies, including asking Mechanical Turk workers whether the labels are visually recognizable. We then decoded each video at one frame per second and used a deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and made both the features and the video-level labels available for download. The dataset contains frame-level features for over 1.9 billion video frames and 8 million videos, making it the largest public multi-label video dataset.
We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using the publicly-available TensorFlow framework. We plan to release code for training a basic TensorFlow model and for computing metrics.
We show that pre-training on large data generalizes to other datasets like Sports-1M and ActivityNet. We achieve state-of-the-art on ActivityNet, improving mAP from 53.8% to 77.8%. We hope that the unprecedented scale and diversity of YouTube-8M will lead to advances in video understanding and representation learning.
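A rough sketch of the frame-feature pipeline described above, assuming InceptionV3 as the ImageNet-pretrained CNN and simple per-video 8-bit quantization as the compression step; both are stand-ins, not necessarily the exact model or compression scheme used to build the dataset.

import numpy as np
import tensorflow as tf

# ImageNet-pretrained CNN; the global-average-pooled activations stand in for
# the "hidden representation immediately prior to the classification layer".
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling='avg',
                                        weights='imagenet')

def frame_features(frames_rgb):
    """frames_rgb: [num_frames, 299, 299, 3] uint8 frames sampled at 1 fps."""
    x = tf.keras.applications.inception_v3.preprocess_input(
        tf.cast(frames_rgb, tf.float32))
    return cnn(x, training=False).numpy()          # [num_frames, 2048]

def quantize(features):
    """Lossy compression of float features to 8 bits per dimension."""
    lo, hi = float(features.min()), float(features.max())
    q = np.round((features - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    return q, (lo, hi)                               # keep range for dequantizing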
Content-based Related Video Recommendations
Nisarg Kothari
Advances in Neural Information Processing Systems (NIPS) Demonstration Track (2016)
Efficient Large Scale Video Classification
Balakrishnan Varadarajan
dblp computer science bibliography, http://dblp.org (2015) (to appear)
Abstract
Video classification has advanced tremendously over recent years. A large part of this improvement comes from work in the image classification community and from deep convolutional neural networks (CNNs), which produce results competitive with hand-crafted motion features. These networks have been adapted to use video frames in various ways and have yielded state-of-the-art classification results. We present two methods that build on this work and scale it up to millions of videos and hundreds of thousands of classes while maintaining a low computational cost. In the context of large-scale video processing, training CNNs on video frames is extremely time consuming due to the large number of frames involved. We propose to avoid this problem by training CNNs on either YouTube thumbnails or Flickr images, and then using these networks' outputs as features for other, higher-level classifiers. We discuss the challenges of achieving this and propose two models, one for frame-level and one for video-level classification. The first is a highly efficient mixture of experts, while the second is based on long short-term memory (LSTM) networks. We present results on the Sports-1M video dataset (1 million videos, 487 classes) and on a new dataset with 12 million videos and 150,000 labels.
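A minimal sketch of a video-level LSTM classifier over precomputed frame features, in the spirit of the second model described above; the layer sizes, the single LSTM layer, and the use of the Keras API are illustrative assumptions rather than the paper's architecture.

import tensorflow as tf

def build_lstm_classifier(feature_dim=1024, num_classes=487):
    """Multi-label classifier: frame-feature sequence -> LSTM -> sigmoid outputs."""
    inputs = tf.keras.Input(shape=(None, feature_dim))       # [time, feature_dim]
    x = tf.keras.layers.LSTM(512)(inputs)                     # final hidden state
    outputs = tf.keras.layers.Dense(num_classes, activation='sigmoid')(x)
    return tf.keras.Model(inputs, outputs)

model = build_lstm_classifier()
model.compile(optimizer='adam', loss='binary_crossentropy')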
Tracking Large-Scale Video Remix in Real-World Events
Lexing Xie
Xuming He
John R. Kender
Matthew L. Hill
John R. Smith
IEEE Transactions on Multimedia, 15, no. 6 (2013), pp. 1244-1254
Abstract
Content sharing networks, such as YouTube, contain traces of both explicit online interactions (such as likes, comments, or subscriptions) and latent interactions (such as quoting, or remixing, parts of a video). We propose visual memes, or frequently re-posted short video segments, for detecting and monitoring such latent video interactions at scale. We develop scalable detection algorithms that extract visual memes with high accuracy, and we further augment visual memes with text via a statistical model of latent topics. We model content interactions on YouTube with visual memes, defining several measures of influence and building predictive models for meme popularity. Experiments are carried out on over 2 million video shots from more than 40,000 videos covering two prominent news events in 2009: the election in Iran and the swine flu epidemic. In both events, a high percentage of videos contain remixed content, and it is apparent that traditional news media and citizen journalists play different roles in disseminating remixed content. We perform two quantitative evaluations, annotating visual memes and predicting their popularity. The proposed joint statistical model of visual memes and words outperforms an alternative concurrence model, with an average error of 2% for predicting meme volume and 17% for predicting meme lifespan.
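As a rough illustration of the kind of scalable near-duplicate matching that visual meme detection requires, the sketch below reduces each shot to a coarse fingerprint and groups collisions across different videos; this simple average-hash is a stand-in for, not a reproduction of, the detection algorithms developed in the paper.

import numpy as np
from collections import defaultdict

def shot_fingerprint(shot_frames):
    """Coarse binary hash of one shot.

    shot_frames: [num_frames, h, w] grayscale frames belonging to a single shot.
    """
    mean_frame = shot_frames.mean(axis=0)                 # average the shot
    row_step = max(mean_frame.shape[0] // 8, 1)
    col_step = max(mean_frame.shape[1] // 8, 1)
    small = mean_frame[::row_step, ::col_step][:8, :8]    # downsample to ~8x8
    bits = (small > small.mean()).astype(np.uint8).flatten()
    return bits.tobytes()

def find_memes(shots_by_video, min_videos=2):
    """Group shots whose fingerprints collide across at least min_videos videos."""
    buckets = defaultdict(set)
    for video_id, shots in shots_by_video.items():
        for shot in shots:
            buckets[shot_fingerprint(shot)].add(video_id)
    return {fp: vids for fp, vids in buckets.items() if len(vids) >= min_videos}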
Scene Aligned Pooling for Complex Video Recognition
Liangliang Cao
Yadong Mu
Shih-Fu Chang
Gang Hua
John R. Smith
ECCV (2012), pp. 688-701
Abstract
Real-world videos often contain dynamic backgrounds and evolving human activities, especially web videos generated by users in unconstrained scenarios. This paper proposes a new visual representation, namely scene aligned pooling, for the task of event recognition in complex videos. Based on the observation that a video clip is often composed of shots of different scenes, the key idea of scene aligned pooling is to decompose video features into concurrent scene components and to construct classification models adapted to different scenes. Experiments on two large-scale real-world datasets, the TRECVID Multimedia Event Detection 2011 benchmark and the Human Motion Recognition Database (HMDB), show that the new representation consistently improves a wide range of visual features by a significant margin: low-level color and texture features, mid-level histograms of local descriptors such as SIFT or space-time interest points, and high-level semantic model features. For example, we improve the state-of-the-art accuracy on the HMDB dataset by 20%.
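A minimal sketch of the pooling step, assuming frame-level features and per-frame scene assignments are already available; average pooling and the dictionary output are illustrative choices, not the paper's full scene-detection and classification pipeline.

import numpy as np

def scene_aligned_pooling(frame_features, scene_ids):
    """Pool frame features separately within each scene of a video.

    frame_features: [num_frames, dim] feature vectors.
    scene_ids:      [num_frames] integer scene assignment per frame
                    (how scenes are detected is outside this sketch).
    Returns a dict mapping scene id -> average-pooled feature for that scene,
    which downstream scene-specific classifiers could consume.
    """
    pooled = {}
    for scene in np.unique(scene_ids):
        pooled[int(scene)] = frame_features[scene_ids == scene].mean(axis=0)
    return pooled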
Multimedia Semantics: Interactions Between Content and Community
Hari Sundaram
Lexing Xie
Munmun De Choudhury
Yu-Ru Lin
Proceedings of the IEEE, 100, no. 9 (2012)
Abstract
This paper reviews the state of the art and some emerging issues in research areas related to pattern analysis and monitoring of web-based social communities. This research area is important for several reasons. First, the presence of near-ubiquitous low-cost computing and communication technologies has enabled people to access and share information at an unprecedented scale. The scale of the data necessitates new research for making sense of such content. Furthermore, popular websites with sophisticated media sharing and notification features allow users to stay in touch with friends and loved ones; these sites also help to form explicit and implicit social groups. These social groups are an important source of information to organize and to manage multimedia data. In this article, we study how media-rich social networks provide additional insight into familiar multimedia research problems, including tagging and video ranking. In particular, we advance the idea that the contextual and social aspects of media are as important for successful multimedia applications as is the media content. We examine the interrelationship between content and social context through the prism of three key questions. First, how do we extract the context in which social interactions occur? Second, does social interaction provide value to the media object? Finally, how do social media facilitate the repurposing of shared content and engender cultural memes? We present three case studies to examine these questions in detail. In the first case study, we show how to discover structure latent in the social media data, and use the discovered structure to organize Flickr photo streams. In the second case study, we discuss how to determine the interestingness of conversations---and of participants---around videos uploaded to YouTube. Finally, we show how the analysis of visual content, in particular tracing of content remixes, can help us understand the relationship among YouTube participants. For each case, we present an overview of recent work and review the state of the art. We also discuss two emerging issues related to the analysis of social networks---robust data sampling and scalable data analysis.