Apostol (Paul) Natsev

Apostol (Paul) Natsev is a software engineer and manager in the video content analysis group at Google Research. Previously, he was a research staff member and manager of the multimedia research group at IBM Research from 2001 to 2011. He received a master's degree and a Ph.D. in computer science from Duke University, Durham, NC, in 1997 and 2001, respectively. Dr. Natsev's research interests span the areas of image and video analysis and retrieval, machine perception, large-scale machine learning and recommendation systems. He is an author of more than 80 publications and his research has been recognized with several awards.
Authored Publications
    Large Scale Video Representation Learning via Relational Graph Clustering
    Hyodong Lee
    Joe Yue-Hei Ng
    Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
    Abstract: Representation learning is widely applied to various tasks on multimedia data, e.g., retrieval and search. One approach to learning useful representations is to exploit the relationships or similarities between examples. In this work, we explore two promising, scalable representation learning approaches in the video domain. Using hierarchical graph clusters built upon video-to-video similarities, we propose: 1) a smart negative sampling strategy that significantly boosts training efficiency with the triplet loss, and 2) a pseudo-classification approach that uses the clusters as pseudo-labels. The embeddings trained with the proposed methods are competitive on multiple video understanding tasks, including related video retrieval and video annotation. Both proposed methods are highly scalable, as verified by experiments on large-scale datasets.
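
The negative-sampling idea lends itself to a short sketch. Below is a minimal illustration, assuming precomputed cluster assignments and cluster neighborhoods from some video similarity graph; the data structures, margin value, and sampling rule are assumptions for the example, not the paper's configuration.

```python
# Illustrative sketch: triplet training with "smart" negatives drawn from
# neighboring clusters of a video similarity graph (not the paper's code).
import random
import numpy as np

def sample_triplet(cluster_to_videos, neighbor_clusters):
    """cluster_to_videos: {cluster_id: [video_id, ...]} (each list has >= 2 videos);
    neighbor_clusters: {cluster_id: [nearby cluster ids]}.
    Negatives come from nearby clusters: hard, but assumed unrelated."""
    anchor_cluster = random.choice(list(cluster_to_videos))
    anchor, positive = random.sample(cluster_to_videos[anchor_cluster], 2)
    negative_cluster = random.choice(neighbor_clusters[anchor_cluster])
    negative = random.choice(cluster_to_videos[negative_cluster])
    return anchor, positive, negative

def triplet_loss(emb_anchor, emb_positive, emb_negative, margin=0.2):
    """Standard triplet margin loss on embedding vectors (numpy arrays)."""
    d_ap = np.sum((emb_anchor - emb_positive) ** 2)
    d_an = np.sum((emb_anchor - emb_negative) ** 2)
    return max(0.0, float(d_ap - d_an + margin))
```

Sampling negatives from nearby (rather than random) clusters is what makes the triplets informative without requiring an expensive online hard-negative search.
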
    Large-Scale Training Framework for Video Annotation
    Seong Jae Hwang
    Balakrishnan Varadarajan
    Ariel Gordon
    Proc. of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), ACM (2019)
    Abstract: Video is one of the richest sources of information available online, but extracting deep insights from video content at internet scale remains an open problem, in terms of both depth and breadth of understanding and of scale. Over the last few years, the field of video understanding has made great strides thanks to the availability of large-scale video datasets and core advances in image, audio, and video modeling architectures. However, state-of-the-art architectures on small-scale datasets are frequently impractical to deploy at internet scale, both in terms of training such deep networks on hundreds of millions of videos and of deploying them for inference on billions of videos. In this paper, we present a MapReduce-based training framework that exploits both data parallelism and model parallelism to scale training of complex video models. The proposed framework uses alternating optimization and full-batch fine-tuning, and supports large Mixture-of-Experts classifiers with hundreds of thousands of mixtures. This enables a trade-off between model depth and breadth, and the ability to shift model capacity between shared (generalization) layers and per-class (specialization) layers. We demonstrate that the proposed framework reaches state-of-the-art performance on the largest public video datasets, YouTube-8M and Sports-1M, and can scale to datasets 100 times larger.
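
For intuition, here is a minimal sketch of a per-label mixture-of-experts classifier over a fixed video-level feature vector, the style of wide, shallow per-class model such a framework is designed to train at scale. The shapes, gating scheme, and random inputs are assumptions for the example; the paper's models are trained with alternating optimization inside the MapReduce framework.

```python
# Illustrative sketch of one label's mixture-of-experts scorer (an assumption,
# not the paper's exact model): a softmax gate mixes several logistic experts.
import numpy as np

def mixture_of_experts(x, gate_w, expert_w):
    """x: (d,) video-level feature; gate_w, expert_w: (num_experts, d) weights.
    Returns P(label | x) as a softmax-gated mixture of logistic experts."""
    logits = gate_w @ x
    gates = np.exp(logits - logits.max())            # numerically stable softmax
    gates /= gates.sum()
    experts = 1.0 / (1.0 + np.exp(-(expert_w @ x)))  # per-expert probabilities
    return float(gates @ experts)

# Toy usage with random (scaled) weights standing in for trained parameters.
d, num_experts = 1024, 4
rng = np.random.default_rng(0)
prob = mixture_of_experts(rng.normal(size=d),
                          rng.normal(size=(num_experts, d)) / np.sqrt(d),
                          rng.normal(size=(num_experts, d)) / np.sqrt(d))
```

Because each label gets its own small mixture, capacity can be shifted between shared feature layers and these per-class layers, which is the depth-versus-breadth trade-off the abstract describes.
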
    Collaborative Deep Metric Learning for Video Understanding
    Balakrishnan Varadarajan
    Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM (2018)
    Abstract: The goal of video understanding is to develop algorithms that enable machines to understand videos at the level of human experts. Researchers have tackled various domains, including video classification, search, and personalized recommendation. However, there is a research gap in combining these domains in one unified learning framework. To that end, we propose a deep network that embeds videos, using their audio-visual content, into a metric space that preserves video-to-video relationships. We then use the trained embedding network to tackle various tasks, including video classification and recommendation, showing significant improvements over state-of-the-art baselines. The proposed approach is highly scalable and can be deployed on large-scale video sharing platforms like YouTube.
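
A minimal sketch of the embedding step is shown below, assuming precomputed audio and visual features and random weights standing in for a trained network; the layer sizes and two-layer architecture are illustrative assumptions, not the paper's.

```python
# Illustrative sketch (not the paper's architecture): map concatenated
# audio-visual features into an L2-normalized embedding, so related videos
# end up with high dot-product similarity.
import numpy as np

def embed_video(audio_feat, visual_feat, w1, w2):
    """audio_feat: (da,), visual_feat: (dv,); w1: (h, da+dv); w2: (e, h).
    Returns a unit-norm embedding of size e."""
    x = np.concatenate([audio_feat, visual_feat])
    h = np.maximum(w1 @ x, 0.0)           # ReLU hidden layer
    e = w2 @ h
    return e / np.linalg.norm(e)          # normalize so dot product = cosine
```

Once such an embedding exists, the downstream tasks (classification, related-video recommendation) can all operate on the same fixed-size vectors.
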
    The Kinetics Human Action Video Dataset
    Andrew Zisserman
    Joao Carreira
    Karen Simonyan
    Will Kay
    Brian Zhang
    Chloe Hillier
    Fabio Viola
    Tim Green
    Trevor Back
    Mustafa Suleyman
    arXiv (2017)
    Abstract: We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human-focused and cover a broad range of classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset and how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset.
    Content-based Related Video Recommendations
    Nisarg Kothari
    Advances in Neural Information Processing Systems (NIPS) Demonstration Track (2016)
    Abstract: This is a demo of related video recommendations, seeded from random YouTube videos and based purely on video content signals. Traditional recommendation systems using collaborative filtering (CF) approaches suggest related videos for a given seed based on how many users have watched a particular candidate video right after watching the seed video. This does not take the video content into account but relies on aggregate user behavior. Traditional CF approaches work very well when the seed and the candidate videos are relatively popular: they must be watched in sequence by many users in order to be identified as related by the CF system. In this demo, we focus on the cold-start problem, where the seed and/or the candidate video are freshly uploaded (or undiscovered), so the CF system cannot identify any related videos for them. Being able to recommend freshly uploaded videos, as well as good related videos for fresh video seeds, is important for improving freshness and user engagement. We model this as a content-based video similarity learning problem, and learn deep video embeddings trained to predict ground-truth video relationships (identified by a CF co-watch-based system) using only visual content. The system does not depend on the availability of video metadata or any click information, and can generalize to both popular and tail content, as well as new video uploads. It embeds any new video into a 1024-dimensional representation based on its content, and pairwise video similarity is computed simply as a dot product in the embedding space. We show that the learned video embeddings generalize beyond simple visual similarity and are able to capture complex semantic relationships.
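
Since similarity is just a dot product in the embedding space, the retrieval side reduces to a nearest-neighbor lookup. A minimal sketch, with brute-force search as an assumption (a production system would use an approximate nearest-neighbor index):

```python
# Illustrative sketch: related-video lookup over precomputed 1024-d unit-norm
# embeddings via a brute-force dot product (assumption; not the demo's system).
import numpy as np

def related_videos(seed_emb, corpus_embs, corpus_ids, top_k=10):
    """seed_emb: (1024,); corpus_embs: (N, 1024); corpus_ids: list of N ids.
    Returns the top_k (video_id, score) pairs by cosine similarity."""
    scores = corpus_embs @ seed_emb              # dot product = cosine (unit norm)
    best = np.argsort(-scores)[:top_k]
    return [(corpus_ids[i], float(scores[i])) for i in best]
```

Because a fresh upload gets an embedding from its content alone, it can be both a seed and a candidate immediately, which is exactly the cold-start case the demo targets.
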
    YouTube-8M: A Large-Scale Video Classification Benchmark
    arXiv (2016)
    Abstract: Many recent advancements in computer vision are attributed to large datasets. Open-source software packages for machine learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale: it is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no comparably sized video classification datasets. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video) annotated with a vocabulary of 4803 visual entities. To get the videos and their (multiple) labels, we used the YouTube Data APIs. We filtered the video labels (Freebase topics) using both automated and manual curation strategies, including asking Mechanical Turk workers whether the labels are visually recognizable. Then, we decoded each video at one frame per second, and used a deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and made both the features and video-level labels available for download. The dataset contains frame-level features for over 1.9 billion video frames and 8 million videos, making it the largest public multi-label video dataset. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report the results as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using the publicly available TensorFlow framework. We plan to release code for training a basic TensorFlow model and for computing metrics. We show that pre-training on large data generalizes to other datasets like Sports-1M and ActivityNet. We achieve state-of-the-art results on ActivityNet, improving mAP from 53.8% to 77.8%. We hope that the unprecedented scale and diversity of YouTube-8M will lead to advances in video understanding and representation learning.
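
The per-frame feature pipeline described above can be sketched roughly as follows. The CNN here is a random stub standing in for an ImageNet-pretrained network's pre-classification layer, and the min-max 8-bit quantization is an assumed compression scheme for illustration, not the dataset's exact recipe.

```python
# Illustrative sketch of the frame-feature pipeline: sample frames at 1 fps,
# take a pretrained CNN's penultimate activations, then compress for storage.
import numpy as np

def cnn_penultimate_features(frame):
    """Stub standing in for a real ImageNet-pretrained CNN's penultimate layer."""
    rng = np.random.default_rng(abs(hash(frame.tobytes())) % (2 ** 32))
    return rng.normal(size=2048).astype(np.float32)

def extract_and_compress(frames_at_1fps):
    """frames_at_1fps: non-empty list of decoded frames (numpy arrays)."""
    feats = np.stack([cnn_penultimate_features(f) for f in frames_at_1fps])
    lo, hi = float(feats.min()), float(feats.max())
    quantized = np.round(255.0 * (feats - lo) / (hi - lo + 1e-8)).astype(np.uint8)
    return quantized, (lo, hi)                # keep the scale to dequantize later
```

Storing compact frame-level features instead of raw video is what makes it feasible to distribute and train on billions of frames.
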
    Efficient Large Scale Video Classification
    Balakrishnan Varadarajan
    arXiv (2015)
    Abstract: Video classification has advanced tremendously over recent years. A large part of this improvement is due to work by the image classification community and the use of deep convolutional networks (CNNs), which produce results competitive with hand-crafted motion features. These networks have been adapted to use video frames in various ways and have yielded state-of-the-art classification results. We present two methods that build on this work and scale it up to millions of videos and hundreds of thousands of classes while maintaining a low computational cost. In the context of large-scale video processing, training CNNs on video frames is extremely time consuming, due to the large number of frames involved. We propose to avoid this problem by training CNNs on either YouTube thumbnails or Flickr images, and then using these networks' outputs as features for other, higher-level classifiers. We discuss the challenges of achieving this and propose two models, for frame-level and video-level classification. The first is a highly efficient mixture of experts, while the second is based on long short-term memory (LSTM) networks. We present results on the Sports-1M video dataset (1 million videos, 487 classes) and on a new dataset with 12 million videos and 150,000 labels.
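
To make the "CNN outputs as features for higher-level classifiers" idea concrete, here is a deliberately simple sketch: frame-level features are pooled into one video-level vector and scored against many labels with independent logistic classifiers. The average-pooling stand-in and per-label logistic scorers are assumptions for the example; the paper's higher-level models are a mixture of experts and an LSTM.

```python
# Illustrative sketch (assumed simplification): pool frame-level CNN features
# into a video-level vector and score it against many labels at once.
import numpy as np

def video_level_scores(frame_features, label_weights, label_bias):
    """frame_features: (num_frames, d); label_weights: (num_labels, d);
    label_bias: (num_labels,). Returns independent per-label probabilities."""
    pooled = frame_features.mean(axis=0)             # average-pool over time
    logits = label_weights @ pooled + label_bias
    return 1.0 / (1.0 + np.exp(-logits))
```

Keeping the CNN frozen and training only this lightweight top layer is what keeps the computational cost low at the scale of millions of videos.
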
    Tracking Large-Scale Video Remix in Real-World Events
    Lexing Xie
    Xuming He
    John R. Kender
    Matthew L. Hill
    John R. Smith
    IEEE Transactions on Multimedia, 15, no. 6 (2013), pp. 1244-1254
    Abstract: Content sharing networks, such as YouTube, contain traces of both explicit online interactions (such as likes, comments, or subscriptions) and latent interactions (such as quoting, or remixing, parts of a video). We propose visual memes, or frequently re-posted short video segments, for detecting and monitoring such latent video interactions at scale. Visual memes are extracted with high accuracy by scalable detection algorithms that we develop. We further augment visual memes with text, via a statistical model of latent topics. We model content interactions on YouTube with visual memes, defining several measures of influence and building predictive models for meme popularity. Experiments are carried out with over 2 million video shots from more than 40,000 videos on two prominent news events in 2009: the election in Iran and the swine flu epidemic. In these two events, a high percentage of videos contain remixed content, and it is apparent that traditional news media and citizen journalists play different roles in disseminating remixed content. We perform two quantitative evaluations, for annotating visual memes and for predicting their popularity. The proposed joint statistical model of visual memes and words outperforms an alternative concurrence model, with an average error of 2% for predicting meme volume and 17% for predicting meme lifespan.
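
As a rough illustration of the meme-detection step (an assumed simplification, not the paper's algorithm), one can fingerprint shots with any near-duplicate signature and flag fingerprints that recur across many distinct videos:

```python
# Illustrative sketch: candidate "visual memes" are shot fingerprints that
# appear in many distinct videos. The fingerprinting itself is assumed given.
from collections import defaultdict

def find_visual_memes(shots, min_videos=5):
    """shots: iterable of (video_id, shot_fingerprint) pairs, where the
    fingerprint is any hashable near-duplicate signature of a shot.
    Returns {fingerprint: set of video ids} for fingerprints re-posted
    in at least min_videos distinct videos."""
    videos_per_fingerprint = defaultdict(set)
    for video_id, fingerprint in shots:
        videos_per_fingerprint[fingerprint].add(video_id)
    return {fp: vids for fp, vids in videos_per_fingerprint.items()
            if len(vids) >= min_videos}
```
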
    Multimedia Semantics: Interactions Between Content and Community
    Hari Sundaram
    Lexing Xie
    Munmun De Choudhury
    Yu-Ru Lin
    Proceedings of the IEEE, 100, no. 9 (2012)
    Abstract: This paper reviews the state of the art and some emerging issues in research areas related to pattern analysis and monitoring of web-based social communities. This research area is important for several reasons. First, the presence of near-ubiquitous low-cost computing and communication technologies has enabled people to access and share information at an unprecedented scale. The scale of the data necessitates new research for making sense of such content. Furthermore, popular websites with sophisticated media sharing and notification features allow users to stay in touch with friends and loved ones; these sites also help to form explicit and implicit social groups. These social groups are an important source of information for organizing and managing multimedia data. In this article, we study how media-rich social networks provide additional insight into familiar multimedia research problems, including tagging and video ranking. In particular, we advance the idea that the contextual and social aspects of media are as important for successful multimedia applications as the media content itself. We examine the interrelationship between content and social context through the prism of three key questions. First, how do we extract the context in which social interactions occur? Second, does social interaction provide value to the media object? Finally, how do social media facilitate the repurposing of shared content and engender cultural memes? We present three case studies to examine these questions in detail. In the first case study, we show how to discover structure latent in social media data, and use the discovered structure to organize Flickr photo streams. In the second case study, we discuss how to determine the interestingness of conversations, and of participants, around videos uploaded to YouTube. Finally, we show how the analysis of visual content, in particular the tracing of content remixes, can help us understand the relationships among YouTube participants. For each case, we present an overview of recent work and review the state of the art. We also discuss two emerging issues related to the analysis of social networks: robust data sampling and scalable data analysis.
    Scene Aligned Pooling for Complex Video Recognition
    Liangliang Cao
    Yadong Mu
    Shih-Fu Chang
    Gang Hua
    John R. Smith
    ECCV (2012), pp. 688-701
    Abstract: Real-world videos often contain dynamic backgrounds and evolving people activities, especially web videos generated by users in unconstrained scenarios. This paper proposes a new visual representation, namely scene aligned pooling, for the task of event recognition in complex videos. Based on the observation that a video clip is often composed of shots of different scenes, the key idea of scene aligned pooling is to decompose any video features into concurrent scene components, and to construct classification models adaptive to different scenes. Experiments on two large-scale real-world datasets, the TRECVID Multimedia Event Detection 2011 dataset and the Human Motion Recognition Database (HMDB), show that our new visual representation consistently improves, by a significant margin, various kinds of visual features, such as low-level color and texture features, mid-level histograms of local descriptors (e.g., SIFT or space-time interest points), and high-level semantic model features. For example, we improve the state-of-the-art accuracy on the HMDB dataset by 20%.
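
A minimal sketch of the pooling idea follows, assuming frame features and per-frame scene assignments are already computed, and that per-scene models are simple linear scorers combined by a max; both simplifications are assumptions for the example rather than the paper's method.

```python
# Illustrative sketch of scene-aligned pooling: pool features within each scene
# separately, then score the event with per-scene models and combine.
import numpy as np

def scene_aligned_pooling(frame_features, scene_ids):
    """frame_features: (num_frames, d); scene_ids: (num_frames,) int array.
    Returns {scene_id: mean-pooled feature for that scene}."""
    return {int(s): frame_features[scene_ids == s].mean(axis=0)
            for s in np.unique(scene_ids)}

def event_score(scene_pools, scene_models):
    """scene_models: {scene_id: (d,) linear weights}. Combine per-scene
    scores; here the combination rule is simply the max over scenes."""
    return max(float(scene_models[s] @ feat) for s, feat in scene_pools.items())
```

Pooling per scene keeps features from visually different shots from being averaged together, which is what lets the per-scene classifiers adapt to each scene type.
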