Jay Yagnik

Jay Yagnik is currently a Vice President and Engineering Fellow at Google, leading large parts of Google AI. At Google he has led many foundational research efforts in machine learning and perception, computer vision, video understanding, privacy-preserving machine learning, quantum AI, applied sciences, and more. He has also created multiple engineering and product successes for the company in areas including Google Photos, YouTube, Search, Ads, Android, Maps, and Hardware. Jay's research interests span deep learning, reinforcement learning, scalable matching, graph information propagation, image representation and recognition, temporal information mining, and sparse networks.

Jay is an alumnus of the Indian Institute of Science, where he completed his graduate studies, and the Institute of Technology, Nirma University, where he completed his undergraduate studies.

Authored Publications
    SmartChoices: Hybridizing Programming and Machine Learning
    Alexander Daryin
    Thomas Deselaers
    Nikhil Sarda
    Reinforcement Learning for Real Life (RL4RealLife) Workshop at the 36th International Conference on Machine Learning (ICML) (2019)
    Abstract: We present SmartChoices, an approach to making machine learning (ML) a first-class citizen in programming languages, which we see as one way to lower the entry cost of applying ML to problems in new domains. There is a growing divide in approaches to building systems: on one hand, programming leverages human experts to define a system, while on the other, machine learning learns behavior from data. We propose to hybridize the two by providing a 3-call API, which we expose through an object called a SmartChoice. We describe the SmartChoices interface, show how it can be used in programming with minimal code changes, and demonstrate that it is an easy-to-use yet powerful tool by showing improvements over not using ML at all on three algorithmic problems: binary search, QuickSort, and caches. In these three examples, we replace the commonly used heuristics with an ML model entirely encapsulated within a SmartChoice, requiring minimal code changes. Unlike previous work applying ML to algorithmic problems, our approach does not require dropping existing implementations; it integrates seamlessly into the standard software development workflow and gives the software developer full control over how ML methods are applied. Our implementation relies on standard Reinforcement Learning (RL) methods. To learn faster, we use the heuristic function being replaced as an initial function. We show how this initial function can be used to speed up and stabilize learning while providing a safety net that prevents performance from becoming substantially worse, allowing for safe deployment in real-life critical applications.
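The 3-call-API idea can be sketched in a few lines of Python. This is a hypothetical, minimal illustration: the class name SmartChoice matches the paper, but the predict/feedback method names, the epsilon-greedy exploration, and the toy history-based learner are assumptions made for this example, not the paper's implementation. The supplied heuristic serves as the initial function and safety net:

```python
import random

class SmartChoice:
    """Minimal sketch of a SmartChoice-style learned value.

    The heuristic it replaces is kept as the initial function, so
    behavior starts no worse than the hand-written code.
    """

    def __init__(self, heuristic, epsilon=0.1):
        self.heuristic = heuristic   # safety net / initial function
        self.epsilon = epsilon       # exploration rate
        self.history = []            # (context, value, reward) triples

    def predict(self, context):
        # Until data accumulates, fall back to the heuristic; explore
        # the learned policy occasionally to gather feedback.
        if self.history and random.random() < self.epsilon:
            return self._learned(context)
        return self.heuristic(context)

    def _learned(self, context):
        # Placeholder learner: reuse the highest-reward past value
        # (a real implementation would use standard RL methods).
        return max(self.history, key=lambda t: t[2])[1]

    def feedback(self, context, value, reward):
        self.history.append((context, value, reward))

# Usage: replace the midpoint heuristic in binary search.
choice = SmartChoice(heuristic=lambda ctx: (ctx[0] + ctx[1]) // 2)

def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = choice.predict((lo, hi))
        mid = min(max(mid, lo), hi)   # clamp: keep any choice safe
        if arr[mid] == target:
            choice.feedback((lo, hi), mid, reward=1.0)
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

Because the predicted index is clamped to the valid range, the search stays correct no matter what value the SmartChoice returns, which mirrors the safety-net property the abstract describes.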
    Deep Networks With Large Output Spaces
    Jonathon Shlens
    Rajat Monga
    International Conference on Learning Representations (2015)
    Abstract: Deep neural networks have been extremely successful at various image, speech, and video recognition tasks because of their ability to model deep structures within the data. However, they are still prohibitively expensive to train and apply to problems containing millions of classes in the output layer. Based on the observation that the key computation common to most neural network layers is a vector/matrix product, we propose a fast locality-sensitive hashing technique to approximate the actual dot product, enabling us to scale training and inference to millions of output classes. We evaluate our technique on three diverse large-scale recognition tasks and show that our approach can train large-scale models at a faster rate (in terms of steps/total time) than baseline methods.
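As a rough illustration of the idea (not the paper's exact scheme), random-hyperplane hashing can rank output classes by signature agreement so that only a small shortlist ever needs exact dot products; the function names and sizes below are illustrative assumptions:

```python
import numpy as np

def signatures(vecs, planes):
    """Binary LSH signatures: sign of projection onto random hyperplanes."""
    return vecs @ planes.T > 0                    # (n, num_bits) booleans

def approx_top_k(x, W, planes, k=10, shortlist=100):
    """Rank output classes by signature agreement with x, then score
    only a small shortlist with exact dot products."""
    sig_w = signatures(W, planes)                 # class signatures
    sig_x = planes @ x > 0                        # input signature
    agreement = (sig_w == sig_x).sum(axis=1)      # Hamming similarity
    cand = np.argsort(-agreement)[:shortlist]     # candidate classes
    exact = W[cand] @ x                           # few exact products
    return cand[np.argsort(-exact)[:k]]
```

The full vector/matrix product over all classes is replaced by cheap bit comparisons plus a constant number of exact products, which is the scaling behavior the abstract targets. (In practice the class signatures would be precomputed and stored, not recomputed per query.)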
    Discriminative Segment Annotation in Weakly Labeled Video
    Kevin Tang
    Li Fei-Fei
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013)
    Abstract: This paper tackles the problem of segment annotation in complex Internet videos. Given a weakly labeled video, we automatically generate spatiotemporal masks for each of the concepts with which it is labeled. This is a particularly relevant problem in the video domain, as large numbers of YouTube videos are now available tagged with the visual concepts that they contain. Given such weakly labeled videos, we focus on the problem of spatiotemporal segment classification. We propose a straightforward algorithm, CRANE, that utilizes large amounts of weakly labeled video to rank spatiotemporal segments by the likelihood that they correspond to a given visual concept. We make segment-level annotations publicly available for a subset of the Prest et al. dataset and show convincing results. We also show state-of-the-art results on Hartmann et al.'s more difficult, large-scale object segmentation dataset.
    Fast, Accurate Detection of 100,000 Object Classes on a Single Machine
    Thomas Dean
    Mark Ruzon
    Mark Segal
    Jonathon Shlens
    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA (2013)
    Abstract: Many object detection systems are constrained by the time required to convolve a target image with a bank of filters that code for different aspects of an object's appearance, such as the presence of component parts. We exploit locality-sensitive hashing to replace the dot-product kernel operator in the convolution with a fixed number of hash-table probes that effectively sample all of the filter responses in time independent of the size of the filter bank. To show the effectiveness of the technique, we apply it to evaluate 100,000 deformable-part models, requiring over a million (part) filters, on multiple scales of a target image in less than 20 seconds using a single multi-core processor with 20GB of RAM. This represents a speed-up of approximately 20,000 times (four orders of magnitude) compared with performing the convolutions explicitly on the same hardware. While mean average precision over the full set of 100,000 object classes is around 0.16, due in large part to the challenges of gathering training data and collecting ground truth for so many classes, we achieve a mAP of at least 0.20 on a third of the classes and 0.30 or better on about 20% of the classes.
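A sketch of the hash-table-probe idea follows. The band count, bits per band, and sign-bit hashing are assumptions chosen for illustration (the paper itself builds on WTA hashing); what the sketch shows is how per-band collision counts stand in for filter responses at a cost that depends on bucket occupancy, not on the number of filters:

```python
import numpy as np
from collections import defaultdict

NUM_BANDS, BITS_PER_BAND = 32, 8        # illustrative sizes

def band_keys(vecs, planes):
    """Per-band integer keys built from sign-of-projection bits."""
    bits = vecs @ planes.T > 0                        # (n, bands*bits)
    bits = bits.reshape(len(vecs), NUM_BANDS, BITS_PER_BAND)
    weights = 1 << np.arange(BITS_PER_BAND)
    return bits @ weights                             # (n, bands) ints

def build_tables(filters, planes):
    """One hash table per band, mapping band key -> filter ids."""
    tables = [defaultdict(list) for _ in range(NUM_BANDS)]
    for fid, row in enumerate(band_keys(filters, planes)):
        for band, key in enumerate(row):
            tables[band][int(key)].append(fid)
    return tables

def collision_counts(window, planes, tables, num_filters):
    """Probe each band's table once; collision counts summed over
    bands serve as a proxy for the filter responses."""
    keys = band_keys(window[None, :], planes)[0]
    counts = np.zeros(num_filters, dtype=int)
    for band, key in enumerate(keys):
        for fid in tables[band].get(int(key), []):
            counts[fid] += 1
    return counts
```

Filters whose counts rank highest would then be rescored with exact dot products, matching the approximate-then-verify structure described in the technical supplement below.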
    Fast, Accurate Detection of 100,000 Object Classes on a Single Machine: Technical Supplement
    Thomas Dean
    Mark Ruzon
    Mark Segal
    Jonathon Shlens
    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA (2013)
    Abstract: In the companion paper published in CVPR 2013, we presented a method that can directly use deformable part models (DPMs) trained as in [Felzenszwalb et al., CVPR 2008]. After training, HOG-based part filters are hashed, and, during inference, counts of hashing collisions summed over all hash bands serve as a proxy for part-filter / sliding-window dot products, i.e., filter responses. These counts are an approximation, so we take the original HOG-based filters for the top hash counts and calculate the exact dot products for scoring. It is possible to train DPM models not on HOG data but on a hashed WTA [Yagnik et al., ICCV 2011] version of this data. The resulting part filters are sparse, real-valued vectors the size of WTA vectors computed from sliding windows. Given the WTA hash of a window, we exactly recover the dot products of the top responses using an extension of locality-sensitive hashing. In this supplement, we sketch a method for training such WTA-based models.
    The Power of Comparative Reasoning
    Dennis Strelow
    Ruei-Sung Lin
    International Conference on Computer Vision, IEEE (2011)
    Abstract: Rank correlation measures are known for their resilience to perturbations in numeric values and are widely used in many evaluation metrics. Such ordinal measures have rarely been applied as a representational transformation of numeric features. We emphasize the benefits of ordinal representations of input features both theoretically and empirically. We present a family of algorithms for computing ordinal embeddings based on partial order statistics. Apart from having the stability benefits of ordinal measures, these embeddings are highly nonlinear, giving rise to sparse feature spaces highly favored by several machine learning methods. The embeddings are deterministic and data-independent, and, by virtue of being based on partial order statistics, add another degree of resilience to noise. When applied to the task of fast similarity search, these machine-learning-free methods outperform state-of-the-art machine learning methods with complex optimization setups. For classification problems, the embeddings provide a nonlinear transformation resulting in sparse binary codes that are well suited to a large class of machine learning algorithms. These methods show significant improvement on VOC 2010 using simple linear classifiers that can be trained quickly. Our method can be extended to the case of polynomial kernels while permitting very efficient computation. Further, since the popular MinHash algorithm is a special case of our method, we demonstrate an efficient scheme for computing MinHash on conjunctions of binary features. The actual method can be implemented in about 10 lines of code in most languages (2 lines in MATLAB) and does not require any data-driven optimization.
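The core WTA (winner-take-all) hash really is only a few lines. A sketch in Python, where `permutations` is a list of fixed random permutations of the feature indices and K is the window size (both names are this example's, not the paper's):

```python
import numpy as np

def wta_hash(x, permutations, k):
    """Winner-Take-All hash: for each fixed permutation, look at the
    first K permuted elements of x and emit the index of the maximum.
    The codes depend only on the ordering of values, so they are
    invariant to any monotonic transformation of the input."""
    x = np.asarray(x)
    return np.array([int(np.argmax(x[p[:k]])) for p in permutations])
```

Because only partial order statistics survive, applying a monotonic transform such as log1p to the features leaves the codes unchanged, which is exactly the stability property the abstract emphasizes; the sparse binary representation follows by one-hot encoding each code.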
    Abstract: We present a system that automatically recommends tags for YouTube videos based solely on their audiovisual content. We also propose a novel framework for unsupervised discovery of video categories that exploits knowledge mined from World-Wide Web text documents and searches. First, the association between video content and tags is learned by training classifiers that map audiovisual content-based features from millions of videos on YouTube.com to the existing uploader-supplied tags for these videos. When a new video is uploaded, the labels provided by these classifiers are used to automatically suggest tags deemed relevant to the video. Our system has learned a vocabulary of over 20,000 tags. Second, we mined large volumes of Web pages and search queries to discover a set of possible text entity categories and a set of associated is-A relationships that map individual text entities to categories. Finally, we apply these is-A relationships mined from web text to the tags learned from the audiovisual content of videos to automatically synthesize a reliable set of categories most relevant to videos, along with a mechanism to predict these categories for new uploads. We then present rigorous rating studies that establish that: (a) the average relevance of tags automatically recommended by our system matches the average relevance of uploader-supplied tags at the same or better coverage, and (b) the average precision@K of video categories discovered by our system is 70% with K=5.
    SPEC Hashing: Similarity Preserving algorithm for Entropy-based Coding
    Ruei-Sung Lin
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010)
    A Large-Scale Taxonomic Classification System for Web-based Videos
    Reto Strobl
    John Zhang
    Proceedings of the 11th European Conference on Computer Vision (ECCV 2010)
    Taxonomic Classification for Web-based Videos
    Xiaoyun Wu
    IEEE Conf on Computer Vision and Pattern Recognition (CVPR), IEEE (2010)
    Abstract: A fast and robust method for video contrast enhancement is presented. The method uses the histogram of each frame, along with upper and lower bounds computed per shot, to enhance the current frame. This ensures that the artifacts introduced during enhancement are kept to a minimum. Traditional methods that do not compute per-shot estimates tend to over-enhance parts of the video such as fades and transitions. Our method does not suffer from this problem, which is essential for a fully automatic algorithm. We present the parameter settings that yielded the best human feedback: out of 208 videos, 203 were enhanced, while the remaining 5 were of too poor quality to be enhanced. Additionally, we present a visual comparison of our work with the recently proposed Weighted Thresholded Histogram Equalization (WTHE) algorithm.
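A minimal sketch of the per-shot idea (the percentile bounds and parameter values here are assumptions for illustration, not the paper's exact method): compute intensity bounds once over all frames of a shot, then stretch each frame against those shared bounds, so a dark frame inside a fade is not individually over-enhanced.

```python
import numpy as np

def enhance_shot(frames, lo_pct=1.0, hi_pct=99.0):
    """Stretch contrast using bounds computed over the whole shot,
    so fades and transitions within the shot are not over-enhanced
    the way per-frame equalization would."""
    pixels = np.concatenate([f.ravel() for f in frames])
    lo, hi = np.percentile(pixels, [lo_pct, hi_pct])  # shot-level bounds
    scale = 255.0 / max(hi - lo, 1e-6)
    return [np.clip((f - lo) * scale, 0, 255).astype(np.uint8)
            for f in frames]
```

Every frame in the shot is mapped through the same transfer function, which is what keeps the enhancement temporally consistent.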
    Tour the world: a technical demonstration of a web-scale landmark recognition engine
    Yan-Tao Zheng
    Ulrich Buddemeier
    Fernando Brucher
    Tat-Seng Chua
    MM '09: Proceedings of the 17th ACM international conference on Multimedia, ACM, New York, NY, USA (2009), pp. 961-962
    Abstract: This paper discusses a new method for automatic discovery and organization of descriptive concepts (labels) within large real-world corpora of user-uploaded multimedia, such as YouTube.com. Conversely, it also provides validation of existing labels, if any. During training, our method does not assume any explicit manual annotation other than the weak labels already available in the form of video title, description, and tags. Prior work on such auto-annotation assumed that a vocabulary of labels of interest (e.g., indoor, outdoor, city, landscape) is specified a priori. In contrast, the proposed method begins with an empty vocabulary. It analyzes audiovisual features of 25 million YouTube.com videos (nearly 150 years of video data), effectively searching for consistent correlation between these features and text metadata. It autonomously extends the label vocabulary as and when it discovers concepts it can reliably identify, eventually leading to a vocabulary with thousands of labels and growing. We believe that this work significantly extends the state of the art in multimedia data mining, discovery, and organization, based on the technical merit of the proposed ideas as well as the enormous scale of the mining exercise in a very challenging, unconstrained, noisy domain.
    Solving the label resolution problem in supervised video content classification
    Ullas Gargi
    MIR '08: Proceedings of the 1st ACM international conference on Multimedia information retrieval, ACM, New York, NY, USA (2008), pp. 276-282
    Learning people annotation from the web via consistency learning
    Atig Islam
    Proceedings of the International Workshop on Multimedia Information Retrieval, ACM, Augsburg, Germany (2007), pp. 285-290