Yair Alon
Prev. Movshovitz-Attias
I'm a staff software engineer in Google A.I. Perception interested in machine learning and computer vision.
My research covers efficient inference in deep models, model pruning, compression, anomaly detection, and dense prediction, and spans a wide range of computer vision problems such as image classification, semantic segmentation, and metric learning.
My research has been published in top-tier conferences such as CVPR, ECCV, and SIGGRAPH. Models using my work were deployed in major Google products such as Google Photos, Pixel Phone Camera, and Street View. I have a PhD in Computer Science from Carnegie Mellon University, where I was lucky to be co-advised by Yaser Sheikh and Takeo Kanade. More information at my Google Scholar Profile.
Authored Publications
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk
Xiuye Gu
Grant Schindler
Rachel Hornung
Vighnesh Birodkar
Jimmy Yan
Ming-Chang Chiu
Hassan Akbari
Josh Dillon
Agrim Gupta
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez
Kihyuk Sohn
Xuan Yang
Huisheng Wang
Lu Jiang
ICML (2024)
Preview abstract
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
Fine-Grained Stochastic Architecture Search
Shraman Ray Chaudhuri
Hanhan Li
Max Moroz
ICLR Workshop on Neural Architecture Search (2020)
Preview abstract
State-of-the-art deep networks are often too large to deploy on mobile devices and embedded systems. Mobile neural architecture search (NAS) methods automate the design of small models but state-of-the-art NAS methods are expensive to run. Differentiable neural architecture search (DNAS) methods reduce the search cost but explore a limited subspace of candidate architectures. In this paper, we introduce Fine-Grained Stochastic Architecture Search (FiGS), a differentiable search method that searches over a much larger set of candidate architectures. FiGS simultaneously selects and modifies operators in the search space by applying a structured sparse regularization penalty based on the Logistic-Sigmoid distribution. We show results across 3 existing search spaces, matching or outperforming the original search algorithms and producing state-of-the-art parameter-efficient models on ImageNet (e.g., 75.4% top-1 with 2.6M params). Using our architectures as backbones for object detection with SSDLite, we achieve significantly higher mAP on COCO (e.g., 25.8 with 3.0M params) than MobileNetV3 and MnasNet.
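As a rough illustration of the Logistic-Sigmoid distribution mentioned in this abstract, the sketch below samples a relaxed gate in (0, 1) by passing logistic noise through a temperature-scaled sigmoid. The names `log_alpha`, `temperature`, `sample_gate`, and `mean_gate` are hypothetical, and the abstract does not say how FiGS attaches such gates to operators or builds its penalty, so this shows only the distribution itself under those assumptions.

```python
import math
import random


def stable_sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha, temperature, u):
    """Map a uniform sample u in (0, 1) to a relaxed gate in (0, 1).

    log_alpha (hypothetical name) is the learnable location parameter;
    temperature controls how close the gate is to a hard 0/1 value.
    """
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return stable_sigmoid((log_alpha + logistic_noise) / temperature)


def mean_gate(log_alpha, temperature=0.5, n=2000, seed=0):
    """Monte Carlo estimate of the expected gate value."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # Clamp away from 0 and 1 so the logs above stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        total += sample_gate(log_alpha, temperature, u)
    return total / n
```

A large positive `log_alpha` keeps the gate mostly open and a large negative one keeps it mostly closed, which is what would let a sparsity penalty on such gates switch parts of a network off.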
Sky Optimization: Semantically aware image processing in low-light photography
Yun-Ta Tsai
Huizhong Chen
(2020), pp. 526-527
Preview abstract
The sky is a major component of the appearance of a photograph, and its color and tone can strongly influence the mood of a picture. In nighttime photography, the sky can also suffer from noise and color artifacts. For this reason, there is a strong desire to process the sky in isolation from the rest of the scene to achieve an optimal look.
In this work, we propose an automated method, which can run as a part of a camera pipeline, for creating accurate sky alpha-masks and using them to improve the appearance of the sky.
Our method performs end-to-end sky optimization in less than half a second per image on a mobile device.
We introduce a method for creating an accurate sky-mask dataset that is based on partially annotated images that are inpainted and refined by our modified weighted guided filter. We use this dataset to train a neural network for semantic sky segmentation.
Due to the compute and power constraints of mobile devices, sky segmentation is performed at a low image resolution. Our modified weighted guided filter is used for edge-aware upsampling to resize the alpha-mask to a higher resolution.
With this detailed mask we automatically apply post-processing steps to the sky in isolation, such as automatic spatially varying white-balance, brightness adjustments, contrast enhancement, and noise reduction.
Preview abstract
Knowledge Distillation is a popular method to reduce model size by transferring the knowledge of a large teacher model to a smaller student network. We show that it is possible to independently replace sub-parts of a network without accuracy loss. Based on this, we propose a distillation method that breaks the end-to-end paradigm by splitting the teacher architecture into smaller sub-networks, also called neighbourhoods. For each neighbourhood we distill a student independently and then merge them into a single student model. We show that this process is significantly faster than Knowledge Distillation, and produces students of the same quality.
From Neighbourhood Distillation, we design Student Search, an architecture search that leverages the independently distilled candidates to explore an exponentially large search space of architectures and locally selects the best candidate to use for the student model.
We show applications of Neighbourhood Distillation and Student Search on CIFAR-10 and ImageNet models on model reduction and sparsification problems. Our method offers up to $4.6\times$ speed-up compared to end-to-end distillation methods while retaining the same performance.
Structured Multi-Hashing for Model Compression
Andrew Poon
Yerlan Idelbayev
Miguel A. Carreira-Perpinan
(2019)
Preview abstract
Despite the success of deep neural networks (DNNs), state-of-the-art models are too large to deploy on low-resource devices or common server configurations in which multiple models are held in memory. Model compression methods address this limitation by reducing the memory footprint, latency, or energy consumption of a model with minimal impact on accuracy. We focus on the task of reducing the number of learnable variables in the model. In this work we combine ideas from weight hashing and dimensionality reduction, resulting in a simple and powerful structured multi-hashing method based on matrix products that allows direct control of the model size of any deep network and is trained end-to-end. We demonstrate the strength of our approach by compressing models from the ResNet, EfficientNet, and MobileNet architecture families. Our method allows us to drastically decrease the number of variables while maintaining high accuracy. For instance, by applying our approach to EfficientNet-B4 (16M parameters) we reduce it to the size of B0 (5M parameters), while gaining over 3% in accuracy over the B0 baseline. On the commonly used benchmark CIFAR10 we reduce the ResNet32 model by 75% with no loss in quality, and are able to do a 10x compression while still achieving above 90% accuracy.
Synthetic Depth-of-Field with a Single-Camera Mobile Phone
Neal Wadhwa
Rahul Garg
David E. Jacobs
Bryan E. Feldman
Nori Kanazawa
Robert Carroll
Marc Levoy
SIGGRAPH (2018) (to appear)
Preview abstract
Shallow depth-of-field is commonly used by photographers to isolate a subject from a distracting background. However, standard cell phone cameras cannot produce such images optically, as their short focal lengths and small apertures capture nearly all-in-focus images. We present a system to computationally synthesize shallow depth-of-field images with a single mobile camera and a single button press. If the image is of a person, we use a person segmentation network to separate the person and their accessories from the background. If available, we also use dense dual-pixel auto-focus hardware, effectively a 2-sample light field with an approximately 1 millimeter baseline, to compute a dense depth map. These two signals are combined and used to render a defocused image. Our system can process a 5.4 megapixel image in 4 seconds on a mobile phone, is fully automatic, and is robust enough to be used by non-experts. The modular nature of our system allows it to degrade naturally in the absence of a dual-pixel sensor or a human subject.
No Fuss Distance Metric Learning using Proxies
Alexander Toshev
Sergey Ioffe
International Conference on Computer Vision (ICCV), IEEE (2017)
Preview abstract
We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity. Traditionally, for this problem supervision is expressed in the form of sets of points that follow an ordinal relationship -- an anchor point x is similar to a set of positive points Y, and dissimilar to a set of negative points Z, and a loss defined over these distances is minimized.
While the specifics of the optimization differ, in this work we collectively call this type of supervision Triplets and all methods that follow this pattern Triplet-Based methods. These methods are challenging to optimize. A main issue is the need for finding informative triplets, which is usually achieved by a variety of tricks such as increasing the batch size, hard or semi-hard triplet mining, etc. Even with these tricks, the convergence rate of such methods is slow. In this paper we propose to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points which are learned as well. These proxies approximate the original data points, so that a triplet loss over the proxies is a tight upper bound of the original loss. This proxy-based loss is empirically better behaved. As a result, the proxy-loss improves on state-of-the-art results for three standard zero-shot learning datasets, by up to 15 percentage points, while converging three times as fast as other triplet-based losses.
Preview abstract
Modern search engines receive large numbers of business-related, local-aware queries. Such queries are best answered using accurate, up-to-date business listings that contain representations of business categories. Creating such listings is a challenging task as businesses often change hands or close down. For businesses with street-side locations one can leverage the abundance of street-level imagery, such as Google Street View, to automate the process. However, while data is abundant, labeled data is not; the limiting factor is the creation of large-scale labeled training data. In this work, we utilize an ontology of geographical concepts to automatically propagate business category information and create a large, multi-label training dataset for fine-grained storefront classification. Our learner, which is based on the GoogLeNet/Inception Deep Convolutional Network architecture and classifies 208 categories, achieves human-level accuracy.