

We design systems that enable computers to "understand" the world, via a range of modalities including audio, image, and video understanding.



About the team

The Perception team focuses on building systems that can interpret sensory data such as images, sound, and video. Our research helps power many products across Google: image and video understanding in Search and Google Photos, computational photography for Pixel phones and Google Maps, machine learning APIs for Google Cloud and YouTube, accessibility technologies like Live Transcribe, applications in Nest Hub Max, and mobile augmented reality experiences in Duo video calls, among others.

We actively contribute to open source and research communities, providing media processing technologies (e.g., MediaPipe) for building computer vision applications with TensorFlow. We have also released several large-scale datasets for machine learning, including AudioSet, AVA, Open Images, and YouTube-8M.

In doing all this, we adhere to AI principles to ensure that these technologies work well for everyone. We value innovation, collaboration, respect, and building an inclusive and diverse team and research community, and we work closely with the PAIR team to build ML Fairness frameworks.

Featured publications

(Almost) Zero-Shot Cross-Lingual Spoken Language Understanding
Manaal Faruqui
Gokhan Tur
Dilek Hakkani-Tur
Larry Heck
Proceedings of the IEEE ICASSP (2018)
Spoken language understanding (SLU) is a component of goal-oriented dialogue systems that aims to interpret user's natural language queries in system's semantic representation format. While current state-of-the-art SLU approaches achieve high performance for English domains, the same is not true for other languages. Approaches in the literature for extending SLU models and grammars to new languages rely primarily on machine translation. This poses a challenge in scaling to new languages, as machine translation systems may not be reliable for several (especially low resource) languages. In this work, we examine different approaches to train a SLU component with little supervision for two new languages -- Hindi and Turkish, and show that with only a few hundred labeled examples we can surpass the approaches proposed in the literature. Our experiments show that training a model bilingually (i.e., jointly with English), enables faster learning, in that the model requires fewer labeled instances in the target language to generalize. Qualitative analysis shows that rare slot types benefit the most from the bilingual training.
Aperture Supervision for Monocular Depth Estimation
Pratul Srinivasan
Neal Wadhwa
Ren Ng
CVPR (2018) (to appear)
We present a novel method to train machine learning algorithms to estimate scene depths from a single image, by using the information provided by a camera's aperture as supervision. Prior works use a depth sensor's outputs or images of the same scene from alternate viewpoints as supervision, while our method instead uses images from the same viewpoint taken with a varying camera aperture. To enable learning algorithms to use aperture effects as supervision, we introduce two differentiable aperture rendering functions that use the input image and predicted depths to simulate the depth-of-field effects caused by real camera apertures. We train a monocular depth estimation network end-to-end to predict the scene depths that best explain these finite aperture images as defocus-blurred renderings of the input all-in-focus image.
BLADE: Filter Learning for General Purpose Image Processing
John Isidoro
Sungjoon Choi
Frank Ong
International Conference on Computational Photography (2018)
The Rapid and Accurate Image Super Resolution (RAISR) method of Romano, Isidoro, and Milanfar is a computationally efficient image upscaling method using a trained set of filters. We describe a generalization of RAISR, which we name Best Linear Adaptive Enhancement (BLADE). This approach is a trainable edge-adaptive filtering framework that is general, simple, computationally efficient, and useful for a wide range of image processing problems. We show applications to denoising, compression artifact removal, demosaicing, and approximation of anisotropic diffusion equations.
Burst Denoising with Kernel Prediction Networks
Ben Mildenhall
Jiawen Chen
Dillon Sharlet
Ren Ng
Rob Carroll
CVPR (2018) (to appear)
We present a technique for jointly denoising bursts of images taken from a handheld camera. In particular, we propose a convolutional neural network architecture for predicting spatially varying kernels that can both align and denoise frames, a synthetic data generation approach based on a realistic noise formation model, and an optimization guided by an annealed loss function to avoid undesirable local minima. Our model matches or outperforms the state-of-the-art across a wide range of noise levels on both real and synthetic data.
COCO-Stuff: Thing and Stuff Classes in Context
Holger Caesar
Vittorio Ferrari
CVPR (2018) (to appear)
Semantic classes can be either things (objects with a well-defined shape, e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky). While lots of classification and detection works focus on thing classes, less attention has been given to stuff classes. Nonetheless, stuff classes are important as they allow to explain important aspects of an image, including (1) scene type; (2) which thing classes are likely to be present and their location (through contextual reasoning); (3) physical attributes, material types and geometric properties of the scene. To understand stuff and things in context we introduce COCO-Stuff, which augments 120,000 images of the COCO dataset with pixel-wise annotations for 91 stuff classes. We introduce an efficient stuff annotation protocol based on superpixels which leverages the original thing annotations. We quantify the speed versus quality trade-off of our protocol and explore the relation between annotation time and boundary complexity. Furthermore, we use COCO-Stuff to analyze: (a) the importance of stuff and thing classes in terms of their surface cover and how frequently they are mentioned in image captions; (b) the spatial relations between stuff and things, highlighting the rich contextual relations that make our dataset unique; (c) the performance of a modern semantic segmentation method on stuff and thing classes, and whether stuff is easier to segment than things.
Decoding the auditory brain with canonical component analysis
Alain de Cheveigné
Daniel D. E. Wong
Giovanni M. Di Liberto
Jens Hjortkjaer
Malcolm Slaney
Edmund Lalor
NeuroImage (2018)
The relation between a stimulus and the evoked brain response can shed light on perceptual processes within the brain. Signals derived from this relation can also be harnessed to control external devices for Brain Computer Interface (BCI) applications. While the classic event-related potential (ERP) is appropriate for isolated stimuli, more sophisticated “decoding” strategies are needed to address continuous stimuli such as speech, music or environmental sounds. Here we describe an approach based on Canonical Correlation Analysis (CCA) that finds the optimal transform to apply to both the stimulus and the response to reveal correlations between the two. Compared to prior methods based on forward or backward models for stimulus-response mapping, CCA finds significantly higher correlation scores, thus providing increased sensitivity to relatively small effects, and supports classifier schemes that yield higher classification scores. CCA strips the brain response of variance unrelated to the stimulus, and the stimulus representation of variance that does not affect the response, and thus improves observations of the relation between stimulus and response.
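As a rough illustration of the idea in the abstract above, the following is a minimal sketch of textbook Canonical Correlation Analysis in NumPy. It is not the authors' pipeline: the function name `cca`, the small ridge term `reg`, and the synthetic "stimulus" and "response" arrays are all assumptions made for this example. It shows how linear transforms of both data sets can be found so that the projected signals are maximally correlated, while dimensions unrelated to the shared signal receive little weight.

```python
import numpy as np

def cca(X, Y, reg=1e-8):
    """Plain CCA sketch. X: (n, p), Y: (n, q).
    Returns projection matrices A, B and the canonical correlations.
    `reg` is a small ridge term (an assumption) for numerical stability."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Cholesky factors: Cxx = Lx Lx^T, Cyy = Ly Ly^T
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance M = Lx^{-1} Cxy Ly^{-T}
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    # Singular values of M are the canonical correlations
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U)      # transform for X
    B = np.linalg.solve(Ly.T, Vt.T)   # transform for Y
    return A, B, s

# Synthetic toy data (hypothetical, not real recordings): one shared
# latent signal hidden among unrelated noise dimensions.
rng = np.random.default_rng(0)
n = 2000
latent = rng.standard_normal(n)
stimulus = np.column_stack([
    latent + 0.1 * rng.standard_normal(n),   # carries the shared signal
    rng.standard_normal(n),                  # unrelated to the response
])
response = np.column_stack([
    rng.standard_normal(n),                  # unrelated to the stimulus
    rng.standard_normal(n),                  # unrelated to the stimulus
    0.5 * latent + 0.1 * rng.standard_normal(n),
])

A, B, corrs = cca(stimulus, response)

# Project both sides onto the first canonical pair and measure correlation.
xs = (stimulus - stimulus.mean(axis=0)) @ A[:, 0]
ys = (response - response.mean(axis=0)) @ B[:, 0]
first_pair_corr = np.corrcoef(xs, ys)[0, 1]
print("canonical correlations:", corrs)
print("correlation of first projected pair:", first_pair_corr)
```

In this toy setup the first canonical pair should recover the shared signal, and the remaining correlations should stay close to zero. Real stimulus-response data would typically need additional preprocessing that this sketch does not attempt.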

Highlighted projects