Luciano Sbaiz
Luciano Sbaiz is a Research Scientist at Google. He joined Google in 2008 and has worked on video content analysis, machine learning, and monetization.
Prior to Google, he was a Research and Teaching Associate at EPFL, Lausanne, where he worked on problems of signal processing with application to image acquisition, tomography, and acoustics.
Authored Publications
Flexible Multi-task Networks by Learning Parameter Allocation
Krzysztof Maziarz
Jesse Berent
ICLR 2021 Workshop on Neural Architecture Search (2021)
Abstract
Multi-task neural networks, when trained successfully, can learn to leverage related concepts from different tasks by using weight sharing. Sharing parameters between highly unrelated tasks can hurt both of them, so a strong multi-task model should be able to control the amount of weight sharing between pairs of tasks, and flexibly adapt it to their relatedness. In recent works, routing networks have shown strong performance in a variety of settings, including multi-task learning. However, optimization difficulties often prevent routing models from unlocking their full potential. In this work, we propose a novel routing method, specifically designed for multi-task learning, where routing is optimized jointly with the model parameters by standard backpropagation. We show that it can discover related pairs of tasks, and improve accuracy over strong baselines. In particular, on multi-task learning for the Omniglot dataset our method reduces the state-of-the-art error rate by 17%.
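The core idea above, routing that is trainable by standard backpropagation, can be illustrated with a toy sketch. The module shapes, two-task setup, and soft (softmax-weighted) mixture below are illustrative assumptions, not the paper's exact mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 shared linear "modules" and 2 tasks. Each task
# owns learnable routing logits over the modules; a softmax over the
# logits yields differentiable per-task allocation weights, so routing
# can be optimized jointly with the module weights by backpropagation.
n_modules, d_in, d_out, n_tasks = 3, 4, 4, 2
modules = [rng.normal(size=(d_in, d_out)) for _ in range(n_modules)]
route_logits = np.zeros((n_tasks, n_modules))  # trainable routing parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, task):
    # Soft mixture of module outputs, weighted by routing probabilities.
    probs = softmax(route_logits[task])
    return sum(p * (x @ W) for p, W in zip(probs, modules))

x = rng.normal(size=(d_in,))
y0 = forward(x, task=0)
```

Because `forward` is differentiable in both `route_logits` and `modules`, a single gradient step can move the allocation toward or away from sharing, which is the property the abstract emphasizes.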
Multi-path Neural Networks for On-device Multi-domain Visual Classification
Andrew Howard
Gabriel M. Bender
Grace Chu
Jeff Gilbert
Joshua Greaves
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2021), pp. 3019-3028
Abstract
Learning multiple domains/tasks with a single model is important for improving data efficiency and lowering inference cost for numerous vision tasks, especially on resource-constrained mobile devices. However, hand-crafting a multi-domain/task model can be both tedious and challenging. This paper proposes a novel approach to automatically learn a multi-path network for multi-domain visual classification on mobile devices. The proposed multi-path network is learned from neural architecture search by applying one reinforcement learning controller for each domain to select the best path in the super-network created from a MobileNetV3-like search space. An adaptive balanced domain prioritization algorithm is proposed to balance optimizing the joint model on multiple domains simultaneously. The determined multi-path model selectively shares parameters across domains in shared nodes while keeping domain-specific parameters within non-shared nodes in individual domain paths. This approach effectively reduces the total number of parameters and FLOPS, encouraging positive knowledge transfer while mitigating negative interference across domains. Extensive evaluations on the Visual Decathlon dataset demonstrate that the proposed multi-path model achieves state-of-the-art performance in terms of accuracy, model size, and FLOPS against other approaches using MobileNetV3-like architectures. Furthermore, the proposed method improves average accuracy over learning single-domain models individually, and reduces the total number of parameters and FLOPS by 78% and 32% respectively, compared to the approach that simply bundles single-domain models for multi-domain learning.
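The per-domain controller described above can be sketched with a toy REINFORCE loop: one categorical choice per super-network node, with logits nudged toward paths that score well. The node/choice counts and the stand-in reward are hypothetical; the real system evaluates candidate paths on each domain:

```python
import math
import random

random.seed(0)

# Toy super-network: 3 nodes, each offering 2 candidate branches.
# A controller keeps per-node logits and samples one branch per node.
N_NODES, N_CHOICES = 3, 2
logits = [[0.0] * N_CHOICES for _ in range(N_NODES)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sample_path():
    path = []
    for node in range(N_NODES):
        p = softmax(logits[node])
        r, acc, choice = random.random(), 0.0, N_CHOICES - 1
        for i, pi in enumerate(p):
            acc += pi
            if r < acc:
                choice = i
                break
        path.append(choice)
    return path

def reward(path):
    # Stand-in reward: pretend branch 1 is best at every node. In the
    # paper this would be the path's validation accuracy on a domain.
    return sum(path) / N_NODES

def reinforce_step(lr=0.5):
    path = sample_path()
    r = reward(path)
    for node, choice in enumerate(path):
        p = softmax(logits[node])
        for i in range(N_CHOICES):
            grad = ((1.0 if i == choice else 0.0) - p[i]) * r
            logits[node][i] += lr * grad

for _ in range(500):
    reinforce_step()

best = [max(range(N_CHOICES), key=lambda i: logits[node][i])
        for node in range(N_NODES)]
```

Running one such controller per domain, as the abstract describes, yields one preferred path per domain; nodes chosen by several domains become the shared parameters.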
Fast Task-Aware Architecture Inference
Anja Hauth
Jesse Berent
https://arxiv.org/abs/1902.05781 (2019)
Abstract
Neural architecture search has been shown to hold great promise towards the automation of deep learning. However, in spite of its potential, neural architecture search remains quite costly. To this end, we propose a novel gradient-based framework for efficient architecture search by sharing information across several tasks. We start by training many model architectures on several related (training) tasks. When a new unseen task is presented, the framework performs architecture inference in order to quickly identify a good candidate architecture, before any model is trained on the new task. At the core of our framework lies a deep value network that can predict the performance of input architectures on a task by utilizing task meta-features and the previous model training experiments performed on related tasks. We adopt a continuous parametrization of the model architecture which allows for efficient gradient-based optimization. Given a new task, an effective architecture is quickly identified by maximizing the estimated performance with respect to the model architecture parameters with simple gradient ascent. It is key to point out that our goal is to achieve reasonable performance at the lowest cost. We provide experimental results showing the effectiveness of the framework despite its high computational efficiency.
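The inference step, gradient ascent on a continuous architecture vector under a learned value network, can be sketched as follows. The value function here is a smooth stand-in whose optimum depends on the task meta-features, not a trained network:

```python
import numpy as np

# Stand-in value network v(architecture, task_features) -> predicted
# score. In the paper this is a trained deep network; here a smooth
# hypothetical surrogate lets us demonstrate the gradient-ascent step.
def value(a, task_feat):
    target = np.tanh(task_feat)  # pretend the optimum depends on the task
    return -np.sum((a - target) ** 2)

def value_grad(a, task_feat):
    return -2.0 * (a - np.tanh(task_feat))

def infer_architecture(task_feat, steps=200, lr=0.1):
    # Continuous architecture parameters, refined by simple gradient
    # ascent on the predicted performance -- no model training needed.
    a = np.zeros_like(task_feat)
    for _ in range(steps):
        a += lr * value_grad(a, task_feat)
    return a

task = np.array([0.5, -1.0, 2.0])  # hypothetical task meta-features
arch = infer_architecture(task)
```

The point the abstract makes is exactly this cost profile: finding `arch` takes a few hundred cheap gradient steps rather than any training runs on the new task.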
Ranking architectures using meta-learning
Alina Dubatovka
Jesse Berent
NeurIPS Workshop on Meta-Learning (MetaLearn 2019) (to appear)
Abstract
Neural architecture search has recently attracted substantial research effort, as it promises to automate the manual design of neural networks. However, it requires a large amount of computing resources. To alleviate this, a performance prediction network has recently been proposed that enables efficient architecture search by forecasting the performance of candidate architectures instead of relying on actual model training. The performance predictor is task-aware, taking as input not only the candidate architecture but also task meta-features, and it has been designed to learn collectively from several tasks. In this work, we introduce a pairwise ranking loss for training a network able to rank candidate architectures for a new unseen task, conditioning on its task meta-features. We present experimental results showing that the ranking network is more effective in architecture search than the previously proposed performance predictor.
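A pairwise ranking loss of the kind described above can be written in a few lines. The logistic form below is one common choice, an assumption for illustration; the paper may use a different pairwise formulation:

```python
import math

def pairwise_ranking_loss(scores, pairs):
    """Logistic pairwise ranking loss.

    scores: predicted scores, one per candidate architecture.
    pairs:  list of (better, worse) index pairs from ground truth;
            each pair is penalized unless scores[better] > scores[worse].
    """
    return sum(math.log1p(math.exp(-(scores[b] - scores[w])))
               for b, w in pairs) / len(pairs)

scores = [2.0, 0.5, -1.0]          # predicted scores for 3 candidates
pairs = [(0, 1), (0, 2), (1, 2)]   # ground-truth ordering: 0 > 1 > 2
loss = pairwise_ranking_loss(scores, pairs)
```

Training on such a loss only asks the network to order candidates correctly, which matches the search use case: the searcher needs the best architecture, not a calibrated accuracy estimate.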
Abstract
This paper explores the problem of large-scale automatic video geolocation. A methodology is developed to infer the location at which videos from Anonymized.com were recorded using video content and various additional signals. Specifically, multiple binary Adaboost classifiers are trained to identify particular places based on learning decision stumps on sets of hundreds of thousands of sparse features. A one-vs-all classification strategy is then used to classify the location at which videos were recorded. Empirical validation is performed on an immense data set of 20 million labeled videos. Results demonstrate that high-accuracy video geolocation is indeed possible for many videos and locations, and that interesting relationships exist between videos and the places where they are recorded.
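The one-vs-all decision rule is simple to sketch: each place gets its own binary scorer (in the paper, AdaBoost over decision stumps on sparse features), and a video is assigned the place whose scorer responds most strongly. The place names, feature names, and linear stand-in scorers below are hypothetical:

```python
# One-vs-all prediction: run every per-place binary classifier on the
# video's feature vector and return the highest-scoring place.
def one_vs_all_predict(features, classifiers):
    scores = {place: clf(features) for place, clf in classifiers.items()}
    return max(scores, key=scores.get)

# Hypothetical stand-in classifiers over a tiny sparse feature dict;
# the real system uses boosted decision stumps on hundreds of
# thousands of sparse features.
classifiers = {
    "paris": lambda f: 2.0 * f.get("eiffel", 0) + 0.5 * f.get("seine", 0),
    "nyc":   lambda f: 2.0 * f.get("liberty", 0) + 0.5 * f.get("subway", 0),
}
pred = one_vs_all_predict({"eiffel": 1, "subway": 1}, classifiers)
```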
Finding Meaning on YouTube: Tag Recommendation and Category Discovery
Marius Pasca
Computer Vision and Pattern Recognition, IEEE (2010)
Abstract
We present a system that automatically recommends tags for YouTube videos based solely on their audiovisual content. We also propose a novel framework for unsupervised discovery of video categories that exploits knowledge mined from World-Wide Web text documents and searches. First, the association between video content and tags is learned by training classifiers that map audiovisual content-based features from millions of videos on YouTube.com to the existing uploader-supplied tags for these videos. When a new video is uploaded, the labels provided by these classifiers are used to automatically suggest tags deemed relevant to the video. Our system has learned a vocabulary of over 20,000 tags. Second, we mined large volumes of Web pages and search queries to discover a set of possible text entity categories and a set of associated is-A relationships that map individual text entities to categories. Finally, we apply these is-A relationships mined from Web text to the tags learned from the audiovisual content of videos to automatically synthesize a reliable set of categories most relevant to videos, along with a mechanism to predict these categories for new uploads. We then present rigorous rating studies that establish that: (a) the average relevance of tags automatically recommended by our system matches the average relevance of the uploader-supplied tags at the same or better coverage, and (b) the average precision@K of video categories discovered by our system is 70% with K=5.
Abstract
The main feature of the gigavision camera is that its pixels have a binary response. The response function of a gigavision sensor is non-linear and similar to a logarithmic function, which makes the camera suitable for high dynamic range imaging. Since the sensor can detect a single photon, the camera is very sensitive and can be used for night vision and astronomical imaging. One important aspect of the gigavision camera is how to estimate the light intensity from binary observations. We model the light intensity field as piecewise constant in 2D and use Maximum Penalized Likelihood Estimation (MPLE) to recover it. Dynamic programming is used to solve the optimization problem. Because dynamic programming is computationally expensive, a greedy algorithm and a quadtree-pruning method are also proposed; they achieve acceptable reconstruction performance with low computational complexity. Experimental results are given for synthesized images and for real images taken by a single-photon avalanche diode (SPAD) camera.
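For a single constant-intensity region, the estimation problem above has a closed form: a binary pixel fires when it detects at least one photon, so under a Poisson model P(b = 1) = 1 - exp(-λ), and the maximum likelihood estimate inverts this relation. The sketch below covers only this constant-region case; the paper's MPLE operates on a full piecewise-constant field with a penalty term:

```python
import math
import random

random.seed(1)

def estimate_intensity(bits):
    """MLE of the intensity lam from K binary pixels with S ones:
    S/K estimates 1 - exp(-lam), so lam_hat = -ln(1 - S/K)."""
    k, s = len(bits), sum(bits)
    s = min(s, k - 1)  # guard against log(0) when every pixel fired
    return -math.log(1.0 - s / k)

# Simulate 100k binary pixels observing a constant intensity.
lam_true = 0.7
p_fire = 1.0 - math.exp(-lam_true)
bits = [1 if random.random() < p_fire else 0 for _ in range(100_000)]
lam_hat = estimate_intensity(bits)
```

The saturation guard also hints at why the sensor behaves logarithmically: as λ grows, S/K approaches 1 and the estimate compresses large intensities, which is the high-dynamic-range property the abstract describes.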
The Gigavision Camera
Edoardo Charbon
Sabine Susstrunk
Martin Vetterli
Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE (2009)
Abstract
We propose a new imaging device called the gigavision camera. The main differences between a conventional and a gigavision camera are that the pixels of the gigavision camera are binary and orders of magnitude smaller. A gigavision camera can be built using standard memory chip technology, where each memory bit is designed to be light sensitive. A conventional gray level image can be obtained from the binary gigavision image by low-pass filtering and sampling. The main advantage of the gigavision camera is that its response is non-linear and similar to a logarithmic function, which makes it suitable for acquiring high dynamic range scenes. The larger the number of binary pixels considered, the higher the dynamic range of the gigavision camera will be. In addition, the binary sensor of the gigavision camera can be combined with a lens array in order to realize an extremely thin camera. Due to the small size of the pixels, this design does not require deconvolution techniques typical of similar systems based on conventional sensors.
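The readout step described above, recovering a gray-level image from the binary image by low-pass filtering and sampling, can be sketched with the simplest possible low-pass filter: averaging non-overlapping B x B blocks of binary pixels into one gray pixel each. The tiny 4x4 input is illustrative:

```python
def binary_to_gray(binary, block):
    """Average non-overlapping block x block tiles of a binary image
    into one gray-level pixel each (box filter + downsampling)."""
    h, w = len(binary), len(binary[0])
    gray = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            s = sum(binary[i + di][j + dj]
                    for di in range(block) for dj in range(block))
            row.append(s / (block * block))
        gray.append(row)
    return gray

binary = [[1, 0, 1, 1],
          [0, 0, 1, 1],
          [1, 1, 1, 0],
          [0, 0, 1, 1]]
gray = binary_to_gray(binary, 2)  # 2x2 image of block averages
```

Larger blocks average more binary pixels per gray pixel, which is the trade-off the abstract notes: more binary pixels per output pixel means higher dynamic range at lower spatial resolution.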