Georg Heigold
Georg Heigold received the Diplom degree in
physics from ETH Zurich, Switzerland, in 2000.
He was a Software Engineer at De La Rue, Berne,
Switzerland, from 2000 to 2003. From 2004 to 2010,
he was with the Computer Science Department,
RWTH Aachen University, Aachen, Germany.
Since 2010, he has been a Research Scientist at
Google, Mountain View, CA. His research interests
include automatic speech recognition, discriminative
training, and log-linear modeling.
Authored Publications
Conditional Object-Centric Learning from Video
Thomas Kipf
Gamaleldin Fathy Elsayed
Austin Stone
Rico Jonschkowski
Alexey Dosovitskiy
Klaus Greff
ICLR (2022)
Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for any supervision. However, such fully-unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly-supervised approach and focus on how 1) using the temporal dynamics of video data in the form of optical flow and 2) conditioning the model on simple object location cues can be used to enable segmenting and tracking objects in significantly more realistic synthetic data. We introduce a sequential extension to Slot Attention which we train to predict optical flow for realistic looking synthetic scenes and show that conditioning the initial state of this model on a small set of hints, such as center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences. We also find that such initial-state-conditioning can be used during inference as a flexible interface to query the model for specific objects or parts of objects, which could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Dirk Weissenborn
Jakob Uszkoreit
Neil Houlsby
Sylvain Gelly
Thomas Unterthiner
ICLR (2021)
While the Transformer architecture has become the de facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision tasks, attention is usually either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks, while keeping their overall structure in place. We show that this reliance on ConvNets is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-10, etc.), these transformers attain excellent accuracy, matching or outperforming the best convolutional networks while requiring substantially fewer computational resources to train.
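As a rough illustration of what "sequences of image patches" can mean, the sketch below splits a small single-channel image into non-overlapping square patches and flattens each into one token. The function name `image_to_patches` and the toy 4x4 image are assumptions made for this example. It shows only the patch-splitting step, not the learned projection or the Transformer itself.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches.

    Returns the patches in row-major order; each patch is a flat list of
    patch_size * patch_size pixel values. Hypothetical helper for illustration.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single token.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 image split into 2x2 patches yields a sequence of 4 tokens.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
sequence = image_to_patches(toy_image, 2)
```

In a full model, each flattened patch would typically be mapped to an embedding vector before being fed to the Transformer. That step is omitted here.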
Object-Centric Learning with Slot Attention
Francesco Locatello
Dirk Weissenborn
Thomas Unterthiner
Jakob Uszkoreit
Alexey Dosovitskiy
Thomas Kipf
NeurIPS (2020)
Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.
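The competitive procedure mentioned above can be sketched, under simplifying assumptions, as an attention step whose softmax is normalized over slots rather than over inputs, so that slots compete for each input feature. The sketch omits learned projections and any learned update function. Each slot is simply replaced by the attention-weighted mean of the inputs. The name `competitive_attention_step` and the toy data are assumptions made for this example.

```python
import math


def competitive_attention_step(slots, inputs):
    """One simplified round of attention in which slots compete for inputs.

    slots and inputs are lists of equal-length float vectors.
    Returns (new_slots, attention), where attention[i][k] is the share of
    input i assigned to slot k. Hypothetical simplification for illustration.
    """
    dim = len(inputs[0])
    attention = []
    for x in inputs:
        logits = [sum(a * b for a, b in zip(x, s)) for s in slots]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        # Softmax over slots: each input distributes a unit of mass among slots.
        attention.append([e / total for e in exps])

    new_slots = []
    for k in range(len(slots)):
        weight = sum(row[k] for row in attention)
        # Attention-weighted mean of the inputs assigned to slot k.
        new_slots.append([
            sum(attention[i][k] * inputs[i][d] for i in range(len(inputs))) / weight
            for d in range(dim)
        ])
    return new_slots, attention


# Two well-separated toy "features" and two slots, refined over three rounds.
features = [[5.0, 0.0], [0.0, 5.0]]
slots = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(3):
    slots, attention = competitive_attention_step(slots, features)
```

After a few rounds on this toy input, each slot ends up dominated by one of the two features, which is the specialization behavior the abstract describes.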
End-to-End Text-Dependent Speaker Verification
Samy Bengio
Noam M. Shazeer
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system’s components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domain-specific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal “Ok Google” benchmark for text-dependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easy-to-maintain systems with a small footprint.
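One way to picture mapping a test utterance and a few reference utterances to a single score is sketched below. It assumes each utterance has already been turned into a fixed-length embedding, that the speaker model is the average of the reference embeddings, and that the score is a cosine similarity. These choices, along with the names `speaker_model` and `verification_score`, are assumptions made for this example and are not stated in the abstract.

```python
import math


def speaker_model(reference_embeddings):
    """Average a few reference embeddings into one speaker vector (assumed scheme)."""
    count = len(reference_embeddings)
    dim = len(reference_embeddings[0])
    return [sum(e[i] for e in reference_embeddings) / count for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def verification_score(test_embedding, reference_embeddings):
    """Map one test embedding and a few reference embeddings to a single score."""
    return cosine_similarity(test_embedding, speaker_model(reference_embeddings))


# Toy embeddings: the first test vector matches the enrolled speaker, the second does not.
references = [[1.0, 0.0], [1.0, 0.0]]
same_speaker_score = verification_score([1.0, 0.0], references)
other_speaker_score = verification_score([0.0, 1.0], references)
```

An accept or reject decision would then compare the score against a threshold chosen on held-out data.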
Preview abstract
This article proposes and evaluates a Gaussian Mixture Model (GMM) represented as the last layer of a Deep Neural Network (DNN) architecture and jointly optimized with all previous layers using Asynchronous Stochastic Gradient Descent (ASGD). The resulting “Deep GMM” architecture was investigated with special attention to the following issues: (1) The extent to which joint optimization improves over separate optimization of the DNN-based feature extraction layers and the GMM layer; (2) The extent to which depth (measured in number of layers, for a matched total number of parameters) helps a deep generative model based on the GMM layer, compared to a vanilla DNN model; (3) Head-to-head performance of Deep GMM architectures vs. equivalent DNN architectures of comparable depth, using the same optimization criterion (frame-level Cross Entropy (CE)) and optimization method (ASGD); (4) Expanded possibilities for modeling offered by the Deep GMM generative model. The proposed Deep GMMs were found to yield Word Error Rates (WERs) competitive with state-of-the-art DNN systems, at the cost of pre-training using standard DNNs to initialize the Deep GMM feature extraction layers. An extension to Deep Subspace GMMs is described, resulting in additional gains.
Word Embeddings for Speech Recognition
Samy Bengio
Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech (2014)
Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction in which words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how embeddings can still be used to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.
Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks
Andrew Senior
Erik McDermott
Rajat Monga
Mark Mao
Interspeech (2014)
We recently showed that Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform state-of-the-art deep neural networks (DNNs) for large scale acoustic modeling where the models were trained with the cross-entropy (CE) criterion. It has also been shown that sequence discriminative training of DNNs initially trained with the CE criterion gives significant improvements.
In this paper, we investigate sequence discriminative training of LSTM RNNs in a large scale acoustic modeling task. We train the models in a distributed manner using asynchronous stochastic gradient descent optimization. We compare two sequence discriminative criteria, maximum mutual information and state-level minimum Bayes risk, and we investigate a number of variations of the basic training strategy to better understand issues raised by both the sequential model and the objective function. We obtain significant gains over the CE trained LSTM RNN model using sequence discriminative training techniques.
GMM-Free DNN Training
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2014)
Asynchronous, Online, GMM-free Training of a Context Dependent Acoustic Model for Speech Recognition
Proceedings of the European Conference on Speech Communication and Technology (2014) (to appear)