Jump to Content
Georg Heigold

Georg Heigold

Georg Heigold received the Diplom degree in physics from ETH Zurich, Switzerland, in 2000. He was a Software Engineer at De La Rue, Berne, Switzerland, from 2000 to 2003. From 2004 to 2010, he was with the Computer Science Department, RWTH Aachen University, Aachen, University. Since 2010, he has been a Research Scientist at Google, Mountain View, CA. His research interests include automatic speech recognition, discriminative training, and log-linear modeling.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Conditional Object-Centric Learning from Video
    Thomas Kipf
    Austin Stone
    Rico Jonschkowski
    Alexey Dosovitskiy
    Klaus Greff
    ICLR, ICLR (2022)
    Preview abstract Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for any supervision. However, such fully-unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly-supervised approach and focus on how 1) using the temporal dynamics of video data in the form of optical flow and 2) conditioning the model on simple object location cues can be used to enable segmenting and tracking objects in significantly more realistic synthetic data. We introduce a sequential extension to Slot Attention which we train to predict optical flow for realistic looking synthetic scenes and show that conditioning the initial state of this model on a small set of hints, such as center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences. We also find that such initial-state-conditioning can be used during inference as a flexible interface to query the model for specific objects or parts of objects, which could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models. View details
    Preview abstract While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision tasks, attention is usually either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks, while keeping their overall structure in place. We show that this reliance on ConvNets is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-10, etc), these transformers attain excellent accuracy, matching or outperforming the best convolutional networks while requiring substantially less computational resources to train. View details
    Object-Centric Learning with Slot Attention
    Francesco Locatello
    Dirk Weissenborn
    Jakob Uszkoreit
    Alexey Dosovitskiy
    Thomas Kipf
    NeurIPS 2020
    Preview abstract Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks. View details
    End-to-End Text-Dependent Speaker Verification
    Samy Bengio
    Noam M. Shazeer
    International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
    Preview abstract In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system’s components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domain-specific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal ”Ok Google” benchmark for text-dependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easy-to-maintain systems with a small footprint. View details
    Preview abstract This article proposes and evaluates a Gaussian Mixture Model (GMM) represented as the last layer of a Deep Neural Network (DNN) architecture and jointly optimized with all previous layers using Asynchronous Stochastic Gradient Descent (ASGD). The resulting “Deep GMM” architecture was investigated with special attention to the following issues: (1) The extent to which joint optimization improves over separate optimization of the DNN-based feature extraction layers and the GMM layer; (2) The extent to which depth (measured in number of layers, for a matched total number of parameters) helps a deep generative model based on the GMM layer, compared to a vanilla DNN model; (3) Head-to-head performance of Deep GMM architectures vs. equivalent DNN architectures of comparable depth, using the same optimization criterion (frame-level Cross Entropy (CE)) and optimization method (ASGD); (4) Expanded possibilities for modeling offered by the Deep GMM generative model. The proposed Deep GMMs were found to yield Word Error Rates (WERs) competitive with state-of-the-art DNN systems, at the cost of pre-training using standard DNNs to initialize the Deep GMM feature extraction layers. An extension to Deep Subspace GMMs is described, resulting in additional gains. View details
    GMM-Free DNN Training
    Proceedings of the International Conference on Acoustics,Speech and Signal Processing (2014)
    Preview
    Word Embeddings for Speech Recognition
    Samy Bengio
    Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech (2014)
    Preview abstract Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how embeddings can still allow to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate. View details
    Preview abstract We recently showed that Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform state-of-the-art deep neural networks (DNNs) for large scale acoustic modeling where the models were trained with the cross-entropy (CE) criterion. It has also been shown that sequence discriminative training of DNNs initially trained with the CE criterion gives significant improvements. In this paper, we investigate sequence discriminative training of LSTM RNNs in a large scale acoustic modeling task. We train the models in a distributed manner using asynchronous stochastic gradient descent optimization technique. We compare two sequence discriminative criteria -- maximum mutual information and state-level minimum Bayes risk, and we investigate a number of variations of the basic training strategy to better understand issues raised by both the sequential model, and the objective function. We obtain significant gains over the CE trained LSTM RNN model using sequence discriminative training techniques. View details
    Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks
    Erik McDermott
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Firenze, Italy (2014)
    Preview abstract This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive parallelization, and limited data shuffling due to utterance-chunked processing. We analyze the impact of these two issues on the efficiency and performance of sequence training. In particular, we suggest a framework to formalize the reasoning about the asynchrony and present experimental results on both small and large scale Voice Search tasks to validate the effectiveness and efficiency of asynchronous stochastic optimization. View details
    Deep Neural Networks with Auxiliary Gaussian Mixture Models for Real-Time Speech Recognition
    Xin Lei
    Hui Lin
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
    Preview
    Multiframe Deep Neural Networks for Acoustic Modeling
    Matthieu Devin
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
    Preview abstract Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures however, they tend to be very expensive computationally, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows going up to 400 ms. Given this very long temporal context, it is tempting to wonder whether one can run neural networks at a lower frame rate than the typical 10 ms, and whether there might be computational benefits to doing so. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations. View details
    Multilingual acoustic models using distributed deep neural networks
    Patrick Nguyen
    Marc'aurelio Ranzato
    Matthieu Devin
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
    Preview abstract Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resourcescarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (crosslingual). View details
    An Empirical study of learning rates in deep neural networks for speech recognition
    Marc'aurelio Ranzato
    Ke Yang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013) (to appear)
    Preview abstract Recent deep neural network systems for large vocabulary speech recognition are trained with minibatch stochastic gradient descent but use a variety of learning rate scheduling schemes. We investigate several of these schemes, particularly AdaGrad. Based on our analysis of its limitations, we propose a new variant ‘AdaDec’ that decouples long-term learning-rate scheduling from per-parameter learning rate variation. AdaDec was found to result in higher frame accuracies than other methods. Overall, careful choice of learning rate schemes leads to faster convergence and lower word error rates View details
    Investigations on Exemplar-Based Features for Speech Recognition Towards Thousands of Hours of Unsupervised, Noisy Data
    Patrick Nguyen
    Mitchel Weintraub
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Kyoto, Japan (2012), pp. 4437-4440
    Preview abstract The acoustic models in state-of-the-art speech recognition systems are based on phones in context that are represented by hidden Markov models. This modeling approach may be limited in that it is hard to incorporate long-span acoustic context. Exemplar-based approaches are an attractive alternative, in particular if massive data and computational power are available. Yet, most of the data at Google are unsupervised and noisy. This paper investigates an exemplar-based approach under this yet not well understood data regime. A log-linear rescoring framework is used to combine the exemplar-based features on the word level with the first-pass model. This approach guarantees at least baseline performance and focuses on the refined modeling of words with sufficient data. Experimental results for the Voice Search and the YouTube tasks are presented. View details
    WFST Enabled Solutions to ASR Problems: Beyond HMM Decoding
    Björn Hoffmeister
    Ralf Schlüter
    Hermann Ney
    IEEE Transactions on Audio, Speech, and Language Processing, vol. 20 (2012), pp. 551-564
    Latent Log-Linear Models for Handwritten Digit Classification
    Thomas Deselaers
    Tobias Gass
    IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34/6 (2012), pp. 1105-1117
    RWTH OCR: A Large Vocabulary Optical Character Recognition System for Arabic Scripts
    Philippe Dreuw
    Hermann Ney
    Guide to OCR for Arabic Scripts, Springer (2012), pp. 215-254
    The RWTH Aachen University Open Source Speech Recognition System
    Christian Gollan
    Björn Hoffmeister
    Jonas Lööf
    Ralf Schlüter
    Hermann Ney
    Interspeech (2009), pp. 2111-2114
    Investigations on Convex Optimization Using Log-Linear HMMs for Digit String Recognition
    Ralf Schlüter
    Hermann Ney
    Interspeech (2009), pp. 216-219
    Speech Recognition with State-based Nearest Neighbour Classifiers.
    Thomas Deselaers
    Hermann Ney
    Interspeech (2007), pp. 2093-2096
    Minimum Exact Word Error Training
    Ralf Schlueter
    Hermann Ney
    Automatic Speech Recognition and Understanding (2005), pp. 186-190