Daniel Keysers

Studied Computer Science in Aachen, Germany and Madrid, Spain; PhD in Computer Science (Image Understanding, Pattern Recognition), RWTH Aachen, Germany; PostDoc at German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany; Joined Google Zurich in 2007 as Software Engineer; projects at Google: YouTube Content-ID; Handwriting Recognition; Natural Language Understanding; Deep Learning & Computer Vision Research.
Authored Publications
    We explore the boundaries of scaling up a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
    The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features) ViT22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
    Scaling Vision with Sparse Mixture of Experts
    Carlos Riquelme
    Basil Mustafa
    Maxim Neumann
    Rodolphe Jenatton
    André Susano Pinto
    NeurIPS 2021(2021)
    Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
    Identifying the locations and footprints of buildings is vital for many practical and scientific purposes, and such information can be particularly useful in developing regions where alternative data sources may be scarce. In this work, we describe a model training pipeline for detecting buildings across the entire continent of Africa, given 50cm satellite imagery. Starting with the U-Net model, widely used in satellite image analysis, we study variations in architecture, loss functions, regularization, pre-training, self-training and post-processing that increase instance segmentation performance. Experiments were carried out using a dataset of 100k satellite images across Africa containing 1.75M manually labelled building instances, and further datasets for pre-training and self-training. We report novel methods for improving performance of building detection with this type of model, including the use of mixup (mAP +0.12) and self-training with soft KL loss (mAP +0.06). The resulting pipeline obtains good results even on a wide variety of challenging rural and urban contexts, and was used to create the Open Buildings dataset of approximately 600M Africa-wide building footprints.
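The abstract above names mixup as one of the techniques that improved detection quality. As a rough illustration only, the following is a minimal sketch of generic mixup on flat feature and label vectors. It is not the pipeline from the paper: the function name `mixup`, the `alpha` default of 0.2, and the use of plain Python lists are all assumptions made for this example, and how the authors applied mixup to satellite images and segmentation targets is not described here.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two training examples with a Beta-distributed weight.

    x_a, x_b: feature vectors of equal length (hypothetical flat inputs).
    y_a, y_b: label vectors of equal length (e.g. one-hot or soft labels).
    alpha:    Beta(alpha, alpha) concentration; 0.2 is an assumed default.
    rng:      anything exposing betavariate(), e.g. random.Random(seed).
    """
    # Sample the mixing coefficient lam in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # Apply the same convex combination to inputs and to labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

Calling `mixup(x1, y1, x2, y2, rng=random.Random(0))` returns the blended input, the blended label, and the sampled weight, so a training loop could draw a fresh weight per pair of examples.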
    Scalable Transfer Learning with Expert Models
    Carlos Riquelme
    Basil Mustafa
    Cedric Renggli
    André Susano Pinto
    Sylvain Gelly
    ICLR 2021(2021)
    Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task. This strategy scales the process of transferring to new tasks, since it does not revisit the pre-training data during transfer. Accordingly, it requires little extra compute per target task, and results in a speed-up of 2–3 orders of magnitude compared to competing approaches. Further, we provide an adapter-based architecture able to compress many experts into a single model. We evaluate our approach on two different data sources and demonstrate that it outperforms baselines on over 20 diverse vision tasks in both cases.
    Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks with comparable pre-training and inference cost. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
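The two layer types described in this abstract can be sketched in a few lines. The following is a minimal pure-Python illustration, assuming the input is a table with one row per image patch and one column per channel. The function names, the GELU nonlinearity, the skip connections in `mixer_block`, and the random weight initialisation are assumptions for this example; normalisation layers, patch embedding, and the classifier head are omitted, so this is not the published implementation.

```python
import math
import random


def gelu(v):
    # Tanh approximation of GELU (the choice of nonlinearity is an assumption).
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def matmul(a, b):
    # Plain matrix product of two list-of-lists matrices.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def mlp(x, w1, w2):
    # Two-layer perceptron applied to every row of x independently.
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def channel_mixing(x, w1, w2):
    # Rows are patches, so each patch mixes its own per-location features.
    return mlp(x, w1, w2)


def token_mixing(x, w1, w2):
    # Transpose so rows are channels; each channel is mixed across patches.
    return transpose(mlp(transpose(x), w1, w2))


def mixer_block(x, token_w1, token_w2, channel_w1, channel_w2):
    # One block: mixing across patches, then per patch, each with an assumed skip connection.
    t = token_mixing(x, token_w1, token_w2)
    x = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(x, t)]
    c = channel_mixing(x, channel_w1, channel_w2)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(x, c)]


def random_matrix(rows, cols, rng):
    # Hypothetical helper that creates small random weights for experimentation.
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
```

With `P` patches, `C` channels and hidden width `H`, the token-mixing weights have shapes `(P, H)` and `(H, P)` and the channel-mixing weights have shapes `(C, H)` and `(H, C)`; the block returns a table of the same shape as its input.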
    Fast Multi-language LSTM-based Online Handwriting Recognition
    Thomas Deselaers
    Alexander Daryin
    Marcos Calvo
    Li-Lun Wang
    Sandro Feuz
    Philippe Gervais
    International Journal on Document Analysis and Recognition (IJDAR)(2020)
    Handwriting is a natural input method for many people and we continuously invest in improving the recognition quality. Here we describe and motivate the modelling and design choices that lead to a significant improvement across the 100 supported languages, based on recurrent neural networks and a variety of language models. This new architecture has completely replaced our previous segment-and-decode system and reduced the error rate by 30%-40% relative for most languages. Further, we report new state-of-the-art results on IAM-OnDB for both the open and closed dataset setting. By using Bézier curves for shortening the input length of our sequences we obtain up to 10x faster recognition times. Through a series of experiments we determine what layers are needed and how wide and deep they should be. We evaluate the setup on a number of additional public datasets.
    We study deep neural networks (DNNs) trained on natural image data with entirely random labels. Despite its popularity in the literature, where it is often used to study memorization, generalization, and other phenomena, little is known about what DNNs learn in this setting. In this paper, we show analytically for convolutional and fully connected networks that an alignment between the principal components of network parameters and data takes place when training with random labels. We study this alignment effect by investigating neural networks pre-trained on randomly labelled image data and subsequently fine-tuned on disjoint datasets with random or real labels. We show how this alignment produces a positive transfer: networks pre-trained with random labels train faster downstream compared to training from scratch even after accounting for simple effects, such as weight scaling. We analyze how competing effects, such as specialization at later layers, may hide the positive transfer. These effects are studied in several network architectures, including VGG16 and ResNet18, on CIFAR10 and ImageNet.
    Measuring Compositional Generalization: A Comprehensive Method on Realistic Data
    Nathanael Schärli
    Nathan Scales
    Hylke Buisman
    Daniel Furrer
    Nikola Momchev
    Danila Sinopalnikov
    Lukasz Stafiniak
    Tibor Tihon
    Dmitry Tsarkov
    ICLR(2020)
    State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.
    Multi-Language Online Handwriting Recognition
    Thomas Deselaers
    Li-Lun Wang
    IEEE Transactions on Pattern Analysis and Machine Intelligence(2016)
    We describe Google's online handwriting recognition system that currently supports 22 scripts and 97 languages. The system's focus is on fast, high-accuracy text entry for mobile, touch-enabled devices. We use a combination of state-of-the-art components and combine them with novel additions in a flexible framework. This architecture allows us to easily transfer improvements between languages and scripts. This made it possible to build recognizers for languages that, to the best of our knowledge, are not handled by any other online handwriting recognition system. The approach also enabled us to use the same architecture both on very powerful machines for recognition in the cloud as well as on mobile devices with more limited computational power by changing some of the settings of the system. In this paper we give a general overview of the system architecture and the novel components, such as unified time- and position-based input interpretation, trainable segmentation, minimum-error rate training for feature combination, and a cascade of pruning strategies. We present experimental results for different setups. The system is currently publicly available in several Google products, for example in Google Translate and as an input method for Android devices.
    GyroPen: Gyroscopes for Pen-input with Mobile Phones
    Thomas Deselaers
    Jan Hosang
    IEEE Transactions on Human-Machine Systems, 45(2015), pp. 263-271
    We present GyroPen, a method for text entry into mobile devices using pen-like writing interaction reconstructed from standard built-in sensors. The key idea is to reconstruct a representation of the trajectory of the phone's corner that is touching a writing surface from the measurements obtained from the phone's gyroscopes and accelerometers. We propose to directly use the angular trajectory for this reconstruction, which removes the necessity for accurate absolute 3D position estimation, a task that can be difficult using low-cost accelerometers. Recognition is then performed using an off-the-shelf handwriting recognition system, allowing easy extension to new languages and scripts. In a small user study (n=10), the average novice participant was able to write the first word only 37 seconds after starting to use GyroPen for the first time. With some experience, users were able to write at the speed of 3-4s for one English word and with a character error rate of 18%.
    Features for image retrieval: an experimental comparison
    Thomas Deselaers
    Hermann Ney
    Information Retrieval, 11(2008), pp. 77-107
    Deformation models for image recognition
    Thomas Deselaers
    Christian Gollan
    Hermann Ney
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2007), pp. 1422-1435
    We present the application of different nonlinear image deformation models to the task of image recognition. The deformation models are especially suited for local changes as they often occur in the presence of image object variability. We show that, among the discussed models, there is one approach that combines simplicity of implementation, low computational complexity, and highly competitive performance across various real-world image recognition tasks. We show experimentally that the model performs very well for four different handwritten digit recognition tasks and for the classification of medical images, thus showing high generalization capacity. In particular, an error rate of 0.54 percent on the MNIST benchmark is achieved, as well as the lowest reported error rate, specifically 12.6 percent, in the 2005 international ImageCLEF evaluation of medical image categorization.
    Discriminative Training for Object Recognition using Image Patches
    Thomas Deselaers
    Hermann Ney
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)(2005)
    We present a method for automatically learning discriminative image patches for the recognition of given object classes. The approach applies discriminative training of log-linear models to image patch histograms. We show that it works well on three tasks and performs significantly better than other methods using the same features. For example, the method decides that patches containing an eye are most important for distinguishing face from background images. The recognition performance is very competitive with error rates presented in other publications. In particular, a new best error rate for the Caltech motorbikes data of 1.5% is achieved.
    Adaptation in Statistical Pattern Recognition Using Tangent Vectors
    Hermann Ney
    Joerg Dahmen
    IEEE Trans. Pattern Analysis Machine Intelligence, 26(2004), pp. 269-274
    We integrate the tangent method into a statistical framework for classification analytically and practically. The resulting consistent framework for adaptation allows us to efficiently estimate the tangent vectors representing the variability. The framework improves classification results on two real-world pattern recognition tasks from the domains of handwritten character recognition and automatic speech recognition.
    Elastic image matching is NP-complete
    Walter Unger
    Pattern Recognition Letters, 24(2003), pp. 445-453
    One fundamental problem in image recognition is to establish the resemblance of two images. This can be done by searching for the best pixel-to-pixel mapping, taking into account monotonicity and continuity constraints. We show that this problem is NP-complete by reduction from 3-SAT, thus giving evidence that the known exponential time algorithms are justified, but approximation algorithms or simplifications are necessary.
    Maximum Entropy and Gaussian Models for Image Object Recognition
    Franz Josef Och
    Hermann Ney
    DAGM-Symposium(2002), pp. 498-506
    Improving Automatic Speech Recognition Using Tangent Distance
    Joerg Dahmen
    Hermann Ney
    European Conference on Speech Communication and Technology(2001)
    Learning of Variability for Invariant Statistical Pattern Recognition
    Joerg Dahmen
    Hermann Ney
    European Conference on Machine Learning (ECML)(2001), pp. 263-275