Marvin Ritter
Authored Publications
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Carlos Riquelme
Sebastian Goodman
Yi Tay
Siamak Shakeri
Daniel Salz
Michael Tschannen
Mandar Joshi
Filip Pavetić
Gang Li
Anurag Arnab
Yuanzhong Xu
Keran Rong
Neil Houlsby
Computer Vision and Pattern Recognition Conference (CVPR) (2024)
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Representation learning from videos in-the-wild: An object-centric approach
Rob Romijnders
Michael Tschannen
Josip Djolonga
Neil Houlsby
WACV (2021)
We propose a method to learn image representations from uncurated videos. We combine a supervised loss from off-the-shelf object detectors with self-supervised losses that naturally arise from the video-shot-frame-object hierarchy present in each video. We report competitive results on the 19 transfer learning tasks of the Visual Task Adaptation Benchmark (VTAB) and on 8 out-of-distribution generalization tasks, and discuss the benefits and shortcomings of the proposed approach. In particular, it improves over the baseline on 18 of the 19 few-shot learning tasks and on all 8 out-of-distribution generalization tasks. Finally, we perform several ablation studies and analyze the impact of the pretrained object detector on performance across this suite of tasks.
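The combination of losses described above can be illustrated with a short sketch. The following Python (PyTorch) snippet is a hypothetical illustration, not the paper's code: the argument names (detector_logits, pseudo_labels, the embedding pairs) and the simple cosine-agreement terms are assumptions standing in for the actual supervised and hierarchy-induced self-supervised losses.

import torch
import torch.nn.functional as F

def combined_loss(detector_logits, pseudo_labels,
                  frame_embed_a, frame_embed_b,
                  shot_embed_a, shot_embed_b,
                  w_sup=1.0, w_frame=1.0, w_shot=1.0):
    # Supervised term: classify object crops against labels produced by an
    # off-the-shelf detector (pseudo_labels are integer class indices).
    sup = F.cross_entropy(detector_logits, pseudo_labels)
    # Self-supervised terms: embeddings of crops from the same frame, and of
    # frames from the same shot, should agree.
    frame = 1.0 - F.cosine_similarity(frame_embed_a, frame_embed_b).mean()
    shot = 1.0 - F.cosine_similarity(shot_embed_a, shot_embed_b).mean()
    return w_sup * sup + w_frame * frame + w_shot * shot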
Continental-scale building detection from high resolution satellite imagery
Wojciech Sirko
Yasser Salah Eddine Bouchareb
Maxim Neumann
Moustapha Cisse
arXiv (2021)
Identifying the locations and footprints of buildings is vital for many practical and scientific purposes, and such information can be particularly useful in developing regions where alternative data sources may be scarce. In this work, we describe a model training pipeline for detecting buildings across the entire continent of Africa, given 50 cm satellite imagery. Starting with the U-Net model, widely used in satellite image analysis, we study variations in architecture, loss functions, regularization, pre-training, self-training and post-processing that increase instance segmentation performance. Experiments were carried out using a dataset of 100k satellite images across Africa containing 1.75M manually labelled building instances, and further datasets for pre-training and self-training. We report novel methods for improving the performance of building detection with this type of model, including the use of mixup (mAP +0.12) and self-training with a soft KL loss (mAP +0.06). The resulting pipeline obtains good results even in a wide variety of challenging rural and urban contexts, and was used to create the Open Buildings dataset of approximately 600M Africa-wide building footprints.
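Two of the reported ingredients, mixup and self-training with a soft KL loss, are easy to sketch. The Python (PyTorch) snippet below is a minimal illustration under assumed shapes (soft, per-pixel mask tensors and class-probability maps from a teacher model); it is not the pipeline's actual implementation.

import torch
import torch.nn.functional as F

def mixup(images, soft_masks, alpha=0.2):
    # Blend each batch element with a randomly chosen partner; the (soft)
    # segmentation targets are blended with the same coefficient.
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_masks = lam * soft_masks + (1.0 - lam) * soft_masks[perm]
    return mixed_images, mixed_masks

def soft_kl_loss(student_logits, teacher_probs):
    # Self-training target: KL divergence between the teacher's per-pixel class
    # probabilities and the student's predicted distribution.
    return F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs,
                    reduction="batchmean")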
Self-Supervised Learning of Video-Induced Visual Invariances
Michael Tobias Tschannen
Josip Djolonga
Neil Houlsby
Sylvain Gelly
Conference on Computer Vision and Pattern Recognition (2020)
We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We make use of the natural hierarchy consisting of (i) frame level invariances (e.g. color and contrast robustness), (ii) shot/clip level invariances (e.g. robustness to changes in object orientation and lighting conditions), and (iii) video level invariances (semantic relationships of scenes across shots/clips) to define a holistic self-supervised loss. We train the proposed model on the YouTube-8M dataset and show that this approach leads to state-of-the-art self-supervised results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB). We then show how to co-train the model jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 with 10x fewer labeled images.
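The co-training objective lends itself to a short sketch. The Python (PyTorch) snippet below is a hypothetical illustration, assuming a simple embedding-agreement term for the video-induced invariances and a standard cross-entropy term on the labeled image batch; the actual VIVI losses and weighting are more involved.

import torch
import torch.nn.functional as F

def vivi_cotrain_loss(embed_a, embed_b, image_logits, image_labels, w_sup=1.0):
    # Self-supervised term: two views that should be invariant to each other
    # (e.g. frames from the same shot) are pulled together in embedding space.
    a = F.normalize(embed_a, dim=1)
    b = F.normalize(embed_b, dim=1)
    ssl = (2.0 - 2.0 * (a * b).sum(dim=1)).mean()
    # Supervised term on the jointly trained labeled-image batch.
    sup = F.cross_entropy(image_logits, image_labels)
    return ssl + w_sup * sup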
High-Fidelity Image Generation With Fewer Labels
Michael Tschannen
Sylvain Gelly
International Conference on Machine Learning (2019)
Deep generative models are becoming a cornerstone of modern machine learning. Recent work on conditional generative adversarial networks has shown that learning complex, high-dimensional distributions over natural images is within reach. While the latest models are able to generate high-fidelity, diverse natural images at high resolution, they rely on a vast quantity of labeled data. In this work we demonstrate how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art both on unsupervised ImageNet synthesis and in the conditional setting. In particular, the proposed approach is able to match the sample quality (as measured by FID) of the current state-of-the-art conditional model BigGAN on ImageNet using only 10% of the labels and to outperform it using 20% of the labels.
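For reference, the sample-quality metric cited above (FID) compares the mean and covariance of Inception features for real and generated images. A minimal NumPy/SciPy version is sketched below; the feature statistics are assumed to have been computed beforehand.

import numpy as np
from scipy import linalg

def fid(mu_real, cov_real, mu_fake, cov_fake):
    # Frechet distance between two Gaussians fitted to feature statistics.
    diff = mu_real - mu_fake
    covmean, _ = linalg.sqrtm(cov_real @ cov_fake, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(cov_real + cov_fake - 2.0 * covmean))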
Self-Supervised GANs via Auxiliary Rotation Loss
Ting Chen
Neil Houlsby
Conference on Computer Vision and Pattern Recognition (2018)
Conditional GANs are at the forefront of natural image synthesis. The main drawback of such models is the necessity for labelled data. In this work we exploit two popular unsupervised learning techniques, adversarial training and self-supervision, to close the gap between conditional and unconditional GANs. In particular, we allow the networks to collaborate on the task of representation learning, while being adversarial with respect to the classic GAN game. The role of self-supervision is to encourage the discriminator to learn meaningful feature representations which are not forgotten during training. We test empirically both the quality of the learned image representations, and the quality of the synthesized images. Under the same conditions, the self-supervised GAN attains a similar performance to state-of-the-art conditional counterparts. Finally, we show that this approach to fully unsupervised learning can be scaled to attain an FID of 33 on unconditional ImageNet generation.
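The auxiliary self-supervision task is rotation prediction, which is simple to sketch. The Python (PyTorch) snippet below is an illustrative assumption of the wiring, not the paper's implementation: images are rotated by 0/90/180/270 degrees and a rotation head on the discriminator is trained to predict which rotation was applied.

import torch
import torch.nn.functional as F

def rotation_loss(images, rotation_head):
    # images: [N, C, H, W]; rotation_head maps images to [N, 4] rotation logits.
    rotated, labels = [], []
    for k in range(4):  # 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return F.cross_entropy(rotation_head(torch.cat(rotated)), torch.cat(labels))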
Audio Set: An ontology and human-labeled dataset for audio events
Jort F. Gemmeke
Dylan Freedman
Wade Lawrence
Proc. IEEE ICASSP 2017, New Orleans, LA
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets -- principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 635 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
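As a purely hypothetical illustration of the kind of record such a labeling effort yields, the Python snippet below shows a 10-second segment annotated with ontology class IDs; the field names and example values are invented for illustration and are not the dataset's actual schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SegmentAnnotation:
    video_id: str                  # YouTube video identifier
    start_seconds: float           # segment start within the video
    end_seconds: float             # segment end (10 s after the start)
    class_ids: List[str] = field(default_factory=list)  # ontology classes judged present

example = SegmentAnnotation("abc123xyz", 30.0, 40.0, ["/m/04rlf"])  # illustrative "Music" class ID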
Now Playing: Continuous low-power music recognition
Dominik Roblek
James David Lyon
Julian James Odell
Mihajlo Velimirović
NIPS 2017 Workshop: Machine Learning on the Phone
Existing music recognition applications require both user activation and a connection to a server that performs the actual recognition. In this paper we present a low-power music recognizer that runs entirely on a mobile phone and automatically recognizes music without requiring any user activation. A small music detector runs continuously on the mobile phone’s DSP (digital signal processor) chip and only wakes the main processor when it is confident that music is present. Once woken, the detector on the main processor is provided with an 8 s buffer of audio, which is then fingerprinted and compared to the stored fingerprints in the on-device fingerprint database of over 70,000 songs.
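A toy sketch of the on-device matching step is given below in Python. The hash-based index and the hit-counting match are illustrative stand-ins for the actual fingerprinting scheme and are not the system's implementation.

from collections import Counter

def build_index(song_fingerprints):
    # song_fingerprints: dict mapping song_id -> iterable of fingerprint hashes.
    index = {}
    for song_id, prints in song_fingerprints.items():
        for fp in prints:
            index.setdefault(fp, []).append(song_id)
    return index

def match(buffer_fingerprints, index, min_hits=3):
    # Count how many fingerprints from the audio buffer hit each indexed song
    # and return the best candidate if it clears a small threshold.
    hits = Counter(s for fp in buffer_fingerprints for s in index.get(fp, []))
    if not hits:
        return None
    song_id, count = hits.most_common(1)[0]
    return song_id if count >= min_hits else None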