Andreas Veit
Authored Publications
MarkovGen: Structured Prediction for Efficient Text-to-Image Generation
Sadeep Jayasumana
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a lightweight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, MarkovGen, uses this proposed MRF model to both speed up Muse by 1.5X and produce higher-quality images by decreasing undesirable image artifacts.
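As a rough illustration of the last point, here is a minimal PyTorch sketch of one differentiable mean-field-style MRF update over a grid of image-token logits; the module name, the 4-neighborhood messages, and the learned compatibility matrix are illustrative assumptions, not the paper's exact formulation:

import torch
import torch.nn as nn

class MRFLayer(nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        # Learned compatibility between token labels at neighboring positions.
        self.compat = nn.Parameter(torch.zeros(vocab_size, vocab_size))

    def forward(self, unary_logits):
        # unary_logits: (batch, height, width, vocab) from the base model.
        q = unary_logits.softmax(dim=-1)
        # Aggregate beliefs from the 4-neighborhood via spatial shifts.
        neighbors = (q.roll(1, dims=1) + q.roll(-1, dims=1)
                     + q.roll(1, dims=2) + q.roll(-1, dims=2)) / 4.0
        # One mean-field-style refinement; everything stays differentiable,
        # so self.compat can be learned with back-propagation.
        return unary_logits + neighbors @ self.compat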
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
Sadeep Jayasumana
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images and that of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, does not reflect the gradual improvement of iterative text-to-image models, does not capture distortion levels, and produces inconsistent results when the sample size varies. We also propose an alternative metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that makes no assumptions about the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
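For concreteness, here is a minimal NumPy sketch of an unbiased squared-MMD estimator under a Gaussian RBF kernel, of the kind the abstract describes; the function name and bandwidth value are assumptions, and the inputs would be CLIP embeddings of real and generated images:

import numpy as np

def mmd2_unbiased(x, y, sigma=10.0):
    # x: (m, d) and y: (n, d) embeddings; sigma is an illustrative bandwidth.
    def rbf(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    k_xx, k_yy, k_xy = rbf(x, x), rbf(y, y), rbf(x, y)
    m, n = len(x), len(y)
    # Diagonal terms are dropped so the within-set averages are unbiased.
    term_x = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_y = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * k_xy.mean()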
Teacher Guided Training: An Efficient Framework for Knowledge Transfer
Chong You
Himanshu Jain
Rob Fergus
International Conference on Learning Representations (2023) (to appear)
The remarkable performance gains realized by large pretrained models, e.g., GPT-3, hinge on the massive amounts of data they are exposed to during training. Analogously, distilling such large models into compact models for efficient deployment also necessitates a large amount of (labeled or unlabeled) training data. In this paper, we devise the teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pre-trained generative models while obviating the need to go through a large volume of data. TGT exploits the fact that the teacher has acquired a good representation of the underlying data domain, which typically corresponds to a much lower-dimensional manifold than the ambient space. Furthermore, we can use the teacher to explore the instance space more efficiently through sampling or gradient-based methods, making TGT especially attractive for limited-data or long-tail settings. We formally capture this benefit of the proposed data-domain exploration in our generalization bounds. Among our empirical evaluations, we find that TGT can improve accuracy on ImageNet-LT by 10% compared to a natural baseline and match accuracy on sentiment analysis of Amazon reviews without the need for pretraining.
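A hedged sketch of what one teacher-guided training step could look like; the generator.sample interface and the KL-based distillation loss are assumptions for illustration, not the paper's exact procedure:

import torch
import torch.nn.functional as F

def tgt_step(generator, teacher, student, optimizer, batch_size=32):
    # Draw synthetic inputs from the teacher's data manifold rather than a
    # large labeled corpus, then distill the teacher's soft predictions.
    with torch.no_grad():
        x = generator.sample(batch_size)  # hypothetical generator interface
        teacher_probs = teacher(x).softmax(dim=-1)
    loss = F.kl_div(student(x).log_softmax(dim=-1), teacher_probs,
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()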
RankDistil: Distillation for Ranking
Aditya Krishna Menon
AISTATS (2021)
Knowledge distillation is an approach to improving the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings. In particular, the use of distillation in top-k ranking settings, where the goal is to rank the k most relevant items correctly, remains largely unexplored. In this paper, we study such ranking problems through the lens of distillation. We present a framework for distillation for top-k ranking and establish connections with existing ranking methods. The core idea of this framework is to preserve the ranking at the top by matching the k largest scores of the student and teacher while penalizing large scores for items ranked low by the teacher. Building on our framework, we develop a novel distillation approach, RankDistil, specifically catered to ranking problems with a large number of items to rank. Finally, we conduct experiments that demonstrate that RankDistil yields benefits over commonly used baselines for ranking problems.
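To make the core idea concrete, here is an illustrative top-k distillation loss in the spirit of the abstract (match the teacher's k largest scores, push down student scores on items the teacher ranks low); the paper's exact objective may differ:

import torch

def rankdistil_loss(student_scores, teacher_scores, k=10, margin=0.0):
    # (1) Preserve the top of the ranking: match the teacher's k largest
    # scores with the student's scores on the same items.
    top = teacher_scores.topk(k, dim=-1)
    student_top = student_scores.gather(-1, top.indices)
    match = ((student_top - top.values) ** 2).mean()
    # (2) Penalize large student scores on items the teacher ranks low.
    low_rank = torch.ones_like(student_scores, dtype=torch.bool)
    low_rank.scatter_(-1, top.indices, False)
    push_down = torch.relu(student_scores[low_rank] - margin).mean()
    return match + push_down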
Coping with label shift via distributionally robust optimisation
Jingzhao Zhang
Aditya Krishna Menon
Suvrit Sra
International Conference on Learning Representations (2021)
The label shift problem refers to the supervised learning setting wherein the train and test label distributions do not match. Existing work on this problem largely assumes access to an unlabelled test sample, which may be used to estimate the test label distribution. While such techniques have proven effective, it is not always feasible to access the target domain; further, this requires retraining if the model is to be deployed in multiple test environments. Can one instead learn a single classifier that is robust to arbitrary shifts from a certain family? In this paper, we propose such a technique based on distributionally robust optimization (DRO) using f-divergences. We design a gradient descent-proximal mirror ascent algorithm tailored for large-scale finite-sum problems to efficiently optimize this objective, and establish its convergence. We show through experiments on CIFAR-100 and ImageNet that our technique can significantly improve performance over a number of baselines in settings where the test label distribution is varied.
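As a rough sketch of the kind of objective involved, the following computes a penalized label-shift DRO loss with a chi-squared divergence; the penalized form and all names are assumptions for illustration, not the paper's exact algorithm:

import torch

def dro_label_shift_loss(per_class_loss, weights, prior, penalty=0.1):
    # per_class_loss: average loss per class; weights, prior: distributions
    # over classes. The adversary raises `weights` on hard classes, while a
    # chi-squared penalty keeps it near the training prior; the model is
    # trained on the resulting worst-case reweighted loss.
    chi_sq = (((weights / prior) - 1.0) ** 2 * prior).sum()
    return (weights * per_class_loss).sum() - penalty * chi_sq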
Long-tail learning via logit adjustment
Aditya Krishna Menon
Himanshu Jain
Sadeep Jayasumana
International Conference on Learning Representations (ICLR) (2021)
Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naive learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques involve logit adjustment based on the label priors, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a high relative margin between logits of rare versus dominant labels. Our techniques unify and generalise several recent proposals in the literature, while possessing stronger theoretical guarantees and empirical performance.
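The post-hoc variant admits a very short sketch: subtract the scaled log class priors from a trained model's logits before taking the argmax (the function name and NumPy interface are illustrative):

import numpy as np

def logit_adjusted_predict(logits, class_priors, tau=1.0):
    # Post-hoc adjustment: subtracting tau * log(prior) raises the relative
    # margin of rare labels over dominant ones; tau controls the strength.
    adjusted = logits - tau * np.log(class_priors)
    return adjusted.argmax(axis=-1)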
Understanding Robustness of Transformers for Image Classification
Daliang Li
Thomas Unterthiner
Proceedings of the IEEE/CVF International Conference on Computer Vision (2021) (to appear)
Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like the Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture, such as the use of non-overlapping patches, lead one to wonder whether these networks are as robust as their convolutional counterparts. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as their ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.
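A minimal sketch of the layer-removal probe, assuming the model exposes its transformer blocks as an indexable model.blocks attribute (as, e.g., timm ViT implementations do):

import copy
import torch.nn as nn

def layer_removal_sweep(model, eval_fn):
    # Replace each transformer block in turn with an identity map and
    # re-evaluate; the residual structure keeps the network well-defined.
    scores = {}
    for i in range(len(model.blocks)):
        probe = copy.deepcopy(model)
        probe.blocks[i] = nn.Identity()
        scores[i] = eval_fn(probe)  # e.g. top-1 validation accuracy
    return scores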
Why are Adaptive Methods Good for Attention Models?
Jingzhao Zhang
Sai Praneeth Karimireddy
Suvrit Sra
Advances in Neural Information Processing Systems (NeurIPS) (2020)
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like clipped SGD and Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not yet well understood. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise. Subsequently, we show how clipping can be applied in practice by developing an adaptive coordinate-wise clipping algorithm (ACClip) and demonstrate its superior performance on BERT pretraining and finetuning tasks.
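A hypothetical sketch of adaptive coordinate-wise clipping in the spirit of ACClip; the running-magnitude threshold and update form are assumptions, not the paper's exact algorithm:

import torch

def acclip_step(grad, state, lr=1e-3, beta=0.99, eps=1e-8):
    # Keep a running estimate of each coordinate's gradient magnitude and
    # clip the incoming gradient to it, coordinate by coordinate, which
    # tames heavy-tailed noise without rescaling the whole gradient vector.
    tau = state.setdefault("tau", torch.zeros_like(grad))
    tau.mul_(beta).add_(grad.abs(), alpha=1.0 - beta)
    clipped = torch.sign(grad) * torch.minimum(grad.abs(), tau + eps)
    return -lr * clipped  # the update to add to the parameters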
Convolutional Networks with Adaptive Inference Graphs
Serge Belongie
European Conference on Computer Vision (ECCV) (2018)
Do convolutional networks really need a fixed feed-forward structure? What if, after identifying the high-level concept of an image, a network could move directly to a layer that can distinguish fine-grained differences? Currently, a network would first need to execute sometimes hundreds of intermediate layers that specialize in unrelated aspects. Ideally, the more a network already knows about an image, the better it should be at deciding which layer to compute next. In this work, we propose convolutional networks with adaptive inference graphs (ConvNet-AIG) that adaptively define their network topology conditioned on the input image. Following a high-level structure similar to residual networks (ResNets), ConvNet-AIG decides for each input image on the fly which layers are needed. In experiments on ImageNet we show that ConvNet-AIG learns distinct inference graphs for different categories. Both ConvNet-AIG with 50 and 101 layers outperform their ResNet counterparts, while using 20% and 38% less computation, respectively. By grouping parameters into layers for related classes and only executing relevant layers, ConvNet-AIG improves both efficiency and overall classification quality. Lastly, we also study the effect of adaptive inference graphs on susceptibility to adversarial examples. We observe that ConvNet-AIG shows higher robustness than ResNets, complementing other known defense mechanisms.
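A minimal sketch of such a gated residual block, using a hard Gumbel-softmax to keep the execute-or-skip decision differentiable; the gate architecture shown is an illustrative simplification, not the paper's exact design:

import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    def __init__(self, layer: nn.Module, channels: int):
        super().__init__()
        self.layer = layer  # any residual layer with matching in/out shape
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 2),  # logits for [skip, execute]
        )

    def forward(self, x):
        # Hard Gumbel-softmax keeps the binary decision differentiable during
        # training; at inference, skipped layers need not be computed at all.
        decision = F.gumbel_softmax(self.gate(x), hard=True)
        execute = decision[:, 1].view(-1, 1, 1, 1)
        return x + execute * self.layer(x)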
Learning From Noisy Large-Scale Datasets With Minimal Supervision
Neil Alldrin
Gal Chechik
Ivan Krasin
Abhinav Gupta
Serge Belongie
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 839-847
We present an approach to effectively utilize small sets of reliable labels in conjunction with massive datasets of noisy labels to learn powerful image representations. A common approach is to pre-train a network using the large set of noisy labels and fine-tune it using the clean labels. We present an alternative: we use the clean labels to capture the structure of the label space and learn a mapping between noisy and clean labels. This allows us to "clean" the dataset and fine-tune the network using both the clean labels and the full dataset with reduced noise. The approach comprises a multi-task network that jointly learns to clean noisy labels and to annotate images with accurate labels. We evaluate our approach on the recently released Open Images dataset, containing ∼9 million images with multiple annotations per image. Our results demonstrate that the proposed approach outperforms fine-tuning across all major groups of labels in the Open Images dataset. The approach is particularly effective on the large number of labels with 20-80% label noise.
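A hedged sketch of the label-cleaning component: conditioned on image features and the noisy label vector, a small head predicts cleaned multi-label targets; the dimensions and single linear layer are illustrative assumptions:

import torch
import torch.nn as nn

class LabelCleaningHead(nn.Module):
    def __init__(self, feat_dim: int, num_labels: int):
        super().__init__()
        # Maps image features plus the noisy label vector to cleaned labels.
        self.cleaner = nn.Linear(feat_dim + num_labels, num_labels)

    def forward(self, features, noisy_labels):
        joint = torch.cat([features, noisy_labels], dim=-1)
        # Supervised on the small verified set; the cleaned targets can then
        # supervise an image-annotation head on the full noisy dataset.
        return torch.sigmoid(self.cleaner(joint))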