Srikumar Ramalingam
Authored Publications
MarkovGen: Structured Prediction for Efficient Text-to-Image Generation
Sadeep Jayasumana
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Abstract
Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a lightweight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, MarkovGen, uses this proposed MRF model to both speed up Muse by 1.5X and produce higher-quality images by reducing undesirable image artifacts.
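The abstract is high level; as a rough illustration, a few steps of mean-field-style refinement over discrete image-token logits with a learned pairwise compatibility matrix could look like the sketch below. This is a minimal reading of the idea, not the paper's implementation; `compat`, `neighbors`, and the update rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mrf_refine(logits, compat, neighbors, num_iters=3):
    """Toy mean-field refinement of image-token predictions (illustrative only).

    logits:    (num_positions, vocab) unary scores from the base model.
    compat:    (vocab, vocab) learned token-compatibility matrix.
    neighbors: neighbors[i] is a LongTensor of spatial neighbors of position i.
    """
    q = F.softmax(logits, dim=-1)
    for _ in range(num_iters):
        # Message to each position: expected compatibility with its neighbors.
        msgs = torch.stack([q[nbr].sum(dim=0) @ compat for nbr in neighbors])
        q = F.softmax(logits + msgs, dim=-1)  # fuse unaries with pairwise messages
    return q
```

Because every step above is differentiable, a compatibility matrix like `compat` can be learned by back-propagation, which is what makes treating MRF inference as a network layer cheap to train.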
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
Sadeep Jayasumana
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Abstract
As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, does not reflect gradual improvement of iterative text-to-image models, does not capture distortion levels, and produces inconsistent results when varying the sample size. We also propose an alternative metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions about the probability distribution of the embeddings, and it is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
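For intuition, the maximum mean discrepancy with a Gaussian RBF kernel has a standard unbiased estimator that can be computed directly on two sets of embeddings; a minimal sketch follows. The bandwidth value and the assumption that CLIP embeddings are precomputed are mine, not the paper's.

```python
import numpy as np

def mmd2_rbf(x, y, sigma=10.0):
    """Unbiased squared MMD between embedding sets x (m, d) and y (n, d).

    With x = CLIP embeddings of real images and y = embeddings of generated
    images, this plays the role the abstract describes for CMMD; sigma is an
    illustrative bandwidth choice.
    """
    def rbf(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
        return np.exp(-d2 / (2.0 * sigma**2))

    m, n = len(x), len(y)
    kxx, kyy, kxy = rbf(x, x), rbf(y, y), rbf(x, y)
    # Unbiased estimator: exclude the self-similarity (diagonal) terms.
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2.0 * kxy.mean())
```

Unlike FID, nothing here fits a Gaussian to the embeddings; the estimator is distribution-free, which is the property the abstract emphasizes.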
Learning ABCs: Approximate Bijective Correspondence for isolating factors of variation with weak supervision
Kieran Alexander Murphy
Varun Jampani
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022, to appear)
Abstract
Representational learning forms the backbone of most deep learning applications, and the value of a learned representation depends on its information content about the different factors of variation. Learning good representations is intimately tied to the nature of supervision and the learning algorithm. We propose a novel algorithm that relies on a weak form of supervision, where the data is partitioned into sets according to certain inactive factors of variation. Our key insight is that by seeking approximate correspondence between elements of different sets, we learn strong representations that exclude the inactive factors of variation and isolate the active factors that vary within all sets. We demonstrate that the method can work in a semi-supervised scenario, and that a portion of the unsupervised data can belong to a different domain entirely, as long as the same active factors of variation are present. By folding in data augmentation to suppress additional nuisance factors, we are able to further control the content of the learned representations. We outperform competing baselines on the challenging problem of synthetic-to-real object pose transfer.
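One plausible way to render "approximate bijective correspondence" in code is a soft nearest-neighbor match from one set to the other and back, trained so that the round trip returns the starting element. The loss below is an illustrative reading under that assumption, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def abc_loss(za, zb, temperature=0.1):
    """Soft cycle-consistency between embedding sets za (n, d) and zb (m, d).

    A round trip A -> B -> A can only land back on the starting element if the
    embeddings ignore the inactive factors that differ between the two sets.
    """
    za = F.normalize(za, dim=-1)
    zb = F.normalize(zb, dim=-1)
    p_ab = F.softmax(za @ zb.T / temperature, dim=-1)  # soft match A -> B
    p_ba = F.softmax(zb @ za.T / temperature, dim=-1)  # soft match B -> A
    p_cycle = p_ab @ p_ba                              # round-trip distribution
    target = torch.arange(za.shape[0])                 # each row should return home
    return F.nll_loss(torch.log(p_cycle + 1e-9), target)
```

Since the sets share active factors but differ in their inactive ones, minimizing a loss of this kind pressures the representation to encode only what varies within every set.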
Abstract
Single image pose estimation is a fundamental problem in many vision and robotics tasks, and existing deep learning approaches suffer from incompletely modeling and handling: i) uncertainty about the predictions, and ii) symmetric objects with multiple (sometimes infinite) correct poses. To this end, we introduce a method to estimate arbitrary, non-parametric distributions on SO(3). Our key idea is to represent the distributions implicitly, with a neural network that estimates the probability given the input image and a candidate pose. Grid sampling or gradient ascent can be used to find the most likely pose, but it is also possible to evaluate the probability at any pose, enabling reasoning about symmetries and uncertainty. This is the most general way of representing distributions on manifolds, and to showcase the rich expressive power, we introduce a dataset of challenging symmetric and nearly symmetric objects. We require no supervision on pose uncertainty; the model trains only with a single pose per example. Nonetheless, our implicit model is expressive enough to handle complex distributions over 3D poses, while still obtaining accurate pose estimates in standard non-ambiguous environments, achieving state-of-the-art performance on the Pascal3D+ and ModelNet10-SO(3) benchmarks. Code, data, and visualizations may be found at implicit-pdf.github.io.
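To make the implicit representation concrete: the network scores (image feature, candidate rotation) pairs, and normalizing those scores over a grid of rotations yields a distribution that can be queried at any pose. The sketch below assumes `net` and the grid are given, and it glosses over proper volume normalization on SO(3).

```python
import torch

def pose_distribution(net, img_feat, rotation_grid):
    """Evaluate an implicit distribution over SO(3) (schematic).

    net:           maps a concatenated (feature, flattened rotation) vector
                   to a scalar score; assumed given.
    img_feat:      (d,) image feature vector.
    rotation_grid: (n, 3, 3) roughly uniform grid of rotation matrices.
    """
    feats = img_feat.expand(len(rotation_grid), -1)               # tile the feature
    inputs = torch.cat([feats, rotation_grid.flatten(1)], dim=-1)
    scores = net(inputs).squeeze(-1)
    return torch.softmax(scores, dim=0)  # probability mass per grid rotation
```

The argmax of the returned vector is the most likely pose, while several comparable modes directly expose object symmetries and uncertainty.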
Abstract
It is generally believed that robust training of extremely large networks is critical to their success in real-world applications. However, when taken to the extreme, methods that promote robustness can hurt the model's sensitivity to rare or underrepresented patterns. In this paper, we discuss this trade-off between sensitivity and robustness to natural (non-adversarial) perturbations by introducing two notions: contextual feature utility and contextual feature sensitivity. We propose Feature Contrastive Learning (FCL) that encourages a model to be more sensitive to the features that have higher contextual utility. Empirical results demonstrate that models trained with FCL achieve a better balance of robustness and sensitivity, leading to improved generalization in the presence of noise on both vision and NLP datasets.
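The abstract defines the two notions only informally; one hedged way to operationalize them (my reading, not the paper's definitions) is to measure utility as the loss increase when a feature is ablated, and sensitivity as the gradient magnitude with respect to that feature:

```python
import torch

def contextual_stats(loss_fn, feats, labels):
    """Toy estimates of contextual feature utility and sensitivity.

    utility[i]:     how much the loss grows when feature i is zeroed out.
    sensitivity[i]: |d loss / d feature_i|, averaged over the batch.
    Both definitions are illustrative assumptions, not the paper's.
    """
    feats = feats.detach().requires_grad_(True)
    base = loss_fn(feats, labels)
    (grad,) = torch.autograd.grad(base, feats)
    sensitivity = grad.abs().mean(dim=0)

    utility = torch.zeros(feats.shape[1])
    for i in range(feats.shape[1]):
        ablated = feats.detach().clone()
        ablated[:, i] = 0.0                      # knock out one feature
        utility[i] = loss_fn(ablated, labels) - base.detach()
    return utility, sensitivity
```

A training objective in the spirit of FCL would then reward sensitivity on features whose utility is high, rather than suppressing sensitivity uniformly in the name of robustness.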