Vighnesh Birodkar
I got interested in Computer Vision while trying to make simple robots autonomous during my undergraduate degree. During that time, I also wrote open-source image-processing modules for SimpleCV and scikit-image. My interests led me to my Master's degree at NYU, where I initially worked on wavelet-based convolutions for detecting reflection symmetry in images. I became interested in deep-learning-based methods when I used them to analyze ultrasound images of the human heart. While at NYU, I also worked on unsupervised learning methods for video that automatically disentangle content and pose in video frames. Before coming to Google, I worked at FeatureX, where I used deep neural networks to solve a variety of computer vision problems for satellite images. As a researcher, I like to ask a lot of "why" and "how" questions. As of now, my biggest fascination is the loss landscape of deep neural networks and how SGD manages to traverse it. My current research focuses on finding curricula to train neural networks faster. During my residency, I have repeatedly been impressed by the amount of compute and learning resources at Google. I especially like the research environment here because it encourages me to tackle ambitious problems.
I am a huge fan of open-source software and the Python programming language. I love action and science-fiction movies, particularly how they can sometimes inspire real-world inventions.
Authored Publications
Scaling Vision Transformers to 22 Billion Parameters
Josip Djolonga
Basil Mustafa
Piotr Padlewski
Justin Gilmer
Mathilde Caron
Rodolphe Jenatton
Michael Tschannen
Anurag Arnab
Carlos Riquelme
Fisher Yu
Avital Oliver
Fantine Huot
Mark Collier
Yi Tay
Filip Pavetić
Thomas Kipf
arXiv (2023)
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk
Lijun Yu
Xiuye Gu
Rachel Hornung
Hassan Akbari
Ming-Chang Chiu
Josh Dillon
Agrim Gupta
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez
Grant Schindler
Huisheng Wang
Jimmy Yan
Xuan Yang
Lu Jiang
arXiv preprint (2023) (to appear)
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
Proper Reuse of Image Classification Features Improves Object Detection
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2022), pp. 13628-13637
A largely accepted practice in transfer learning is to pre-train a model on a data-abundant upstream task and use the pre-trained weights for model initialization on the downstream task. Specifically, in Object Detection (OD) it is common to initialize the feature backbone with pre-trained ImageNet classifier weights and fine-tune those weights along with the other detection model parameters.
Recent work has shown that this practice is not strictly necessary and that it is possible to train an object detector from scratch by training for much longer.
In this work we investigate the opposite end of the training spectrum and keep the feature backbone frozen during object detection training, preserving the classifier initialization. Contrary to the common belief that object detectors benefit from end-to-end training, we conjecture that the weight initialization obtained from training a classifier contains useful knowledge that is forgotten by fine-tuning or avoided entirely when training from scratch, with negative consequences for long-tail classes. As an immediate contribution of our findings, we show that it is possible to train an off-the-shelf object detection model with similar if not superior performance while significantly reducing the need for computational resources, in terms of both memory and computation (FLOPs).
The performance benefits of the proposed upstream task knowledge preservation are even clearer when stratifying results by classes and number of annotations available. Our results on MSCOCO, LVIS and Pascal VOC show that our extreme formulation of model reuse has a clear positive impact on full-shot object detection and also on typical hard cases, such as classes with few annotations, as found in long-tail object recognition and few-shot learning.
Less is More: Generating Grounded Navigation Instructions from Landmarks
Jordi Orbay
Izzeddin Gur
Peter Anderson
CVPR (2022) (to appear)
We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first-stage landmark detector and a second-stage generator -- a multimodal, multilingual, multitask encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8B images, we identify 1.1M English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 71% following MARKY-MT5's instructions, just shy of their 75% SR following human instructions -- and well above SRs with other generators. Evaluations on RxR's longer, diverse paths obtain 61-64% SRs in three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.