Meera Hahn
I am a research scientist at Google Research, working predominantly at the intersection of computer vision and natural language processing. I joined Google in 2022 after completing my PhD at Georgia Tech. My research interests include embodied AI, text-based navigation and localization, text-to-image and text-to-video generation, and general multimodal AI tasks. Check out my homepage for more about me and my research.
Authored Publications
Photorealistic Video Generation with Diffusion Models
Agrim Gupta
Kihyuk Sohn
Xiuye Gu
Fei-Fei Li
Lu Jiang
ECCV (2024)
Abstract
We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of 512×896 resolution at 8 frames per second.
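The window attention idea in the abstract can be made concrete with a small sketch. The following is a minimal illustration, not the W.A.L.T implementation: the shapes, window sizes, and the use of torch.nn.MultiheadAttention are all assumptions. It shows attention restricted to non-overlapping windows of a spatiotemporal latent grid, alternating a spatial window with a spatiotemporal one.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping (time, height, width) windows.

    Illustrative sketch only; not the authors' architecture.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, window: tuple) -> torch.Tensor:
        # x: (B, T, H, W, C) latent video tokens; window: (wt, wh, ww),
        # assumed to evenly divide T, H, and W.
        B, T, H, W, C = x.shape
        wt, wh, ww = window
        # Partition the latent grid into windows of wt*wh*ww tokens each.
        x = x.reshape(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        # Attend only within each window, so cost scales with window size
        # rather than the full grid.
        x, _ = self.attn(x, x, x)
        # Undo the window partition.
        x = x.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return x

layer = WindowAttention(dim=64)
tokens = torch.randn(2, 4, 8, 8, 64)   # (B, T, H, W, C)
out = layer(tokens, window=(1, 4, 4))  # spatial window: attends within frames
out = layer(out, window=(4, 2, 2))     # spatiotemporal window: attends across time
```

Because each attention call sees only the tokens inside one window, memory stays bounded by the window size, which is the efficiency motivation the abstract describes.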
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
James M. Rehg
Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (AACL) (2022)
Abstract
We address the challenging task of Localization via Embodied Dialog (LED). Given a dialog between two agents, an Observer navigating an unknown environment and a Locator attempting to identify the Observer's location, the goal is to predict the Observer's final location on a map. We develop a novel LED-Bert architecture and present an effective pretraining strategy. We show that a graph-based scene representation is more effective than the top-down 2D maps used in prior work, and our approach outperforms previous baselines.
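As a rough illustration of the localization setup, here is a minimal sketch of scoring candidate scene-graph nodes against an encoded dialog. Everything here is an assumption for illustration: the pooled dialog embedding, node features, dimensions, and the dot-product scoring head are placeholders, not the paper's LED-Bert architecture.

```python
import torch
import torch.nn as nn

class LocationScorer(nn.Module):
    """Scores candidate graph nodes against a dialog encoding (illustrative sketch)."""

    def __init__(self, text_dim: int = 768, node_dim: int = 128, shared_dim: int = 256):
        super().__init__()
        # Project the dialog and graph-node embeddings into a shared space.
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.node_proj = nn.Linear(node_dim, shared_dim)

    def forward(self, dialog_emb: torch.Tensor, node_feats: torch.Tensor) -> torch.Tensor:
        # dialog_emb: (B, text_dim) pooled encoding of the full dialog.
        # node_feats: (B, N, node_dim) features for N candidate graph nodes.
        q = self.text_proj(dialog_emb).unsqueeze(1)  # (B, 1, shared_dim)
        k = self.node_proj(node_feats)               # (B, N, shared_dim)
        scores = (q * k).sum(-1)                     # (B, N) dot-product scores
        return scores.softmax(-1)                    # distribution over candidates

# Usage: pick the node with the highest probability as the Observer's location.
scorer = LocationScorer()
probs = scorer(torch.randn(2, 768), torch.randn(2, 40, 128))
pred = probs.argmax(-1)  # predicted node index per example
```

Treating locations as nodes in a scene graph, as sketched here, is what lets the model score discrete candidate viewpoints instead of regressing pixel coordinates on a top-down 2D map.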