Meera Hahn

I am a research scientist at Google Research, working predominantly at the intersection of computer vision and natural language processing. I joined Google in 2022 after completing my PhD at Georgia Tech. My research interests include embodied AI, text-based navigation and localization, text-to-image and text-to-video generation, and general multimodal AI tasks. Check out my homepage for more about me and my research.
Authored Publications
Photorealistic Video Generation with Diffusion Models (W.A.L.T)
We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos at 512×896 resolution and 8 frames per second.
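The window attention idea above can be illustrated with a minimal sketch. The following Python/PyTorch snippet is not the W.A.L.T implementation; it only shows the core mechanism of restricting self-attention to local spatiotemporal windows of a video latent, so that cost scales with the window size rather than with the full (frames, height, width) volume. The function name, shapes, and window sizes are illustrative assumptions.

```python
# Hypothetical sketch of windowed spatiotemporal attention, not the W.A.L.T
# implementation: attention runs independently inside small (t, h, w) windows
# of a video latent instead of over the entire token volume.
import torch
import torch.nn.functional as F


def window_attention(latent, t_win, h_win, w_win):
    """Self-attention computed within (t_win, h_win, w_win) windows of a
    video latent shaped (T, H, W, C). Window sizes must divide T, H, W."""
    T, H, W, C = latent.shape
    # Partition the latent volume into non-overlapping spatiotemporal windows.
    x = latent.reshape(T // t_win, t_win, H // h_win, h_win, W // w_win, w_win, C)
    x = x.permute(0, 2, 4, 1, 3, 5, 6)           # (nT, nH, nW, t, h, w, C)
    x = x.reshape(-1, t_win * h_win * w_win, C)  # (num_windows, tokens, C)
    # Plain scaled dot-product attention within each window (q = k = v here;
    # a real model would apply learned projections first).
    out = F.scaled_dot_product_attention(x, x, x)
    # Undo the window partition back to (T, H, W, C).
    out = out.reshape(T // t_win, H // h_win, W // w_win, t_win, h_win, w_win, C)
    out = out.permute(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, C)
    return out


# Toy usage: a 16-frame, 32x32, 64-channel latent. A 1x8x8 window gives
# per-frame spatial attention; a 4x8x8 window mixes information across time.
latent = torch.randn(16, 32, 32, 64)
spatial = window_attention(latent, t_win=1, h_win=8, w_win=8)
spatiotemporal = window_attention(latent, t_win=4, h_win=8, w_win=8)
print(spatial.shape, spatiotemporal.shape)  # both torch.Size([16, 32, 32, 64])
```

Alternating window shapes like these is what lets one architecture handle both images (spatial-only windows) and videos (spatiotemporal windows) in a shared latent space.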
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn, James M. Rehg
The Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Association for Computational Linguistics (2022)
We address the challenging task of Localization via Embodied Dialog (LED). Given a dialog between two agents, an Observer navigating through an unknown environment and a Locator attempting to identify the Observer's location, the goal is to predict the Observer's final location on a map. We develop a novel LED-Bert architecture and present an effective pretraining strategy. We show that a graph-based scene representation is more effective than the top-down 2D maps used in prior work. Our approach outperforms previous baselines.
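To make the graph-based formulation concrete, here is a minimal Python/PyTorch sketch of localization as scoring over navigation-graph nodes. This is not LED-Bert: it assumes precomputed dialog and node embeddings (the names dialog_emb, node_embs, and localize are hypothetical) and simply selects the node most similar to the dialog, which captures the task setup rather than the model itself.

```python
# Hypothetical sketch of graph-node localization, not the LED-Bert model:
# the environment is a navigation graph, each node carries an embedding, and
# the predicted Observer location is the node whose embedding best matches
# an embedding of the Observer/Locator dialog.
import torch


def localize(dialog_emb, node_embs):
    """Return a distribution over graph nodes and the argmax node index.

    dialog_emb: (D,) embedding of the dialog between Observer and Locator.
    node_embs:  (N, D) one embedding per navigation-graph node.
    """
    scores = node_embs @ dialog_emb       # (N,) similarity score per node
    probs = torch.softmax(scores, dim=0)  # distribution over candidate nodes
    return probs, int(torch.argmax(probs))


# Toy usage with random embeddings for a 50-node scene graph.
torch.manual_seed(0)
dialog_emb = torch.randn(256)
node_embs = torch.randn(50, 256)
probs, best_node = localize(dialog_emb, node_embs)
print(best_node, float(probs[best_node]))
```

Framing the output space as graph nodes, rather than pixels of a top-down 2D map, is what allows a pretrained transformer to score each candidate viewpoint directly against the dialog.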