Jason Baldridge
Jason is a research scientist at Google, where he works on natural language understanding. He was previously an Associate Professor of Computational Linguistics at the University of Texas at Austin. His main research interests include categorial grammars, parsing, semi-supervised learning for NLP, reference resolution and text geolocation. He has long been active in the creation and promotion of open source software for natural language processing, including co-creating the Apache OpenNLP Toolkit and OpenCCG. Jason received his Ph.D. from the University of Edinburgh in 2002, where his doctoral dissertation on Multimodal Combinatory Categorial Grammar was awarded the 2003 Beth Dissertation Prize from the European Association for Logic, Language and Information.
Authored Publications
DOCCI: Descriptions of Connected and Contrasting Images
Garrett Tanzer
Jaemin Cho
Su Wang
Sunayana Rane
Zack Berger
Zarana Parekh
(2024)
Despite recent advancements, text-to-image (T2I) models still exhibit critical limitations, such as errors in understanding spatial relationships, object counting, and text rendering. One challenge in overcoming these failure modes is the lack of resources: the majority of existing image-text datasets provide only brief captions that do not offer sufficient detail to reveal discrepancies between images and their descriptions. To advance the development of T2I models further, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset of 15k images taken by a single person, with detailed human-annotated descriptions in English. We meticulously annotated detailed and coherent descriptions, averaging 136 words, which sufficiently differentiate images from related or similar ones. We intentionally curated images that showcase a diverse range of visual properties, including entities with their attributes, various orientations, and lighting effects, many of which are related to each other. We thoroughly analyze the quality and characteristics of the image-description pairs, and assess the performance of the latest T2I and image-to-text (I2T) models. The experimental results indicate that current state-of-the-art T2I models still struggle with the aforementioned challenges. DOCCI is publicly available, and we believe that this dataset will be a valuable benchmark for vision-language research.
ImageInWords: Unlocking Hyper-Detailed Image Descriptions
Andrew Bunner
Ranjay Krishna
(2024)
Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging.
Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process.
We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets.
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Aishwarya Kamath
Peter Anderson
Su Wang
Jing Yu Koh
Yinfei Yang
Zarana Parekh
CVPR (2023)
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pre-training on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale training on near-human quality synthetic instructions.
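The NDTW figures quoted above refer to normalized dynamic time warping, which scores how closely an agent's path follows a reference path as exp(-DTW / (|R| * d)), where |R| is the number of reference waypoints and d is a distance threshold. The following is a minimal sketch only, assuming waypoints given as coordinate tuples, Euclidean distance, and a 3 m threshold; the function names and the threshold default are illustrative assumptions rather than details taken from the paper. The values in the abstract appear to be this score scaled by 100.

```python
import math


def dtw(reference, query):
    """Dynamic time warping cost between two waypoint sequences."""
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] = best alignment cost of reference[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance reference only
                                 cost[i][j - 1],      # advance query only
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def ndtw(reference, query, threshold=3.0):
    """Normalized DTW in (0, 1]; 1.0 means the paths coincide.

    threshold is an assumed success distance in metres.
    """
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


reference_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
agent_path = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]  # same route, offset by 1 m
score = ndtw(reference_path, agent_path)
```

Because the score decays exponentially with the accumulated deviation, an agent is rewarded for staying near the whole reference path and not merely for reaching the goal.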
Simple and Effective Synthesis of Indoor 3D Scenes
Jing Yu Koh
Harsh Agrawal
Dhruv Batra
Honglak Lee
Yinfei Yang
Peter Anderson
AAAI (2023) (to appear)
We study the problem of synthesizing immersive 3D indoor scenes from one or a few images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images while maintaining 3D consistency. Existing approaches are highly complex, with many separately trained stages and components. We propose a simple alternative: an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images. On the Matterport3D and RealEstate10K datasets, our approach significantly outperforms prior work when evaluated by humans, as well as on FID scores. Further, we show that our model is useful for generative data augmentation. A vision-and-language navigation (VLN) agent trained with trajectories spatially perturbed by our model improves success rate by up to 1.5% over a state-of-the-art baseline on the mature R2R benchmark. Our code is publicly released to facilitate generative data augmentation and applications to downstream robotics and embodied AI tasks.
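To make "reprojections of incomplete point clouds" concrete, the sketch below forward-warps a single RGB-D image into a new camera pose with a pinhole model, leaving unobserved pixels as holes. It covers only this generic geometric step, not the GAN that fills the holes, and it is not the authors' released code: the function name, the intrinsics matrix K, and the 4x4 source-to-target transform T are all illustrative assumptions.

```python
import numpy as np


def reproject_rgbd(rgb, depth, K, T):
    """Forward-warp an RGB-D image into a target camera (hypothetical helper).

    rgb: (H, W, C) array, depth: (H, W) array (0 = no measurement),
    K: 3x3 pinhole intrinsics, T: 4x4 source-to-target camera transform.
    Returns the warped colour image, warped depth, and a validity mask.
    """
    h, w = depth.shape
    c = rgb.shape[-1]
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)]).astype(float)
    z = depth.ravel().astype(float)

    # Back-project pixels to 3D points, then move them into the target frame.
    pts = (np.linalg.inv(K) @ pix) * z
    pts = T[:3, :3] @ pts + T[:3, 3:4]
    colors = rgb.reshape(-1, c)

    # Drop pixels without depth and points behind the target camera.
    keep = (z > 0) & (pts[2] > 1e-6)
    pts, colors = pts[:, keep], colors[keep]

    # Project into the target image and discard points outside the frame.
    proj = K @ pts
    u_t = np.rint(proj[0] / proj[2]).astype(int)
    v_t = np.rint(proj[1] / proj[2]).astype(int)
    inside = (u_t >= 0) & (u_t < w) & (v_t >= 0) & (v_t < h)
    flat = v_t[inside] * w + u_t[inside]
    z_t, colors = pts[2, inside], colors[inside]

    # Z-buffer: sort by pixel then depth, keep the nearest point per pixel.
    order = np.lexsort((z_t, flat))
    flat, z_t, colors = flat[order], z_t[order], colors[order]
    _, first = np.unique(flat, return_index=True)

    out_rgb = np.zeros((h * w, c), dtype=rgb.dtype)
    out_depth = np.zeros(h * w, dtype=float)
    out_rgb[flat[first]] = colors[first]
    out_depth[flat[first]] = z_t[first]
    out_depth = out_depth.reshape(h, w)
    return out_rgb.reshape(h, w, c), out_depth, out_depth > 0


# Toy usage: a flat wall 1 m away, with the scene shifted 0.5 m along x.
rgb = np.arange(4 * 5 * 3).reshape(4, 5, 3)
depth = np.ones((4, 5))
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 1.5], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.5
warped_rgb, warped_depth, mask = reproject_rgbd(rgb, depth, K, T)
```

In the toy usage the content moves one pixel to the right, so the leftmost column of `mask` is False. Holes of this kind are what a completion model would then need to fill.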
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
Su Wang
Chitwan Saharia
Shai Noy
Stefano Pellegrini
Sarah Laszlo
Mohammad Norouzi
Peter Anderson
William Chan
CVPR (2023)
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to the input text prompt, while consistent with the input image. We present Imagen Editor, a cascaded diffusion model, built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by incorporating object detectors for proposing inpainting masks during training. In addition, text-guided image inpainting captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
Less is More: Generating Grounded Navigation Instructions from Landmarks
Su Wang
Jordi Orbay
Vighnesh Birodkar
Izzeddin Gur
Peter Anderson
CVPR (2022) (to appear)
We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first-stage landmark detector and a second-stage generator -- a multimodal, multilingual, multitask encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8B images, we identify 1.1M English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 71% following MARKY-MT5's instructions, just shy of their 75% SR following human instructions -- and well above SRs with other generators. Evaluations on RxR's longer, diverse paths obtain 61-64% SRs on three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.
Preview abstract
Questions regarding implicitness, ambiguity and underspecification are crucial for multimodal image+text systems, but have received little attention to date. This paper maps out a conceptual framework to address this gap for systems which generate images from text inputs, specifically for systems which generate images depicting scenes from descriptions of those scenes. In doing so, we account for how texts and images convey different forms of meaning. We then outline a set of core challenges concerning textual and visual ambiguity and specificity tasks, as well as risks that may arise from improper handling of ambiguous and underspecified elements. We propose and discuss two strategies for addressing these challenges: a) generating a visually ambiguous output image, and b) generating a set of diverse output images.
Pathdreamer: A World Model for Indoor Navigation
Jing Yu Koh
Honglak Lee
Yinfei Yang
Peter Anderson
International Conference on Computer Vision (ICCV) (2021)
Text-to-Image Generation Grounded by Fine-Grained User Attention
Jing Yu Koh
Honglak Lee
Yinfei Yang
IEEE Winter Conference on Applications of Computer Vision (2021) (to appear)
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used to select and position masks to generate a fully covered segmentation canvas; the final image is produced by a segmentation-to-image generator using this canvas. This multi-step, retrieval-based approach outperforms existing direct text-to-image generation models on both automatic metrics and human evaluations: overall, its generated images are more photo-realistic and better match descriptions.
PanGEA: The Panoramic Graph Environment Annotation Toolkit
Peter Anderson
2nd Workshop on Advances in Language and Vision Research (ALVR) (2021)
PanGEA, the Panoramic Graph Environment Annotation toolkit, is a lightweight toolkit for collecting speech and text annotations in photo-realistic 3D environments. PanGEA immerses annotators in a web-based simulation and allows them to move around easily as they speak and/or listen. It includes database and cloud storage integration, plus utilities for automatically aligning recorded speech with manual transcriptions and the virtual pose of the annotators. Out of the box, PanGEA supports two tasks -- collecting navigation instructions and navigation instruction following -- and it could be easily adapted for annotating walking tours, finding and labeling landmarks or objects, and similar tasks. We share best practices learned from using PanGEA in a 20,000 hour annotation effort to collect the Room-Across-Room (RxR) dataset. We hope that our open-source annotation toolkit and insights will both expedite future data collection efforts and spur innovation on the kinds of grounded language tasks such environments can support.