Su Wang
I am a Software Engineer with Google AI Language. My research fields are Natural Language Processing (NLP) and Natural Language Generation (NLG), with a broad interest in topics related to Machine Learning (ML).
I graduated with a doctorate in Computational Linguistics from The University of Texas at Austin, advised by Katrin Erk and Greg Durrett. During my graduate studies I focused on Narrative Understanding and Text Generation, and specialized in probabilistic methods and neural network modeling approaches.
Currently I am with the Earthsea team, supervised by Jason Baldridge (jasonbaldridge@), working closely with Austin Waters (austinwaters@), Peter Anderson (pjand@), Ming Zhao (astroming@) and Alex Ku (alexku@) on Vision-Language Navigation (specifically on instruction generation for VLN).
Authored Publications
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Aishwarya Kamath
Peter Anderson
Jing Yu Koh
Yinfei Yang
Zarana Parekh
CVPR (2023)
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pre-training on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale training on near-human quality synthetic instructions.
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
Chitwan Saharia
Shai Noy
Stefano Pellegrini
Sarah Laszlo
Mohammad Norouzi
Peter Anderson
William Chan
CVPR (2023)
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to the input text prompt, while consistent with the input image. We present Imagen Editor, a cascaded diffusion model, built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by incorporating object detectors for proposing inpainting masks during training. In addition, text-guided image inpainting captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
Less is More: Generating Grounded Navigation Instructions from Landmarks
Jordi Orbay
Izzeddin Gur
Peter Anderson
CVPR (2022) (to appear)
We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first stage landmark detector and a second stage generator -- a multimodal, multilingual, multitask encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8b images, we identify 1.1m English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 71% following MARKY-MT5's instructions, just shy of their 75% SR following human instructions -- and well above SRs with other generators. Evaluations on RxR's longer, diverse paths obtain 61-64% SRs on three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.
On the Evaluation of Vision-and-Language Navigation Instructions
Ming Zhao
Peter Anderson
Vihan Jain
Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2021)