Gang Li

I'm generally interested in natural language processing and text-mining research.
Authored Publications
    We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
    Rich Human Feedback for Text to Image Generation
    Katherine Collins
    Nicholas Carolan
    Yang Li
    Youwei Liang
    Peizhao Li
    Dj Dvijotham
    Junfeng He
    Sarah Young
    Jiao Sun
    Arseniy Klimovskiy
    Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior work collected human-provided scores as feedback on generated images and trained a reward model to improve T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which keywords in the text prompt are not represented in the image. We collect such rich human feedback on 18K generated images and train a multimodal transformer to predict this rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants).
    Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, in the hope of bypassing the challenging task of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted with missing object descriptions or misaligned structure information. As a result, although the use of view hierarchies could offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that takes only the screenshot of the UI and a region of interest on the screen (the focus) as the input. This general architecture of Spotlight is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore the multi-task learning and few-shot prompting capacities of the proposed models, demonstrating promising results in the multi-task learning direction.
    Conversational agents show promise for allowing users to interact with mobile devices using language. However, to perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task, which is expensive and labor-intensive. Recently, pre-trained large language models (LLMs) have been shown capable of generalizing to various downstream tasks when prompted with a handful of examples from the target task. This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single LLM. We designed prompting techniques to adapt an LLM to mobile UIs. We experimented with four important modeling tasks that address various scenarios in conversational interaction. Our method achieved competitive performance on these challenging tasks without requiring dedicated datasets and training, offering a lightweight and generalizable approach to enable language-based mobile interaction.
    User interface design is a complex task that involves designers examining a wide range of options. We present Spacewalker, a tool that allows designers to rapidly search a large design space for an optimal web UI with integrated support. Designers first annotate each attribute they want to explore in a typical HTML page, using a simple markup extension we designed. Spacewalker then parses the annotated HTML specification, and intelligently generates and distributes various configurations of the web UI to crowd workers for evaluation. We enhanced a genetic algorithm to accommodate crowd worker responses from pairwise comparison of UI designs, which is crucial for obtaining reliable feedback. Based on our experiments, Spacewalker allows designers to effectively search a large design space of a UI, using the language they are familiar with, and improve their design rapidly at a minimal cost.
    Natural language descriptions of user interface (UI) elements such as alternative text are crucial for accessibility and language-based interaction in general. Yet, these descriptions are constantly missing in mobile UIs. We propose widget captioning, a novel task for automatically generating language descriptions for UI elements from multimodal input including both the image and the structural representations of user interfaces. We collected a large-scale dataset for widget captioning with crowdsourcing. Our dataset contains 162,859 language phrases created by human workers for annotating 61,285 UI elements across 21,750 unique UI screens. We thoroughly analyze the dataset, and train and evaluate a set of deep model configurations to investigate how each feature modality as well as the choice of learning strategies impact the quality of predicted captions. The task formulation and the dataset as well as our benchmark models contribute a solid basis for this novel multimodal captioning task that connects language and user interfaces.
    The Medical Scribe: Corpus Development and Model Performance Analyses
    Amanda Perry
    Ashley Robson Domin
    Chris Co
    Hagen Soltau
    Justin Stuart Paul
    Lauren Keyes
    Linh Tran
    Mark David Knichel
    Mingqiu Wang
    Nan Du
    Rayman Huang
    Proc. Language Resources and Evaluation, 2020
    There has been a growing interest in creating tools to assist clinical note generation from the audio of provider-patient encounters. Motivated by this goal and with the help of providers and experienced medical scribes, we developed an annotation scheme to extract relevant clinical concepts. Using this annotation scheme, a corpus of about 6k clinical encounters was labeled, which was used to train a state-of-the-art tagging model. We report model performance and a detailed analysis of the results.
    Human-centric Metric for Accelerating Pathology Reports Annotation
    Ruibin Ma
    Cameron Chen
    Angela Lin
    Krishna Kumar Gadepalli
    Yuannan Cai
    Pathology medical reports written by physicians contain useful class information such as the main organ type, disease type, etc. This class information can be used for large-scale statistical analysis or for labelling data in other modalities such as pathology slices (images). However, manual classification of a huge number of reports on multiple tasks is very inefficient. Moreover, the reports are very hard for non-professionals to read. In this paper, we investigate a general-purpose NLP model called BERT on multilabel text classification. We test it on five different classification tasks and achieve good discrimination. More importantly, we evaluate it in a practical setting by measuring how much human annotation labor can be saved and the performance on automatically classified cases.
    Named Entity Recognition (NER) has been mostly studied in the context of written text. Specifically, NER is an important step in de-identification (de-ID) of medical records, many of which are recorded conversations between a patient and a doctor. In such recordings, audio spans with personal information should be redacted, similar to the redaction of sensitive character spans in de-ID for written text. The application of NER in the context of audio de-identification has yet to be fully investigated. To this end, we define the task of audio de-ID, in which audio spans with entity mentions should be detected. We then present our pipeline for this task, which involves Automatic Speech Recognition (ASR), NER on the transcript text, and text-to-audio alignment. Finally, we introduce a novel metric for audio de-ID and a new evaluation benchmark consisting of a large labeled segment of the Switchboard and Fisher audio datasets and detail our pipeline's results on it.
    Motivated by the need to solve a real-world application, we propose a novel model for extracting relationships in tasks where the label space is large but can be factored and the training data is limited. The model tackles the problem in multiple stages but is trained end-to-end using curriculum learning. Each stage realizes simple intuitions for improving the model, and through ablation analysis we see the benefits of each stage. We evaluate our models on two tasks: extracting symptoms and extracting medications, along with their properties, from clinical conversations. While LSTM-based baselines achieve F1-scores of 0.08 and 0.35 for symptoms and medications respectively, our models achieve 0.56 and 0.43 respectively.