Peggy Chi

Peggy Chi is a Staff Research Scientist at Google, leading a team to conduct research and product launches. Her research focuses on developing interactive tools that support users’ creative activities, including video creation, storytelling, and programming. Peggy received her PhD in Computer Science from UC Berkeley and an MS from the MIT Media Lab. Her research has received a Best Paper Award at ACM CHI, a Google PhD Fellowship in Human-Computer Interaction, a Berkeley Fellowship for Graduate Study, and an MIT Media Lab Fellowship. She has published in top HCI venues and served on program committees, including CHI and UIST.
Authored Publications
    Experiencing Augmented Communication with Real-time Visuals using Large Language Models in Visual Captions
    Xingyu 'Bruce' Liu
    Vladimir Kirilyuk
    Xiuxiu Yuan
    Xiang ‘Anthony’ Chen
    Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), ACM (2023) (to appear)
    We demonstrate Visual Captions, a real-time system that integrates with a video conferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest visuals that are relevant to the context of the ongoing conversation. We implemented Visual Captions as a user-customizable Chrome plugin with three levels of AI proactivity: Auto-display (AI autonomously adds visuals), Auto-suggest (AI proactively recommends visuals), and On-demand-suggest (AI suggests visuals when prompted). We showcase the usage of Visual Captions in open-vocabulary settings, and how the addition of visuals based on the context of conversations could improve comprehension of complex or unfamiliar concepts. In addition, we demonstrate three ways people can interact with the system at different levels of AI proactivity. Visual Captions is open-sourced at https://github.com/google/archat.
    Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals
    Xingyu Bruce Liu
    Vladimir Kirilyuk
    Xiuxiu Yuan
    Xiang ‘Anthony’ Chen
    Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM, pp. 1-20
    Computer-mediated platforms are increasingly facilitating verbal communication, and capabilities such as live captioning and noise cancellation enable people to understand each other better. We envision that visual augmentations that leverage semantics in the spoken language could also be helpful to illustrate complex or unfamiliar concepts. To advance our understanding of the interest in such capabilities, we conducted formative research through remote interviews (N=10) and crowdsourced a dataset of 1500 sentence-visual pairs across a wide range of contexts. These insights informed Visual Captions, a real-time system that we integrated into a videoconferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest relevant visuals in open-vocabulary conversations. We report on our findings from a lab study (N=26) and a two-week deployment study (N=10), which demonstrate how Visual Captions has the potential to help people improve their communication through visual augmentation in various scenarios.
    Presentation slides commonly use visual patterns for structural navigation, such as titles, dividers, and build slides. However, screen readers do not capture such intention, making it time-consuming and less accessible for blind and visually impaired (BVI) users to linearly consume slides with repeated content. We present Slide Gestalt, an automatic approach that identifies the hierarchical structure in a slide deck. Slide Gestalt computes the visual and textual correspondences between slides to generate hierarchical groupings. Readers can navigate the slide deck from the higher-level section overview to the lower-level description of a slide group or individual elements interactively with our UI. We derived slide consumption and authoring practices from interviews with BVI readers and sighted creators and an analysis of 100 decks. We ran our pipeline on 50 real-world slide decks and a large dataset. Feedback from eight BVI participants showed that Slide Gestalt helped navigate a slide deck by anchoring content more efficiently, compared to using accessible slides.
    Synthesis-Assisted Video Prototyping From a Document
    Brian R. Colonna
    Christian Frueh
    UIST 2022: ACM Symposium on User Interface Software and Technology (2022)
    Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document, such as a web tutorial, it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain, programming cookbooks, we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.
    Non-linear video editing requires composing footage utilizing visual framing and temporal effects, which can be a time-consuming process. Often, editors borrow effects from existing creations and develop personal editing styles. In this paper, we propose an automatic approach that extracts editing styles in a source video and applies the edits to matched footage for video creation. Our computer-vision-based techniques detect framing, content type, playback speed, and lighting of each input video segment. By applying a combination of these features, we demonstrate an effective method that transfers the visual and temporal styles from professionally edited videos to unseen raw footage. Our experiments with real-world input videos received positive feedback from survey participants.
    Automatic Instructional Video Creation from a Markdown-Formatted Tutorial
    Nathan Frey
    UIST 2021: ACM Symposium on User Interface Software and Technology (2021)
    We introduce HowToCut, an automatic approach that converts a Markdown-formatted tutorial into an interactive video that presents the visual instructions with a synthesized voiceover for narration. HowToCut extracts instructional content from a multimedia document that describes a step-by-step procedure. Our method selects and converts text instructions to a voiceover. It makes automatic editing decisions to align the narration with edited visual assets, including step images, videos, and text overlays. We derive our video editing strategies from an analysis of 125 web tutorials and apply Computer Vision techniques to the assets. To enable viewers to interactively navigate the tutorial, HowToCut's conversational UI presents instructions in multiple formats upon user commands. We evaluated our automatically-generated video tutorials through user studies (N=20) and validated the video quality via an online survey (N=93). The evaluation shows that our method was able to effectively create informative and useful instructional videos from a web tutorial document for both reviewing and following.
    Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos
    Anh Truong
    Maneesh Agrawala
    CHI 2021: ACM Conference on Human Factors in Computing Systems (2021)
    We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate.
    Automatic Video Creation From a Web Page
    Zheng Sun
    UIST 2020: ACM Symposium on User Interface Software and Technology (2020)
    Creating marketing videos from scratch can be challenging, especially when designing for multiple platforms with different viewing criteria. We present URL2Video, an automatic approach that converts a web page into a short video given temporal and visual constraints. URL2Video captures quality materials and design styles extracted from a web page, including fonts, colors, and layouts. Using constraint programming, URL2Video's design engine organizes the visual assets into a sequence of shots and renders to a video with user-specified aspect ratio and duration. Creators can review the video composition, modify constraints, and generate video variations through a user interface. We learned the design process from designers and compared our automatically generated results with their creations through interviews and an online survey. The evaluation shows that URL2Video effectively extracted design elements from a web page and supported designers by bootstrapping the video creation process.
    Interactive Visual Description of a Web Page for Smart Speakers
    Conversational User Interfaces Workshop, the ACM CHI Conference on Human Factors in Computing Systems (2020)
    Smart speakers are becoming ubiquitous for accessing lightweight information using speech. While these devices are powerful for question answering and service operations using voice commands, it is challenging to navigate content of rich formats–including web pages–that are consumed by mainstream computing devices. We conducted a comparative study with 12 participants that suggests and motivates the use of a narrative voice output of a web page as being easier to follow and comprehend than a conventional screen reader. We are developing a tool that automatically narrates web documents based on their visual structures with interactive prompts. We discuss the design challenges for a conversational agent to intelligently select content for a more personalized experience, where we hope to contribute to the CUI workshop and form a discussion for future research.
    Crowdsourcing Images for Global Diversity
    Matthew Long
    Akshay Gaur
    Abhimanyu Kumar Deora
    Anurag Batra
    Daphne Luong
    MobileHCI 2019: The 21st International Conference on Human Computer Interaction with Mobile Devices and Services (2019)
    Crowdsourcing enables human workers to perform designated tasks unbounded by time and location. As mobile devices and embedded cameras have become widely available, we deployed an image capture task globally for more geographically diverse images. Via our micro-crowdsourcing mobile application, users capture images of surrounding subjects, tag them with keywords, and can choose to open source their work. We open-sourced 478,000 images collected from worldwide users as a dataset “Open Images Extended” that aims to add global diversity to imagery training data. We describe our approach and workers’ feedback through survey responses from 171 global contributors to this task.
    Doppio: Tracking UI Flows and Code Changes for App Development
    Senpo Hu
    CHI 2018: ACM Conference on Human Factors in Computing Systems
    Developing interactive systems often involves a large set of callback functions for handling user interaction, which makes it challenging to manage UI behaviors, create descriptive documentation, and track revisions. We developed Doppio, a tool that automatically tracks and visualizes UI flows and their changes based on source code elements and their revisions. For each input event listener of a widget, e.g., onClick of an Android View class, Doppio captures and associates its UI output from an execution of the program with its code snippet from the source code. It automatically generates a screenflow diagram that is organized by the callback methods and interaction flow, where developers can review the code and UI revisions interactively. Doppio, implemented as an IDE plugin, is seamlessly integrated into a common development workflow. Our experiments show that Doppio was able to generate quality visual documentation and helped participants understand unfamiliar source code and track changes.
    Mobile Crowdsourcing in the Wild: Challenges from a Global Community
    Anurag Batra
    Maxwell Douglas Hsu
    MobileHCI 2018: The 20th International Conference on Human Computer Interaction with Mobile Devices and Services, ACM Press (2018)
    Recent research has been devoted to mobile applications that encourage users to complete microtasks in various contexts, known as "mobile crowdsourcing." In this case study, we present our ongoing effort of a publicly-available mobile application, Crowdsource, that has over 540,000 global users from 200 countries or regions. Over 15 million sessions have been performed since the first launch in August 2016. We analyze 337 responses from our active users across 27 countries and validate the feedback with a set of usability studies. Our findings suggest design considerations for crowdsourcing microtasks with mobile users at the global scale.
    Authoring Illustrations of Human Movements by Iterative Physical Demonstration
    Daniel Vogel
    Mira Dontcheva
    Wilmot Li
    Björn Hartmann
    UIST 2016: ACM Symposium on User Interface Software and Technology (2016)
    Illustrations of human movements are used to communicate ideas and convey instructions in many domains, but creating them is time-consuming and requires skill. We introduce DemoDraw, a multi-modal approach to generate these illustrations as the user physically demonstrates the movements. In a Demonstration Interface, DemoDraw segments speech and 3D joint motion into a sequence of motion segments, each characterized by a key pose and salient joint trajectories. Based on this sequence, a series of illustrations is automatically generated using a stylistically rendered 3D avatar annotated with arrows to convey movements. During demonstration, the user can navigate using speech and amend or re-perform motions if needed. Once a suitable sequence of steps has been created, a Refinement Interface enables fine control of visualization parameters. In a three-part evaluation, we validate the effectiveness of the generated illustrations and the usability of DemoDraw. Our results show 4 to 7-step illustrations can be created in 5 or 10 minutes on average.
    Enhancing Cross-Device Interaction Scripting with Interactive Illustrations
    Björn Hartmann
    CHI 2016: ACM Conference on Human Factors in Computing Systems
    Cross-device interactions involve input and output on multiple computing devices. Implementing and reasoning about interactions that cover multiple devices with a diversity of form factors and capabilities can be complex. To assist developers in programming cross-device interactions, we created DemoScript, a technique that automatically analyzes a cross-device interaction program while it is being written. DemoScript visually illustrates the step-by-step execution of a selected portion or the entire program with a novel, automatically generated cross-device storyboard visualization. In addition to helping developers understand the behavior of the program, DemoScript also allows developers to revise their program by interactively manipulating the cross-device storyboard. We evaluated DemoScript with 8 professional programmers and found that DemoScript significantly improved development efficiency by helping developers interpret and manage cross-device interaction; it also encouraged developers to test and think through their scripts during development.
    Weave: Scripting Cross-Device Wearable Interaction
    CHI 2015: ACM Conference on Human Factors in Computing Systems, ACM, pp. 3923-3932
    Weave provides a set of high-level APIs, based on JavaScript, and integrated tool support for developers to easily distribute UI output and combine user input and sensing events across devices for cross-device interaction.