Peggy Chi

Peggy Chi

Peggy Chi is a Staff Research Scientist at Google, leading a team to conduct research and product launches. Her research focuses on developing interactive tools that support users’ creativity activities, including video creation, storytelling, and programming. Peggy received her PhD in Computer Science from UC Berkeley and an MS from the MIT Media Lab. Her research has received a Best Paper Award at ACM CHI, a Google PhD Fellowship in Human-Computer Interaction, a Berkeley Fellowship for Graduate Study, and an MIT Media Lab Fellowship. She has published in top HCI venues and served on program committees, including CHI and UIST. Visit Peggy's personal website.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Experiencing Augmented Communication with Real-time Visuals using Large Language Models in Visual Captions
    Xingyu 'Bruce' Liu
    Vladimir Kirilyuk
    Xiuxiu Yuan
    Xiang ‘Anthony’ Chen
    Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), ACM(2023) (to appear)
    Preview abstract We demonstrate Visual Captions, a real-time system that integrates with a video conferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest visuals that are relevant to the context of the ongoing conversation. We implemented Visual Captions as a user-customizable Chrome plugin with three levels of AI proactivity: Auto-display (AI autonomously adds visuals), Auto-suggest (AI proactively recommends visuals), and On-demand-suggest (AI suggests visuals when prompted). We showcase the usage of Visual Captions in open-vocabulary settings, and how the addition of visuals based on the context of conversations could improve comprehension of complex or unfamiliar concepts. In addition, we demonstrate three approaches people can interact with the system with different levels of AI proactivity. Visual Captions is open-sourced at https://github.com/google/archat. View details
    Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals
    Xingyu Bruce Liu
    Vladimir Kirilyuk
    Xiuxiu Yuan
    Xiang ‘Anthony’ Chen
    Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM, pp. 1-20
    Preview abstract Computer-mediated platforms are increasingly facilitating verbal communication, and capabilities such as live captioning and noise cancellation enable people to understand each other better. We envision that visual augmentations that leverage semantics in the spoken language could also be helpful to illustrate complex or unfamiliar concepts. To advance our understanding of the interest in such capabilities, we conducted formative research through remote interviews (N=10) and crowdsourced a dataset of 1500 sentence-visual pairs across a wide range of contexts. These insights informed Visual Captions, a real-time system that we integrated into a videoconferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest relevant visuals in open-vocabulary conversations. We report on our findings from a lab study (N=26) and a two-week deployment study (N=10), which demonstrate how Visual Captions has the potential to help people improve their communication through visual augmentation in various scenarios. View details
    Preview abstract Presentation slides commonly use visual patterns for structural navigation, such as titles, dividers, and build slides. However, screen readers do not capture such intention, making it time-consuming and less accessible for blind and visually impaired (BVI) users to linearly consume slides with repeated content. We present Slide Gestalt, an automatic approach that identifies the hierarchical structure in a slide deck. Slide Gestalt computes the visual and textual correspondences between slides to generate hierarchical groupings. Readers can navigate the slide deck from the higher-level section overview to the lower-level description of a slide group or individual elements interactively with our UI. We derived side consumption and authoring practices from interviews with BVI readers and sighted creators and an analysis of 100 decks. We performed our pipeline with 50 real-world slide decks and a large dataset. Feedback from eight BVI participants showed that Slide Gestalt helped navigate a slide deck by anchoring content more efficiently, compared to using accessible slides. View details
    Synthesis-Assisted Video Prototyping From a Document
    Brian R. Colonna
    Christian Frueh
    UIST 2022: ACM Symposium on User Interface Software and Technology(2022)
    Preview abstract Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document -- such as a web tutorial, it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain -- programming cookbooks, we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video. View details
    Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos
    Anh Truong
    Maneesh Agrawala
    CHI 2021: ACM Conference on Human Factors in Computing Systems(2021)
    Preview abstract We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate. View details
    Preview abstract Non-linear video editing requires composing footage utilizing visual framing and temporal effects, which can be a time-consuming process. Often, editors borrow effects from existing creation and develop personal editing styles. In this paper, we propose an automatic approach that extracts editing styles in a source video and applies the edits to matched footage for video creation. Our Computer Vision based techniques detects framing, content type, playback speed, and lighting of each input video segment. By applying a combination of these features, we demonstrate an effective method that transfers the visual and temporal styles from professionally edited videos to unseen raw footage. Our experiments with real-world input videos received positive feedback from survey participants. View details
    Automatic Instructional Video Creation from a Markdown-Formatted Tutorial
    Nathan Frey
    UIST 2021: ACM Symposium on User Interface Software and Technology(2021)
    Preview abstract We introduce HowToCut, an automatic approach that converts a Markdown-formatted tutorial into an interactive video that presents the visual instructions with a synthesized voiceover for narration. HowToCut extracts instructional content from a multimedia document that describes a step-by-step procedure. Our method selects and converts text instructions to a voiceover. It makes automatic editing decisions to align the narration with edited visual assets, including step images, videos, and text overlays. We derive our video editing strategies from an analysis of 125 web tutorials and apply Computer Vision techniques to the assets. To enable viewers to interactively navigate the tutorial, HowToCut's conversational UI presents instructions in multiple formats upon user commands. We evaluated our automatically-generated video tutorials through user studies (N=20) and validated the video quality via an online survey (N=93). The evaluation shows that our method was able to effectively create informative and useful instructional videos from a web tutorial document for both reviewing and following. View details
    Automatic Video Creation From a Web Page
    Zheng Sun
    UIST 2020: ACM Symposium on User Interface Software and Technology(2020)
    Preview abstract Creating marketing videos from scratch can be challenging, especially when designing for multiple platforms with different viewing criteria. We present URL2Video, an automatic approach that converts a web page into a short video given temporal and visual constraints. URL2Video captures quality materials and design styles extracted from a web page, including fonts, colors, and layouts. Using constraint programming, URL2Video's design engine organizes the visual assets into a sequence of shots and renders to a video with user-specified aspect ratio and duration. Creators can review the video composition, modify constraints, and generate video variation through a user interface. We learned the design process from designers and compared our automatically generated results with their creation through interviews and an online survey. The evaluation shows that URL2Video effectively extracted design elements from a web page and supported designers by bootstrapping the video creation process. View details
    Interactive Visual Description of a Web Page for Smart Speakers
    Conversational User Interfaces Workshop, the ACM CHI Conference on Human Factors in Computing Systems(2020)
    Preview abstract Smart speakers are becoming ubiquitous for accessing lightweight information using speech. While these devices are powerful for question answering and service operations using voice commands, it is challenging to navigate content of rich formats–including web pages–that are consumed by mainstream computing devices. We conducted a comparative study with 12 participants that suggests and motivates the use of a narrative voice output of a web page as being easier to follow and comprehend than a conventional screen reader. We are developing a tool that automatically narrates web documents based on their visual structures with interactive prompts. We discuss the design challenges for a conversational agent to intelligently select content for a more personalized experience, where we hope to contribute to the CUI workshop and form a discussion for future research. View details
    Crowdsourcing Images for Global Diversity
    Matthew Long
    Akshay Gaur
    Abhimanyu Kumar Deora
    Anurag Batra
    Daphne Luong
    MobileHCI 2019: The 21st International Conference on Human Computer Interaction with Mobile Devices and Services(2019)
    Preview abstract Crowdsourcing enables human workers to perform designated tasks unbounded by time and location. As mobile devices and embedded cameras have become widely available, we deployed an image capture task globally for more geographically diverse images. Via our micro-crowdsourcing mobile application, users capture images of surrounding subjects, tag with keywords, and can choose to open source their work. We open-sourced 478,000 images collected from worldwide users as a dataset “Open Images Extended” that aims to add global diversity to imagery training data. We describe our approach and workers’ feedback through survey responses from 171 global contributors to this task. View details