Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals

Xingyu Bruce Liu; Vladimir Kirilyuk; Xiuxiu Yuan; Alex Olwal; Peggy Chi; Xiang ‘Anthony’ Chen; Ruofei Du

Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM, pp. 1-20

Abstract

Computer-mediated platforms are increasingly facilitating verbal communication, and capabilities such as live captioning and noise cancellation enable people to understand each other better. We envision that visual augmentations that leverage semantics in the spoken language could also be helpful to illustrate complex or unfamiliar concepts. To advance our understanding of the interest in such capabilities, we conducted formative research through remote interviews (N=10) and crowdsourced a dataset of 1500 sentence-visual pairs across a wide range of contexts.

These insights informed Visual Captions, a real-time system that we integrated into a videoconferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest relevant visuals in open-vocabulary conversations. We report on our findings from a lab study (N=26) and a two-week deployment study (N=10), which demonstrate how Visual Captions has the potential to help people improve their communication through visual augmentation in various scenarios.
