Experiencing Augmented Communication with Real-time Visuals using Large Language Models in Visual Captions

Xingyu 'Bruce' Liu; Vladimir Kirilyuk; Xiuxiu Yuan; Peggy Chi; Alex Olwal; Xiang ‘Anthony’ Chen; Ruofei Du

Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), ACM (2023) (to appear)

Abstract

We demonstrate Visual Captions, a real-time system that integrates with a video conferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest visuals that are relevant to the context of the ongoing conversation. We implemented Visual Captions as a user-customizable Chrome plugin with three levels of AI proactivity: Auto-display (AI autonomously adds visuals), Auto-suggest (AI proactively recommends visuals), and On-demand-suggest (AI suggests visuals when prompted). We showcase the use of Visual Captions in open-vocabulary settings and show how adding visuals based on the context of a conversation could improve comprehension of complex or unfamiliar concepts. In addition, we demonstrate three ways people can interact with the system, one for each level of AI proactivity. Visual Captions is open-sourced at https://github.com/google/archat.
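
As a rough illustration of the three proactivity levels named in the abstract, the following TypeScript sketch maps each level to what happens when the model proposes a visual. All names here (`ProactivityLevel`, `Action`, `decideAction`) are hypothetical and chosen for illustration; they are not taken from the Visual Captions codebase, and the actual plugin may structure this logic differently.

```typescript
// Hypothetical sketch: none of these identifiers come from the Visual Captions source.

// The three levels of AI proactivity described in the abstract.
type ProactivityLevel = "auto-display" | "auto-suggest" | "on-demand-suggest";

// What the plugin does with a visual the model has proposed:
//   "display"   - add the visual to the call without asking
//   "recommend" - offer the visual to the speaker, who decides whether to show it
//   "hold"      - do nothing until the user asks for suggestions
type Action = "display" | "recommend" | "hold";

function decideAction(level: ProactivityLevel, userPrompted: boolean): Action {
  switch (level) {
    case "auto-display":
      // AI autonomously adds visuals.
      return "display";
    case "auto-suggest":
      // AI proactively recommends visuals.
      return "recommend";
    case "on-demand-suggest":
      // AI suggests visuals only when prompted by the user.
      return userPrompted ? "recommend" : "hold";
  }
}
```

In this sketch the only difference between Auto-suggest and On-demand-suggest is whether a recommendation waits for an explicit user prompt; Auto-display skips the recommendation step entirely.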
