Google Research

Experiencing Augmented Communication with Real-time Visuals using Large Language Models in Visual Captions

Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), ACM (2023) (to appear)


We demonstrate Visual Captions, a real-time system that integrates with a video conferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest visuals that are relevant to the context of the ongoing conversation. We implemented Visual Captions as a user-customizable Chrome plugin with three levels of AI proactivity: Auto-display (AI autonomously adds visuals), Auto-suggest (AI proactively recommends visuals), and On-demand-suggest (AI suggests visuals when prompted). We showcase the usage of Visual Captions in open-vocabulary settings, and how the addition of visuals based on the context of conversations could improve comprehension of complex or unfamiliar concepts. In addition, we demonstrate three approaches people can interact with the system with different levels of AI proactivity. Visual Captions is open-sourced at

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work