Abstract
Large Multimodal Models (LMMs) such as OpenAI's GPT-4o and Google's Gemini, introduced in 2024, process multiple modalities and enable significant advances in multimodal interaction. Inspired by frameworks for self-driving cars and AGI, this paper proposes "Levels of Multimodal Interaction" to guide research and development. The four levels are: basic multimodality (Level 0), single modalities used in turn-taking; combined multimodality (Level 1), fused interpretation of multiple modalities; humanlike interaction (Level 2), natural interaction flow with additional communication signals; and beyond-humanlike interaction (Level 3), surpassing human capabilities and incorporating underlying hidden signals, with the potential for transformational human-AI integration. LMMs have progressed from Level 0 to Level 1, with Level 2 next.
Level 3 sets a speculative target that multimodal interaction research could help achieve, where interaction becomes more natural and ultimately surpasses human capabilities. Such Level 3 multimodal interaction could eventually enable deeper human-AI integration and transform human performance. This anticipated shift, in turn, raises important considerations, particularly around the safety, agency, and control of AI systems.