Levels of Multimodal Interaction
Abstract
Large Multimodal Models (LMMs) such as OpenAI's GPT-4o and Google's Gemini, introduced in 2024, process multiple modalities, enabling significant advances in multimodal interaction. Inspired by frameworks for self-driving cars and AGI, this paper proposes "Levels of Multimodal Interaction" to guide research and development. The four levels are: basic multimodality (Level 0), where single modalities are used in turn-taking; combined multimodality (Level 1), where multiple modalities are fused into a single interpretation; humanlike interaction (Level 2), with natural interaction flow and additional communication signals; and beyond humanlike (Level 3), surpassing human capabilities and incorporating underlying hidden signals, with the potential for transformational human-AI integration. LMMs have progressed from Level 0 to Level 1, with Level 2 next.
Level 3 sets a speculative target that multimodal interaction research could help achieve, where interaction becomes more natural and ultimately surpasses human capabilities. Eventually, such Level 3 multimodal interaction could lead to greater human-AI integration and transform human performance. This anticipated shift, in turn, raises important considerations, particularly around safety, agency, and control of AI systems.