Alex Olwal
I am a Tech Lead/Manager in Google’s Augmented Reality team and a founder of the Interaction Lab. I direct research and development of interaction technologies based on advancements in display technology, low-power and high-speed sensing, wearables, actuation, electronic textiles, and human-computer interaction. I am passionate about accelerating innovation through tools, techniques, and devices that augment and empower human abilities. My research interests include augmented reality, ubiquitous computing, mobile devices, 3D user interfaces, interaction techniques, interfaces for accessibility and health, medical imaging, and software/hardware prototyping.
Google I/O 2022 Keynote: Augmented Language
Our Augmented Language project was featured in the I/O 2022 Keynote.
"Let's see what happens when we take our advances in translation and transcription, and deliver them in your line-of-sight." (Sundar Pichai, CEO)
· 2020-Now Augmented Reality
· 2018-2020 Google AI: Research & Machine Intelligence
· 2017-2018 ATAP (Advanced Technology and Projects)
· 2016-2017 Wearables, Augmented and Virtual Reality
· 2015-2016 Project Aura, Glass and Beyond
· 2014-2015 Google X
My work builds on my experience from research labs and institutions, including the MIT Media Lab, Columbia University, University of California, Santa Barbara, KTH (Royal Institute of Technology), and Microsoft Research. I have taught at Stanford University, Rhode Island School of Design, and KTH.
Portfolio: olwal.com
Authored Publications
ChatDirector: Enhancing Video Conferencing with Space-Aware Scene Rendering and Speech-Driven Layout Transition
Brian Moreno Collins
Karthik Ramani
Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 16 (to appear)
Abstract
Remote video conferencing systems (RVCS) are widely adopted in personal and professional communication. However, they often lack the co-presence experience of in-person meetings. This is largely due to the absence of intuitive visual cues and clear spatial relationships among remote participants, which can lead to speech interruptions and loss of attention. This paper presents ChatDirector, a novel RVCS that overcomes these limitations by incorporating space-aware visual presence and speech-aware attention transition assistance. ChatDirector employs a real-time pipeline that converts participants' RGB video streams into 3D portrait avatars and renders them in a virtual 3D scene. We also contribute a decision tree algorithm that directs the avatar layouts and behaviors based on participants' speech states. We report on results from a user study (N=16) where we evaluated ChatDirector. The satisfactory algorithm performance and complimentary subjective user feedback suggest that ChatDirector significantly enhances communication efficacy and user engagement.
UI Mobility Control in XR: Switching UI Positionings between Static, Dynamic, and Self Entities
Siyou Pei
Yang Zhang
Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 12 (to appear)
Abstract
Extended reality (XR) has the potential for seamless user interface (UI) transitions across people, objects, and environments. However, the design space, applications, and common practices of 3D UI transitions remain underexplored. To address this gap, we conducted a need-finding study with 11 participants, identifying and distilling a taxonomy based on three types of UI placements: affixed to static, dynamic, or self entities. We further surveyed 113 commercial applications to understand the common practices of 3D UI mobility control, where only 6.2% of these applications allowed users to transition UI between entities. In response, we built interaction prototypes to facilitate UI transitions between entities. We report on results from a qualitative user study (N=14) on 3D UI mobility control using our FingerSwitches technique, which suggests that perceived usefulness is affected by types of entities and environments. We aspire to tackle a vital need in UI mobility within XR.
Levels of Multimodal Interaction
Chinmay Kulkarni
ICMI Companion '24: Companion Proceedings of the 26th International Conference on Multimodal Interaction (2024)
Abstract
Large Multimodal Models (LMMs) like OpenAI's GPT4o and Google's Gemini, introduced in 2024, process multiple modalities, enabling significant advances in multimodal interaction. Inspired by frameworks for self-driving cars and AGI, this paper proposes "Levels of Multimodal Interaction" to guide research and development. The four levels are: basic multimodality (0), single modalities in turn-taking; combined multimodality (1), fused interpretation of multiple modalities; humanlike (2), natural interaction flow with additional communication signals; and beyond humanlike (3), surpassing human capabilities and including underlying hidden signals, with the potential for transformational human-AI integration. LMMs have progressed from Level 0 to 1, with Level 2 next.
Level 3 sets a speculative target that multimodal interaction research could help achieve, where interaction becomes more natural and ultimately surpasses human capabilities. Eventually, such Level 3 multimodal interaction could lead to greater human-AI integration and transform human performance. This anticipated shift, in turn, introduces considerations, particularly around safety, agency and control of AI systems.
Experiencing InstructPipe: Building Multi-modal AI Pipelines via Prompting LLMs and Visual Programming
Zhongyi Zhou
Jing Jin
Xiuxiu Yuan
Jun Jiang
Jingtao Zhou
Yiyi Huang
Kristen Wright
Jason Mayes
Mark Sherwood
Ram Iyengar
Na Li
Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 5
Abstract
Foundational multimodal models have democratized AI access, yet the construction of complex, customizable machine learning pipelines by novice users remains a grand challenge. This paper demonstrates a visual programming system that allows novices to rapidly prototype multimodal AI pipelines. We first conducted a formative study with 58 contributors and collected 236 proposals of multimodal AI pipelines that served various practical needs. We then distilled our findings into a design matrix of primitive nodes for prototyping multimodal AI visual programming pipelines, and implemented a system with 65 nodes. To support users' rapid prototyping experience, we built InstructPipe, an AI assistant based on large language models (LLMs) that allows users to generate a pipeline by writing text-based instructions. We believe InstructPipe enhances novice users' onboarding experience with visual programming and the controllability of LLMs by offering non-experts a platform to easily update the generation.
Experiencing Thing2Reality: Transforming 2D Content into Conditioned Multiviews and 3D Gaussian Objects for XR Communication
Erzhen Hu
Mingyi Li
Seongkook Heo
Adjunct Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, ACM (2024)
Abstract
During remote communication, participants share both digital and physical content, such as product designs, digital assets, and environments, to enhance mutual understanding. Recent advances in augmented communication have facilitated users to swiftly create and share digital 2D copies of physical objects from video feeds into a shared space. However, the conventional 2D representation of digital objects restricts users’ ability to spatially reference items in a shared immersive environment. To address these challenges, we propose Thing2Reality, an Extended Reality (XR) communication platform designed to enhance spontaneous discussions regarding both digital and physical items during remote sessions. With Thing2Reality, users can quickly materialize ideas or physical objects in immersive environments and share them as conditioned multiview renderings or 3D Gaussians. Our system enables users to interact with remote objects or discuss concepts in a collaborative manner.
Experiencing Visual Blocks for ML: Visual Prototyping of AI Pipelines
Na Li
Jing Jin
Michelle Carney
Jun Jiang
Xiuxiu Yuan
Kristen Wright
Mark Sherwood
Jason Mayes
Lin Chen
Jingtao Zhou
Zhongyi Zhou
Ping Yu
Ram Iyengar
ACM (2023) (to appear)
Abstract
We demonstrate Visual Blocks for ML, a visual programming platform that facilitates rapid prototyping of ML-based multimedia applications. As the public version of Rapsai, it further integrates large language models and custom APIs into the platform. In this demonstration, we will showcase how to build interactive AI pipelines in a few drag-and-drops, how to perform interactive data augmentation, and how to integrate pipelines into Colabs. In addition, we demonstrate a wide range of community-contributed pipelines in Visual Blocks for ML, covering various aspects including interactive graphics, chains of large language models, computer vision, and multi-modal applications. Finally, we encourage students, designers, and ML practitioners to contribute ML pipelines through https://github.com/google/visualblocks/tree/main/pipelines to inspire creative use cases. Visual Blocks for ML is available at http://visualblocks.withgoogle.com.
InstructPipe: Building Visual Programming Pipelines with Human Instructions
Zhongyi Zhou
Jing Jin
Xiuxiu Yuan
Jun Jiang
Jingtao Zhou
Yiyi Huang
Kristen Wright
Jason Mayes
Mark Sherwood
Ram Iyengar
Na Li
arXiv, 2312.09672 (2023)
Abstract
Visual programming provides beginner-level programmers with a coding-free experience to build their customized pipelines. Existing systems require users to build a pipeline entirely from scratch, implying that novice users need to set up and link appropriate nodes all by themselves, starting from a blank workspace. We present InstructPipe, an AI assistant that enables users to start prototyping machine learning (ML) pipelines with text instructions. We designed two LLM modules and a code interpreter to execute our solution. LLM modules generate pseudocode of a target pipeline, and the interpreter renders a pipeline in the node-graph editor for further human-AI collaboration. Technical evaluations reveal that InstructPipe reduces user interactions by 81.1% compared to traditional methods. Our user study (N=16) showed that InstructPipe empowers novice users to streamline their workflow in creating desired ML pipelines, reduce their learning curve, and spark innovative ideas with open-ended commands.
Modeling and Improving Text Stability in Live Captions
Xingyu "Bruce" Liu
Jun Zhang
Leonardo Ferrer
Susan Xu
Vikas Bahirwani
Boris Smus
Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM, 208:1-9
Experiencing Augmented Communication with Real-time Visuals using Large Language Models in Visual Captions
Xingyu 'Bruce' Liu
Vladimir Kirilyuk
Xiuxiu Yuan
Xiang ‘Anthony’ Chen
Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), ACM (2023) (to appear)
Experiencing Rapid Prototyping of Machine Learning Based Multimedia Applications in Rapsai
Na Li
Jing Jin
Michelle Carney
Xiuxiu Yuan
Ping Yu
Ram Iyengar
CHI EA '23: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, ACM, 448:1-4
Abstract
We demonstrate Rapsai, a visual programming platform that aims to streamline the rapid and iterative development of end-to-end machine learning (ML)-based multimedia applications. Rapsai features a node-graph editor that enables interactive characterization and visualization of ML model performance, which facilitates the understanding of how the model behaves in different scenarios. Moreover, the platform streamlines end-to-end prototyping by providing interactive data augmentation and model comparison capabilities within a no-coding environment. Our demonstration showcases the versatility of Rapsai through several use cases, including virtual background, visual effects with depth estimation, and audio denoising. The implementation of Rapsai is intended to support ML practitioners in streamlining their workflow, making data-driven decisions, and comprehensively evaluating model behavior with real-world input.