
Augmented object intelligence with XR-Objects

October 1, 2024

M. Doga Dogan, Student Researcher, and David Kim, Perceptive Experiences Lead, Google Augmented Reality

XR-Objects is an innovative augmented reality research prototype system that transforms physical objects into interactive digital portals using real-time object segmentation and multimodal large language models (MLLMs).

Recent advancements in spatial computing and artificial intelligence (AI) have paved the way for more immersive and interactive experiences in extended reality (XR). Yet integrating physical objects into digital environments remains challenging: the real world is typically treated as a passive backdrop in XR, while digital objects hold most of the interactive capabilities. What if physical objects could be seamlessly merged with digital entities in real time? What if, just as easily as we right-click a file on a computer, we could tap a real object to get relevant information and take digital actions?

In “Augmented Object Intelligence with XR-Objects”, presented at 2024 ACM UIST, we introduce Augmented Object Intelligence (AOI), a new interaction paradigm that aims to make physical objects digitally interactive in XR environments. AOI enables real-world objects to serve as portals to digital functionalities, without requiring pre-registration or manual setup. By leveraging real-time object segmentation and classification, coupled with multimodal large language models (MLLMs), AOI transforms everyday objects into interactive tools within XR.

We showcase this concept in the form of XR-Objects, an open-source prototype system that allows users to engage with physical environments through context-aware object-based menus. For example, a user can select a physical object like a pot of pasta and instantly access relevant information, like cooking times, or set a timer visually anchored to the pot in 3D space. Here, analog objects not only convey information but can also trigger digital actions. Our work highlights the versatility of XR-Objects across a variety of use cases and presents a user study demonstrating the system’s efficacy.


When a user interacts with an object, XR-Objects presents generated context menus that display related information, allow comparisons with other objects, or spatially anchor interactive UI widgets to objects in 3D space.

XR-Objects implementation

XR-Objects leverages recent developments in spatial understanding to implement augmented reality (AR) interactions with semantic depth. We use a COCO-trained object detector, via MediaPipe, for object segmentation and classification, and simultaneous localization and mapping (SLAM), provided through ARCore, to track and localize objects in 3D space.

Many consumer XR headsets restrict access to real-time camera streams. To allow us and the research community to freely explore AR scenarios relying on object detection without these limitations, we developed a mobile prototype. Modern smartphones have ARM-based chipsets similar to XR headsets, offering comparable performance for real-time computer vision tasks. With ARCore, our app can localize objects and overlay digital information onto the physical world using a phone's high-resolution camera. We also integrate the PaLI MLLM and a speech recognizer to further enhance our ability to automate the recognition and retrieval of specific object information within XR spaces. By integrating voice and visual inputs, our novel system offers a seamless and familiar interface for users to engage with their surroundings.

Design considerations

Our design focused on enhancing user interaction and system performance by seamlessly integrating digital functionalities with physical objects.

Object-Centric vs. App-Centric Interaction: Traditional AR systems use app-centric models requiring users to navigate through apps to access digital functions. XR-Objects takes an object-centric approach that allows direct interaction with objects to create a more natural, immersive experience, effectively giving the physical world a digital interface. Although the current prototype is an app, we aim for future native integration, similar to built-in QR scanning on smartphones.

World-Space vs. Screen-Space UI: We adopted a world-space UI, where digital elements are anchored to physical objects, maintaining spatial consistency. This helps minimize cognitive load and preserves the immersive nature of XR interactions, unlike screen-space UIs that detach digital content from the physical context.

Signaling XR-Objects: We use semi-transparent "bubbles" to indicate interactable objects, reducing visual clutter and guiding users through a clean and intuitive interface.

Fixed Categories and Actions: A radial menu provides a limited number of top-level actions to enhance quick decision-making and reduce decision fatigue.

Categories of actions

AOI facilitates fluid user interactions with single or multiple objects and enables various digital actions, such as querying real-time information, asking questions, sharing objects, or adding spatial notes. Inspired by sub-menus in traditional desktop computing, we categorized seven actions into four categories:

  1. Information: provide an overview; ask a question
  2. Compare: ask to compare multiple objects within the view
  3. Share: send object to a contact; add to shopping list
  4. Anchor: notes; timer; countdown

Information and Compare categories represent traditional visual question answering (VQA) tasks, while Share and Anchor categories represent traditional widget tasks.

System architecture

The implementation of XR-Objects involves four steps: (1) detecting objects, (2) localizing and anchoring onto objects, (3) coupling each object with an MLLM for metadata retrieval, and (4) executing actions and displaying the output in response to user input. We use Unity and its AR Foundation to bring these together to build a system that augments real-world objects with functional context menus.

Object detection: XR-Objects uses an object detection module powered by MediaPipe that leverages a mobile-optimized convolutional neural network for real-time classification. The system detects objects, assigns them class labels (e.g., “bottle,” “monitor”), and generates 2D bounding boxes that serve as spatial anchors for AR content. It recognizes the 80 object classes of the COCO dataset. To prioritize privacy and data efficiency, only relevant object regions are processed, excluding, for example, people detected in a scene.
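The prototype runs this detector on-device inside Unity, but the same COCO-trained detection step can be approximated with MediaPipe's Python Tasks API. The sketch below is illustrative only; the model file name, score threshold, and result count are assumptions, not the prototype's exact configuration.

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Load a mobile-optimized, COCO-trained detector (e.g., an EfficientDet-Lite
# .tflite bundle from the MediaPipe model page). Illustrative settings.
options = vision.ObjectDetectorOptions(
    base_options=python.BaseOptions(model_asset_path="efficientdet_lite0.tflite"),
    score_threshold=0.5,
    max_results=5,
)
detector = vision.ObjectDetector.create_from_options(options)

# Run detection on a single camera frame.
frame = mp.Image.create_from_file("frame.jpg")
result = detector.detect(frame)

for detection in result.detections:
    category = detection.categories[0]  # COCO class label, e.g., "bottle"
    box = detection.bounding_box        # origin_x, origin_y, width, height (pixels)
    print(category.category_name, round(category.score, 2),
          (box.origin_x, box.origin_y, box.width, box.height))
```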

Localization and anchoring: Once an object is detected, XR-Objects anchors AR menus using 2D bounding boxes and depth data, converting them into precise 3D coordinates via raycasting. A semi-transparent "bubble" signals interactables, and the full menu appears only when tapped, reducing visual clutter. Safeguards ensure accurate placement without duplication.
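The prototype resolves these anchors by raycasting against ARCore's depth data inside Unity. As a minimal stand-in for that geometry, the sketch below back-projects the center of a detected 2D bounding box into camera-space 3D coordinates under a pinhole camera model; the bounding box, depth value, and camera intrinsics are made-up illustrative numbers.

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera-space 3D
    using the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    z = depth_m
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Hypothetical 2D bounding box from the detector; anchor at its center.
box = {"origin_x": 412, "origin_y": 280, "width": 160, "height": 220}
u = box["origin_x"] + box["width"] / 2
v = box["origin_y"] + box["height"] / 2

# Illustrative intrinsics and depth sample; ARCore supplies the real values.
anchor_cam = backproject(u, v, depth_m=0.85, fx=1450.0, fy=1450.0, cx=960.0, cy=540.0)
print(anchor_cam)  # 3D point in camera space; the camera pose maps it to world space
```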

MLLM coupling: Each object is paired with an MLLM session, which analyzes a cropped image to provide detailed information, like product specs or reviews. For instance, it can identify a "bottle" as "Superior dark soy sauce" and retrieve metadata, e.g., prices or ratings, using PaLI.
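A rough sketch of this coupling step is shown below. Since PaLI is not a public endpoint, `query_mllm` is a hypothetical placeholder for whatever multimodal model backs the session, and the bounding-box values are likewise illustrative.

```python
from PIL import Image

def query_mllm(image: Image.Image, prompt: str) -> str:
    """Hypothetical stand-in for the per-object MLLM session (PaLI in the
    prototype): send a cropped object image plus a text prompt to a
    multimodal model and return its text answer."""
    raise NotImplementedError("replace with your multimodal model endpoint")

frame = Image.open("frame.jpg")
x, y, w, h = 412, 280, 160, 220          # 2D box from the detector (illustrative values)
crop = frame.crop((x, y, x + w, y + h))  # only the object region is processed

# Refine the coarse class label ("bottle") into a specific product identity.
answer = query_mllm(crop, "What exact product is this? Include typical price and rating.")
```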


The XR-Objects processing pipeline combines MediaPipe and ARCore for object detection and spatial tracking, respectively, integrates an MLLM for object-specific metadata retrieval and interaction, and renders UI content in 3D spaces.

User commands: XR-Objects supports multimodal touch or voice interactions. For voice commands, a speech recognition engine processes queries that are then reflected in an overlaid panel on the object. Users can retrieve real-time object information or ask specific questions, with responses generated by the MLLM.
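As a rough desktop analogue of this flow (the prototype itself relies on the phone's on-device speech recognizer), the open-source SpeechRecognition Python package can turn a spoken query into text before it is routed to the selected object's MLLM session via the hypothetical `query_mllm` helper from the earlier sketch.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # brief noise calibration
    print("Ask a question about the selected object...")
    audio = recognizer.listen(source)

try:
    query = recognizer.recognize_google(audio)   # cloud speech-to-text (illustrative choice)
    print("Heard:", query)
    # response = query_mllm(selected_object_crop, query)  # route to that object's MLLM session
except sr.UnknownValueError:
    print("Could not understand the audio.")
```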

Comparisons: For object comparisons, XR-Objects allows multiple objects to be stitched together into a single query. The MLLM processes these and outputs a combined comparison, which is displayed to the user for easy interaction and decision-making.
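A minimal sketch of this stitching step, again reusing the hypothetical `query_mllm` helper, pastes cropped objects side by side so a single prompt can compare them; the crop regions below are illustrative.

```python
from PIL import Image

def stitch_horizontally(crops):
    """Combine several object crops into one image for a single comparison query."""
    height = max(c.height for c in crops)
    canvas = Image.new("RGB", (sum(c.width for c in crops), height), "white")
    x = 0
    for c in crops:
        canvas.paste(c, (x, 0))
        x += c.width
    return canvas

# Crops of two detected products (illustrative regions of the same frame).
frame = Image.open("frame.jpg")
crop_a = frame.crop((412, 280, 572, 500))
crop_b = frame.crop((640, 260, 800, 500))

combined = stitch_horizontally([crop_a, crop_b])
# answer = query_mllm(combined, "Compare these two products by price and sugar per serving.")
```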


XR-Objects instantiates a dedicated MLLM instance for identified objects in the scene. Object comparisons are executed by stitching together the relevant objects in the scene before passing the query to an MLLM instance.

Evaluation

We conducted a user study comparing the interaction flow of XR-Objects with that of a state-of-the-art MLLM assistant interface (the Gemini app), referred to as the "baseline" from here on, for object-centric tasks in simulated grocery-shopping and at-home environments. Eight participants completed six tasks using both systems, and their task completion time and qualitative feedback were recorded.


User-study setup with mock grocery store (a) and at-home (b) environments. Examples of using XR-Objects in each case are shown in (c) and (d), respectively.

Task completion time: On average, participants using XR-Objects completed tasks significantly faster (M=217.5s, SD=58s) than with the baseline (M=286.3s, SD=71s), confirmed by a paired t-test (t=-2.8, df=5, p=0.01). Put differently, tasks took roughly 31% longer with the baseline, highlighting the efficiency of XR-Objects for real-time object interaction.

HALIE survey: After completing tasks with both systems, participants filled out a survey adapted from the HALIE framework, which measures ease of use, satisfaction, and overall effectiveness in AI-driven human-computer interactions.

Participants rated both systems positively, but XR-Objects showed more consistent ease of use (skewness γ₁ = 0.03) compared to the baseline (γ₁ = 2.25), indicating a smoother experience across users. Satisfaction ratings were similar, but XR-Objects held a slight advantage in overall usability.


Likert-scale results of the HALIE survey.

Form factor: A post-study survey indicated a clear preference for using XR-Objects on a head-mounted display (HMD) (F(191, 179) = 1.917, p < 7.05e−08), validating its potential for more immersive environments. When using a smartphone, preferences were split between XR-Objects and the baseline, suggesting that the tool’s full potential is realized in an HMD context.

Qualitative feedback: Participants found XR-Objects intuitive and efficient. Compared to the baseline, users highlighted its ability to complete tasks more quickly. The evaluation demonstrated that XR-Objects offers significant improvements in task efficiency and user experience, particularly when integrated into future HMD platforms.

Applications

XR-Objects can enhance everyday interactions by enabling digital functions for analog objects. This expands their utility, for example by turning a pot into a cooking timer or comparing the nutritional information of different products.

Cooking

XR-Objects integrates digital intelligence directly into the kitchen, recognizing ingredients and providing details like nutritional facts or recipes. Users can interact via voice or touch, setting timers or asking for comparisons. This system is especially useful for multi-step tasks like cooking or mechanical repairs.


XR-Objects allows users to (a) select and interact with real-world objects. Automatically generated object-based AR context menus (b) attach information to objects in the scene, such as nutritional facts and ingredients. For example, a user (c) asks about the cooking time of pasta, and then (d) uses the answer to set a spatial timer widget anchored to the relevant pot.

Shopping

In stores, XR-Objects can assist users by providing details like prices, calorie info, or product comparisons. It could even translate labels and offer personalized recommendations based on user preferences.


Real-time spatial assistance in selecting the appropriate laundry detergent.

Discovery

XR-Objects helps users discover information about their surroundings: pointing a device at an object reveals details like its name or care instructions, transforming overlooked objects into learning opportunities.


Discovering plant species in the environment through spatial, on-demand MLLM queries.

Conclusion

We introduced XR-Objects, an Augmented Object Intelligence (AOI) system that blends physical objects with digital functions through advances in AR and AI. Our user study showed improved task efficiency and satisfaction, with promising applications in cooking, productivity, and connectivity, pointing toward more immersive interaction between the physical and digital worlds.

We have open-sourced the code for XR-Objects to encourage further community exploration and development. With this release, we aim to foster a new wave of innovation in XR, bringing the digital and physical worlds closer. Try it out today and explore a future where real-world interactions are enhanced through AI-driven augmentation.

Acknowledgments

This research was conducted by Mustafa Doga Dogan, Eric J. Gonzalez, Andrea Colaco, Karan Ahuja, Ruofei Du, Johnny Lee, Mar Gonzalez-Franco, and David Kim. The last two co-authors share co-senior authorship. We thank Guru Somadder and Adarsh Kowdle for their valuable assistance with this blog post.