Ruofei Du

Ruofei Du serves as Interactive Perception & Graphics Lead / Manager at Google and is devoted to creating novel interactive technologies for XR. As a Research Scientist, Ruofei's research covers a wide range of topics in technical HCI, Graphics, and Perception, including XR interactions, visual programming, augmented communication, XR social platforms, digital humans, foveated rendering, accessibility, and deep learning in graphics. Du serves as an Associate Chair on the program committees of CHI and UIST, and as an Associate Editor of IEEE TCSVT. He holds 3 US patents and has published over 30 peer-reviewed publications in top venues across HCI, Computer Graphics, and Computer Vision, including CHI, UIST, SIGGRAPH Asia, TVCG, CVPR, ICCV, ECCV, ISMAR, VR, and I3D. In his own words: I am passionate about inventing interactive technologies with graphics, perception, and HCI. See my research, artsy, projects, youtube, talks, github, and shadertoy demos for fun!


Personal Website
Google Scholar
Authored Publications
    UI Mobility Control in XR: Switching UI Positionings between Static, Dynamic, and Self Entities
    Siyou Pei
    Yang Zhang
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 12 (to appear)
    Abstract: Extended reality (XR) has the potential for seamless user interface (UI) transitions across people, objects, and environments. However, the design space, applications, and common practices of 3D UI transitions remain underexplored. To address this gap, we conducted a need-finding study with 11 participants, identifying and distilling a taxonomy based on three types of UI placements --- affixed to static, dynamic, or self entities. We further surveyed 113 commercial applications to understand the common practices of 3D UI mobility control, where only 6.2% of these applications allowed users to transition UI between entities. In response, we built interaction prototypes to facilitate UI transitions between entities. We report on results from a qualitative user study (N=14) on 3D UI mobility control using our FingerSwitches technique, which suggests that perceived usefulness is affected by types of entities and environments. We aspire to tackle a vital need in UI mobility within XR.
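A minimal sketch of the placement taxonomy described in this abstract: a UI panel anchored to a static, dynamic, or self entity, with a switch operation that reparents it. The class names, poses, and `switch_anchor` method are illustrative assumptions, not the paper's FingerSwitches implementation.

```python
# Sketch only: a UI panel whose anchor can switch between the three entity types.
from dataclasses import dataclass
from enum import Enum, auto

class EntityType(Enum):
    STATIC = auto()   # e.g., a wall or table
    DYNAMIC = auto()  # e.g., another person or a moving object
    SELF = auto()     # e.g., the user's own hand or body

@dataclass
class Entity:
    name: str
    kind: EntityType
    pose: tuple  # hypothetical (x, y, z) world position

@dataclass
class UIPanel:
    anchor: Entity
    offset: tuple = (0.0, 0.2, 0.0)

    def switch_anchor(self, new_anchor: Entity) -> None:
        """Reparent the panel to another entity (a FingerSwitches-style transition)."""
        self.anchor = new_anchor

    def world_position(self) -> tuple:
        ax, ay, az = self.anchor.pose
        ox, oy, oz = self.offset
        return (ax + ox, ay + oy, az + oz)

wall = Entity("wall", EntityType.STATIC, (0.0, 1.5, 2.0))
hand = Entity("left_hand", EntityType.SELF, (0.3, 1.1, 0.4))
panel = UIPanel(anchor=wall)
panel.switch_anchor(hand)  # the UI now follows the user's hand
print(panel.world_position())
```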
    Augmented Object Intelligence with XR-Objects
    Mustafa Doga Dogan
    Karan Ahuja
    Andrea Colaco
    Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST), ACM (2024), pp. 1-15
    Abstract: Seamless integration of physical objects as interactive digital entities remains a challenge for spatial computing. This paper explores Augmented Object Intelligence (AOI) in the context of XR, an interaction paradigm that aims to blur the lines between digital and physical by equipping real-world objects with the ability to interact as if they were digital, where every object has the potential to serve as a portal to digital functionalities. Our approach utilizes real-time object segmentation and classification, combined with the power of Multimodal Large Language Models (MLLMs), to facilitate these interactions without the need for object pre-registration. We implement the AOI concept in the form of XR-Objects, an open-source prototype system that provides a platform for users to engage with their physical environment in contextually relevant ways using object-based context menus. This system enables analog objects to not only convey information but also to initiate digital actions, such as querying for details or executing tasks. Our contributions are threefold: (1) we define the AOI concept and detail its advantages over traditional AI assistants, (2) detail the XR-Objects system’s open-source design and implementation, and (3) show its versatility through various use cases and a user study.
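A hedged sketch of the flow this abstract describes: detect an object in the camera frame, then drive an object-based context menu whose actions are answered by a multimodal LLM, with no pre-registration. The detector, the `ask_mllm` helper, and the menu actions are stand-ins, not XR-Objects' actual APIs.

```python
# Illustrative sketch only, with hypothetical detector and MLLM helpers.
from dataclasses import dataclass, field

@dataclass
class DetectedObject:
    label: str   # from a real-time classifier, e.g. "coffee bag"
    box: tuple   # (x, y, w, h) in the camera frame
    actions: list = field(default_factory=lambda: ["What is this?", "Compare prices", "Set a reminder"])

def detect_objects(frame) -> list:
    """Stand-in for real-time segmentation + classification."""
    return [DetectedObject(label="coffee bag", box=(120, 80, 60, 90))]

def ask_mllm(image_crop, prompt: str) -> str:
    """Stand-in for a multimodal LLM call grounded on the object crop."""
    return f"(MLLM answer to {prompt!r} about the cropped object)"

def open_context_menu(frame, obj: DetectedObject, choice: int) -> str:
    crop = frame  # a real system would crop the frame to obj.box
    return ask_mllm(crop, obj.actions[choice])

frame = object()  # placeholder for a camera frame
for obj in detect_objects(frame):
    print(obj.label, "->", open_context_menu(frame, obj, choice=0))
```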
    FaceFolds: Meshed Radiance Manifolds for Efficient Volumetric Rendering of Dynamic Faces
    Safa C. Medin
    Gengyan Li
    Stephan Garbin
    Philip Davidson
    Gregory W. Wornell
    Thabo Beeler
    Abhimitra Meka
    Proceedings of the ACM on Computer Graphics and Interactive Techniques, 7 (2024), pp. 1-17
    Abstract: 3D rendering of dynamic face captures is a challenging problem, and it demands improvements on several fronts---photorealism, efficiency, compatibility, and configurability. We present a novel representation that enables high-quality volumetric rendering of an actor's dynamic facial performances with minimal compute and memory footprint. It runs natively on commodity graphics software and hardware, and allows for a graceful trade-off between quality and efficiency. Our method utilizes recent advances in neural rendering, particularly learning discrete radiance manifolds to sparsely sample the scene to model volumetric effects. We achieve efficient modeling by learning a single set of manifolds for the entire dynamic sequence, while implicitly modeling appearance changes as temporal canonical texture. We export a single layered mesh and view-independent RGBA texture video that is compatible with legacy graphics renderers without additional ML integration. We demonstrate our method by rendering dynamic face captures of real actors in a game engine, at comparable photorealism to state-of-the-art neural rendering techniques at previously unseen frame rates.
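A minimal NumPy sketch of the compatibility idea above: once the performance is baked into layered RGBA textures, a legacy renderer only needs ordinary back-to-front alpha compositing. The layer contents here are random stand-ins, not the learned radiance manifolds.

```python
# Toy "over" compositing of layered RGBA textures, farthest layer first.
import numpy as np

H, W, LAYERS = 4, 4, 3
rng = np.random.default_rng(0)
# Per-layer RGBA textures for one frame of the texture video (values in [0, 1]).
layers = rng.random((LAYERS, H, W, 4))

def composite_back_to_front(rgba_layers: np.ndarray) -> np.ndarray:
    """Standard alpha-over compositing; assumes index 0 is the farthest layer."""
    out = np.zeros((H, W, 3))
    for layer in rgba_layers:
        rgb, a = layer[..., :3], layer[..., 3:4]
        out = rgb * a + out * (1.0 - a)
    return out

image = composite_back_to_front(layers)
print(image.shape)  # (4, 4, 3)
```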
    Experiencing Thing2Reality: Transforming 2D Content into Conditioned Multiviews and 3D Gaussian Objects for XR Communication
    Erzhen Hu
    Mingyi Li
    Seongkook Heo
    Adjunct Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, ACM (2024)
    Abstract: During remote communication, participants share both digital and physical content, such as product designs, digital assets, and environments, to enhance mutual understanding. Recent advances in augmented communication have facilitated users to swiftly create and share digital 2D copies of physical objects from video feeds into a shared space. However, the conventional 2D representation of digital objects restricts users’ ability to spatially reference items in a shared immersive environment. To address these challenges, we propose Thing2Reality, an Extended Reality (XR) communication platform designed to enhance spontaneous discussions regarding both digital and physical items during remote sessions. With Thing2Reality, users can quickly materialize ideas or physical objects in immersive environments and share them as conditioned multiview renderings or 3D Gaussians. Our system enables users to interact with remote objects or discuss concepts in a collaborative manner.
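A hypothetical sketch of the sharing flow described above: a 2D capture is lifted either to conditioned multiview renderings or to a 3D Gaussian object, then placed in the shared session. The lifting functions and asset formats are stand-ins, not the Thing2Reality models.

```python
# Sketch only: choose a 3D representation for a shared 2D capture.
def lift_to_multiviews(image, num_views: int = 4) -> list:
    """Stand-in for conditioned multiview generation."""
    return [f"view_{i} of {image}" for i in range(num_views)]

def lift_to_gaussians(image) -> dict:
    """Stand-in for 3D Gaussian reconstruction."""
    return {"source": image, "representation": "3D Gaussians"}

def share_object(image, mode: str, session: list) -> None:
    asset = lift_to_multiviews(image) if mode == "multiview" else lift_to_gaussians(image)
    session.append(asset)  # broadcast to the shared immersive session

session = []
share_object("product_photo.png", mode="gaussians", session=session)
print(session)
```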
    Experiencing InstructPipe: Building Multi-modal AI Pipelines via Prompting LLMs and Visual Programming
    Zhongyi Zhou
    Jing Jin
    Xiuxiu Yuan
    Jun Jiang
    Jingtao Zhou
    Yiyi Huang
    Kristen Wright
    Jason Mayes
    Mark Sherwood
    Ram Iyengar
    Na Li
    Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 5
    Abstract: Foundational multi-modal models have democratized AI access, yet the construction of complex, customizable machine learning pipelines by novice users remains a grand challenge. This paper demonstrates a visual programming system that allows novices to rapidly prototype multimodal AI pipelines. We first conducted a formative study with 58 contributors and collected 236 proposals of multimodal AI pipelines that served various practical needs. We then distilled our findings into a design matrix of primitive nodes for prototyping multimodal AI visual programming pipelines, and implemented a system with 65 nodes. To support users' rapid prototyping experience, we built InstructPipe, an AI assistant based on large language models (LLMs) that allows users to generate a pipeline by writing text-based instructions. We believe InstructPipe enhances novice users' onboarding experience of visual programming and the controllability of LLMs by offering non-experts a platform to easily update the generation.
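A hedged sketch of the core idea: given a text instruction and a catalog of primitive nodes, an LLM is asked to emit a node-graph pipeline. The node names, prompt format, JSON schema, and `call_llm` helper are assumptions for illustration, not InstructPipe's actual interface.

```python
# Sketch only: instruction -> LLM -> node-graph pipeline (as JSON).
import json

NODE_CATALOG = ["image_input", "object_detector", "llm_prompt", "text_output", "image_output"]

def build_prompt(instruction: str) -> str:
    return (
        "You compose visual-programming pipelines.\n"
        f"Available nodes: {', '.join(NODE_CATALOG)}\n"
        f"Instruction: {instruction}\n"
        'Reply with JSON: {"nodes": [...], "edges": [[src, dst], ...]}'
    )

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned pipeline here."""
    return json.dumps({
        "nodes": ["image_input", "object_detector", "text_output"],
        "edges": [["image_input", "object_detector"],
                  ["object_detector", "text_output"]],
    })

def generate_pipeline(instruction: str) -> dict:
    return json.loads(call_llm(build_prompt(instruction)))

print(generate_pipeline("Describe the objects the camera sees."))
```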
    Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers
    Phil A. Chou
    Hugues Hoppe
    Danhang Tang
    Jonathan Taylor
    Philip Davidson
    arXiv:2402.05887 (2024)
    Abstract: We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec’s performance on its intended content, it can effectively adapt the codec to other types of image/video content and to other distortion measures. Essentially, the sandwich learns to transmit “neural code images” that optimize overall rate-distortion performance even when the overall problem is well outside the scope of the codec’s design. Through a variety of examples, we apply the sandwich architecture to sources with different numbers of channels, higher resolution, higher dynamic range, and perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains or up to 3 adaptations. We derive VQ equivalents for the sandwich, establish optimality properties, and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, as well as sandwich configurations that offer interesting potentials in image/video compression and streaming.
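A conceptual NumPy sketch of one forward pass of the sandwich described above: a preprocessor maps the source to "neural code images", a differentiable codec proxy stands in for the standard codec, and a postprocessor reconstructs; training would minimize distortion plus a rate penalty. All three components here are toy stand-ins, not the paper's networks or proxy.

```python
# Sketch only: rate-distortion loss for a toy sandwich architecture.
import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.normal(scale=0.1, size=(3, 3))    # toy "pre-network" (per-pixel channel mix)
W_post = rng.normal(scale=0.1, size=(3, 3))   # toy "post-network"

def codec_proxy(code_img, step=0.1):
    """Toy proxy for a standard codec: quantization plus a crude rate estimate."""
    quantized = np.round(code_img / step) * step
    rate = np.log2(1.0 + np.abs(code_img / step)).mean()  # bits-per-pixel surrogate
    return quantized, rate

def sandwich_loss(source, lam=0.01):
    code = source @ W_pre                 # pre-process into neural code images
    decoded, rate = codec_proxy(code)     # "transmit" through the codec proxy
    recon = decoded @ W_post              # post-process back to the source domain
    distortion = np.mean((recon - source) ** 2)
    return distortion + lam * rate        # rate-distortion objective to minimize

source = rng.random((8, 8, 3))            # toy RGB image
print(sandwich_loss(source))
```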
    Human I/O: Towards Comprehensive Detection of Situational Impairments in Everyday Activities
    Xingyu Bruce Liu
    Jiahao Nick Li
    Xiang 'Anthony' Chen
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 18
    Abstract: Situationally Induced Impairments and Disabilities (SIIDs) can significantly hinder user experience in everyday activities. Despite their prevalence, existing adaptive systems predominantly cater to specific tasks or environments and fail to accommodate the diverse and dynamic nature of SIIDs. We introduce Human I/O, a real-time system that detects SIIDs by gauging the availability of human input/output channels. Leveraging egocentric vision, multimodal sensing and reasoning with large language models, Human I/O achieves good performance in availability prediction across 60 in-the-wild egocentric videos in 32 different scenarios. Further, while the core focus of our work is on the detection of SIIDs rather than the creation of adaptive user interfaces, we showcase the utility of our prototype via a user study with 10 participants. Findings suggest that Human I/O significantly reduces effort and improves user experience in the presence of SIIDs, paving the way for more adaptive and accessible interactive systems in the future.
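A hedged sketch of channel-availability detection in the spirit of the abstract: given a textual summary of the egocentric scene, a scorer rates how available each human input/output channel is. The channel list, the 0-1 scale, and the keyword heuristic standing in for multimodal sensing plus LLM reasoning are illustrative assumptions, not the Human I/O implementation.

```python
# Sketch only: rate availability of human I/O channels from a scene description.
CHANNELS = ["vision", "hearing", "hands", "speech"]

def score_channel(context: str, channel: str) -> float:
    """Stand-in for egocentric sensing + LLM reasoning; returns availability in [0, 1]."""
    busy_cues = {"hands": ["carrying", "typing", "washing"],
                 "vision": ["driving", "reading"],
                 "hearing": ["loud", "concert"],
                 "speech": ["meeting", "library"]}
    return 0.2 if any(cue in context for cue in busy_cues[channel]) else 0.9

def detect_siid(context: str) -> dict:
    return {ch: score_channel(context, ch) for ch in CHANNELS}

print(detect_siid("user is washing dishes in a loud kitchen"))
# Low 'hands' and 'hearing' availability suggests routing output to vision or speech.
```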
    ChatDirector: Enhancing Video Conferencing with Space-Aware Scene Rendering and Speech-Driven Layout Transition
    Brian Moreno Collins
    Karthik Ramani
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 16 (to appear)
    Abstract: Remote video conferencing systems (RVCS) are widely adopted in personal and professional communication. However, they often lack the co-presence experience of in-person meetings. This is largely due to the absence of intuitive visual cues and clear spatial relationships among remote participants, which can lead to speech interruptions and loss of attention. This paper presents ChatDirector, a novel RVCS that overcomes these limitations by incorporating space-aware visual presence and speech-aware attention transition assistance. ChatDirector employs a real-time pipeline that converts participants' RGB video streams into 3D portrait avatars and renders them in a virtual 3D scene. We also contribute a decision tree algorithm that directs the avatar layouts and behaviors based on participants' speech states. We report on results from a user study (N=16) where we evaluated ChatDirector. The satisfactory algorithm performance and complimentary subjective user feedback imply that ChatDirector significantly enhances communication efficacy and user engagement.
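A minimal sketch of a speech-state-driven layout rule in the spirit of the decision tree mentioned above; the speech states and layout names are illustrative assumptions, not ChatDirector's actual algorithm.

```python
# Sketch only: pick an avatar layout from participants' speech states.
def choose_layout(speech_states: dict, local_user: str) -> str:
    """speech_states maps participant -> 'speaking', 'about_to_speak', or 'silent'."""
    speakers = [p for p, s in speech_states.items() if s == "speaking"]
    if local_user in speakers:
        return "face_audience"            # orient remote avatars toward the local user
    if len(speakers) == 1:
        return f"focus_on:{speakers[0]}"  # enlarge / orient toward the single speaker
    if len(speakers) > 1:
        return "side_by_side"             # show concurrent speakers together
    return "neutral_circle"               # no one speaking: default spatial arrangement

print(choose_layout({"alice": "speaking", "bob": "silent", "me": "silent"}, local_user="me"))
```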
    Modeling and Improving Text Stability in Live Captions
    Xingyu "Bruce" Liu
    Jun Zhang
    Leonardo Ferrer
    Susan Xu
    Vikas Bahirwani
    Boris Smus
    Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM, 208:1-9
    Abstract: In recent years, live captions have gained significant popularity through their availability in remote video conferences, mobile applications, and the web. Unlike preprocessed subtitles, live captions require real-time responsiveness by showing interim speech-to-text results. As the prediction confidence changes, the captions may update, leading to visual instability that interferes with the user’s viewing experience. In this work, we characterize the stability of live captions by proposing a vision-based flickering metric using luminance contrast and Discrete Fourier Transform. Additionally, we assess the effect of unstable captions on the viewer through task load index surveys. Our analysis reveals significant correlations between the viewer's experience and our proposed quantitative metric. To enhance the stability of live captions without compromising responsiveness, we propose the use of tokenized alignment, word updates with semantic similarity, and smooth animation. Results from a crowdsourced study (N=123), comparing four strategies, indicate that our stabilization algorithms lead to a significant reduction in viewer distraction and fatigue, while increasing viewers' reading comfort.
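A hedged NumPy sketch of a flicker-style metric in the spirit of the one described: measure frame-to-frame luminance change in the caption region and examine its temporal frequency content with a DFT. The exact weighting and normalization here are illustrative assumptions, not the paper's metric.

```python
# Sketch only: luminance-difference spectrum as a caption-flicker score.
import numpy as np

def flicker_score(caption_frames: np.ndarray) -> float:
    """caption_frames: (T, H, W) grayscale luminance of the caption region over time."""
    diffs = np.abs(np.diff(caption_frames, axis=0)).mean(axis=(1, 2))  # per-frame change
    spectrum = np.abs(np.fft.rfft(diffs))
    # Emphasize non-DC temporal frequencies, where repeated updates read as flicker.
    return float(spectrum[1:].sum() / (spectrum[0] + 1e-8))

rng = np.random.default_rng(0)
stable = np.repeat(rng.random((1, 8, 32)), 30, axis=0)   # captions that never change
jittery = rng.random((30, 8, 32))                        # captions that change every frame
print(flicker_score(stable), "<", flicker_score(jittery))
```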
    Experiencing Visual Blocks for ML: Visual Prototyping of AI Pipelines
    Na Li
    Jing Jin
    Michelle Carney
    Jun Jiang
    Xiuxiu Yuan
    Kristen Wright
    Mark Sherwood
    Jason Mayes
    Lin Chen
    Jingtao Zhou
    Zhongyi Zhou
    Ping Yu
    Ram Iyengar
    ACM (2023) (to appear)
    Abstract: We demonstrate Visual Blocks for ML, a visual programming platform that facilitates rapid prototyping of ML-based multimedia applications. As the public version of Rapsai, the platform further integrates large language models and custom APIs. In this demonstration, we will showcase how to build interactive AI pipelines in a few drag-and-drops, how to perform interactive data augmentation, and how to integrate pipelines into Colabs. In addition, we demonstrate a wide range of community-contributed pipelines in Visual Blocks for ML, covering various aspects including interactive graphics, chains of large language models, computer vision, and multi-modal applications. Finally, we encourage students, designers, and ML practitioners to contribute ML pipelines through https://github.com/google/visualblocks/tree/main/pipelines to inspire creative use cases. Visual Blocks for ML is available at http://visualblocks.withgoogle.com.