Cady Tianyu Xu

Cady Tianyu Xu

Cady Tianyu Xu is a researcher on the GenAI team at Google DeepMind. She is currently developing LLM agents that move beyond open-ended generation toward closed-loop execution in dynamic environments. Her work addresses the gap between a model’s high-level reasoning and its ability to reliably execute multi-step tasks in the face of environmental uncertainty.

Portfolio | LinkedIn | Google Scholar

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Gaze Target Estimation Anywhere with Concepts
    Xu Cao
    Houze Yang
    Vipin Gunda
    Inki Kim
    Jim Rehg
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2026)
    Preview abstract Estimating human gaze targets in-the-wild is a formidable challenge. Existing computer vision algorithms rely on brittle, multi-stage pipelines that require explicit inputs like head bounding boxes and human pose, causing initial detection errors to cascade and lead to system failure. To overcome this, we introduce the \textbf{Promptable Gaze Target Estimation (PGE)} task, a new end-to-end, concept-driven paradigm. PGE conditions gaze prediction on flexible user text or visual prompts (e.g., "the boy in the red shirt" or "person in point [0.52, 0.48]") to identify a specific subject's target, which eliminates the rigid dependency on intermediate localization cues. We develop a scalable data engine to generate \textbf{Gaze-Co}, a dataset and benchmark of 120K high-quality, prompt-annotated image pairs. We also propose \textbf{AnyGaze}, the first model designed for PGE. AnyGaze uses a Transformer-based detector to fuse features from frozen encoders and simultaneously solves subject localization, in/out-of-frame presence, and gaze target heatmap estimation. AnyGaze achieves state-of-the-art performance on standard gaze target estimation benchmarks, setting a strong baseline for this new problem even on a difficult out-of-domain, real-world clinical dataset. We will open-source the AnyGaze model and the Gaze-Co benchmark. View details
    MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR
    Sieun Kim
    Qianhui Zheng
    Ruoyu Xu
    Ravi Tejasvi
    Anuva Kulkarni
    Junyi Zhu
    2026
    Preview abstract In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g. faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to five concurrent sources (e.g. two voices + three instruments) with ca. 2 second processing latency. We validate MoXaRt through a technical evaluation on a new, complex dataset we collected, and a 22-participant user study. Our results demonstrate that MoXaRt significantly improves communication clarity—boosting listening comprehension in noisy conditions by 33.2% (p=0.0058)—and significantly reduces cognitive load (M=7.50 vs. M=3.36, p<0.001), paving the way for more perceptive and socially adept XR experiences. View details
    EI-Lite: Electrical Impedance Sensing for Micro-gesture Recognition and Pinch Force Estimation
    Junyi Zhu
    Jiayu Wang
    Emily Guan
    JaeYoung Moon
    Stiven Morvan
    Andrea Colaco
    stefanie mueller
    Karan Ahuja
    Yiyue Luo
    Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, Association for Computing Machinery, New York, NY, USA (2025), pp. 1-14
    Preview abstract Micro-gesture recognition and fine-grain pinch press enables intuitive and discreet control of devices, offering significant potential for enhancing human-computer interaction (HCI). In this paper, we present EI-Lite, a lightweight wrist-worn electrical impedance sensing device for micro-gesture recognition and continuous pinch force estimation. We elicit an optimal and simplified device architecture through an ablation study on electrode placement with 13 users, and implement the elicited designs through 3D printing. We capture data on 15 participants on (1) six common micro-gestures (plus idle state) and (2) index finger pinch forces, then develop machine learning models that interpret the impedance signals generated by these micro-gestures and pinch forces. Our system is capable of accurate recognition of micro-gesture events (96.33% accuracy), as well as continuously estimating the pinch force of the index finger in physical units (Newton), with the mean-squared error (MSE) of 0.3071 (or mean-force-variance of 0.55 Newtons) over 15 participants. Finally, we demonstrate EI-Lite’s applicability via three applications in AR/VR, gaming, and assistive technologies. View details
    Preview abstract As large language models (LLMs) improve in their capacity to serve as personal AI assistants, their ability to output uniquely tailored, personalized responses that align with the soft preferences of their users is imperative for maximizing user satisfaction and retention. However, lay users are notoriously bad at prompt specification and often struggle with conveying their latent preferences to AI assistants. To resolve this, we demonstrate that activation steering, an inference-time method, can effectively control the response of the LLMs towards expressing different preferences. In contrast to memory-based personalization methods that require long user history, steering is extremely lightweight and easily-controllable via an interpretable linear strength factor. We further conduct a within-subjects user study (n=14) to investigate how end users personalize their conversations through three different steerable chatbot interfaces. The results demonstrate the effectiveness of preference-based steering for aligning real-world conversations with user preferences, and we discuss qualitative findings on how diverse values around control, transparency, and usability of personalization lead users to prefer different interfaces. View details
    Enhancing XR Auditory Realism via Scene-Aware Multimodal Acoustic Rendering
    Jihan Li
    Penghe Zu
    Pranav Sahay
    Maruchi Kim
    Jack Obeng-Marnu
    Farley Miller
    Mahitha Rachumalla
    Rajeev Nongpiur
    Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, Association for Computing Machinery, New York, NY, USA (2025), 1–16
    Preview abstract In Extended Reality (XR), rendering sound that accurately simulates real-world acoustics is pivotal in creating lifelike and believable virtual experiences. However, existing XR spatial audio rendering methods often struggle with real-time adaptation to diverse physical scenes, causing a sensory mismatch between visual and auditory cues that disrupts user immersion. To address this, we introduce SAMOSA, a novel on-device system that renders spatially accurate sound by dynamically adapting to its physical environment. SAMOSA leverages a synergistic multimodal scene representation by fusing real-time estimations of room geometry, surface materials, and semantic-driven acoustic context. This rich representation then enables efficient acoustic calibration via scene priors, allowing the system to synthesize a highly realistic Room Impulse Response (RIR). We validate our system through technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12). Evaluation results demonstrate SAMOSA’s feasibility and efficacy in enhancing XR auditory realism. View details
    ×