Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11268 publications
    MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR
    Sieun Kim
    Qianhui Zheng
    Ruoyu Xu
    Ravi Tejasvi
    Anuva Kulkarni
    Junyi Zhu
    2026
    Preview abstract In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g. faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to five concurrent sources (e.g. two voices + three instruments) with ca. 2 second processing latency. We validate MoXaRt through a technical evaluation on a new, complex dataset we collected, and a 22-participant user study. Our results demonstrate that MoXaRt significantly improves communication clarity—boosting listening comprehension in noisy conditions by 33.2% (p=0.0058)—and significantly reduces cognitive load (M=7.50 vs. M=3.36, p<0.001), paving the way for more perceptive and socially adept XR experiences. View details
    ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
    Guy Tennenholtz
    Jihwan Jeong
    The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL-26), Rabat, Morocco (2026)
    Preview abstract LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator—trained exclusively on optimal interactions—realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development. View details
    Preview abstract This writeup defines the Hydration Proxy Pattern, a framework for building stateful conversational data systems over stateless LLM APIs. It describes a platform-agnostic approach to decoupling persistence from the AI provider through secure server-side intermediation and hybrid storage tiers. The abstract provides a blueprint for managing the "Persistence Gap" in enterprise AI integrations, detailing high-level strategies for session history management, streaming, and multi-stage semantic grounding without disclosing specific internal implementation details. View details
    Improving Low-Vision Chart Accessibility via On-Cursor Visual Context
    Yotam Sechayk
    Hennes Rave
    Max Radler
    Mark Colley
    Ariel Shamir
    Takeo Igarashi
    Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI 26)
    Preview abstract Despite widespread use, charts remain largely inaccessible for Low-Vision Individuals (LVI). Reading charts requires viewing data points within a global context, which is difficult for LVI who may rely on magnification or experience a partial field of vision. We aim to improve exploration by providing visual access to critical context. To inform this, we conducted a formative study with five LVI. We identified four fundamental contextual elements common across chart types: axes, legend, grid lines, and the overview. We propose two pointer-based interaction methods to provide this context: Dynamic Context, a novel focus+context interaction, and Mini-map, which adapts overview+detail principles for LVI. In a study with N=22 LVI, we compared both methods and evaluated their integration to current tools. Our results show that Dynamic Context had significant positive impact on access, usability, and effort reduction; however, worsened visual load. Mini-map strengthened spatial understanding, but was less preferred for this task. We offer design insights to guide the development of future systems that support LVI with visual context while balancing visual load. View details
    Progressive Photorealistic Simplification
    Adi Rosenthal
    Yedid Hoshen
    Arik Shamir
    2026
    Preview abstract Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative \emph{Select–Remove–Verify} pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods. View details
    Phoenix: Rowhammer Attacks on DDR5 with Self-Correcting Synchronization
    Michele Marazzi
    Kaveh Razavi
    Salman Qazi
    Diego Meyer
    Patrick Jattke
    IEEE Security & Privacy (S&P) (2026)
    Preview abstract In some multi-stage software build pipelines, downstream compiler errors may be reported against ephemeral, machine-generated intermediate artifacts rather than original, human-written source code, which can make remediation challenging. A system and method may address this by intercepting a downstream error, mapping its location back to the original source file, and programmatically injecting a dormant suppression tag into the original source code. During a subsequent build, an intermediate transpiler can propagate this tag into a newly generated intermediate artifact. In the intermediate file, the tag may become active and be recognized by the downstream compiler as a directive to suppress the specific error. This approach can facilitate an automated remediation process for certain build failures that avoids direct modification of ephemeral files and uses the original source code as a record for suppression. View details
    Preview abstract Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated on the StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 scores from 57\% to 90.0\% and 67\% to 83\% respectively, enabling this compact model to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices. View details
    Preview abstract The remarkable success of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in 2D computer vision has catalyzed significant research into their adaptation for the complex domain of 3D analysis. However, a fundamental dichotomy exists between the regular, dense grid of 2D images and the irregular, sparse nature of 3D data formats such as point clouds and meshes. This paper provides a comprehensive survey and a novel intellectual framework for navigating this burgeoning field. Our core contribution is a new taxonomy that organizes adaptation strategies into three distinct families: (1) Data-centric methods, which project 3D data into 2D formats to leverage off-the-shelf 2D models; (2) Architecture-centric methods, which design intrinsic network modules to directly process 3D data; and (3) Hybrid methods, which synergistically combine pre-trained 2D features with 3D modeling processing pipelines to benefit from both rich visual priors and explicit geometric reasoning. Through this taxonomic lens, we conduct a systematic review and qualitative synthesis of the field. We illuminate the fundamental trade-offs between these families concerning computational complexity, reliance on large-scale pre-training, and the preservation of geometric inductive biases. Based on this analysis, we identify and discuss critical open challenges and chart promising future research directions, including the development of 3D foundation models, advancements in self-supervised learning for geometric data, and the deeper integration of multi-modal signals. This survey serves as an essential resource and roadmap for researchers seeking to understand and advance the state-of-the-art in 3D computer vision. View details
    Preview abstract Multimodal large language models (LLMs) integrate and process information from multiple modalities such as text, images, audio, and video, enabling complex tasks such as audio translation and visual question answering. While powerful, this complexity introduces novel vulnerabilities to sophisticated adversarial attacks. This survey paper provides a comprehensive overview of this rapidly expanding field, systematically categorizing attacks that range from manipulations of single modalities (e.g., perturbed images or audio) to those exploiting cross-modal interactions. We overview how these attacks exploit weaknesses in model fusion, attention mechanisms, and representation learning and provided analyses on their potential for real-world consequences. View details
    GUIDE: A Benchmark for User Context Understanding and Assistance in GUI Workflow Videos
    Saelyne Yang
    Jaesang Yu
    Yi-Hao Peng
    Kevin Qinghong Lin
    Jae Won Cho
    Juho Kim
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)
    Preview abstract Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software. While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency.To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI Understanding, Intent, and Help Decision Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations that surface user intent, across 10 complex software (e.g., PowerPoint, Photoshop). GUIDE defines three tasks—(i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model’s ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled with the tasks, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction. However, providing user context such as behavioral state and intent significantly improved the performance, raising help prediction by up to 50.2%. These results highlight the critical role of structured user understanding in effective assistance.Our benchmark provides a path toward GUI agents that go beyond automation to become truly user-aware collaborators. View details
    Preview abstract The field of Human-Computer Interaction is approaching a critical inflection point, moving beyond the era of static, deterministic systems into a new age of self-evolving systems. We introduce the concept of Adaptive generative interfaces that move beyond static artifacts to autonomously expand their own feature sets at runtime. Rather than relying on fixed layouts, these systems utilize generative methods to morph and grow in real-time based on a user’s immediate intent. The system operates through three core mechanisms: Directed synthesis (generating new features from direct commands), Inferred synthesis (generating new features for unmet needs via inferred commands), and Real-time adaptation (dynamically restructuring the interface's visual and functional properties at runtime). To empirically validate this paradigm, we executed a within-subject (repeated measures) comparative study (N=72) utilizing 'Penny,' a digital banking prototype. The experimental design employed a counterbalanced Latin Square approach to mitigate order effects, such as learning bias and fatigue, while comparing Deterministic interfaces baseline against an Adaptive generative interfaces. Participant performance was verified through objective screen-capture evidence, with perceived usability quantified using the industry-standard System Usability Scale (SUS). The results demonstrated a profound shift in user experience: the Adaptive generative version achieved a System Usability Scale (SUS) score of 84.38 ('Excellent'), significantly outperforming the Deterministic version’s score of 53.96 ('Poor'). With a statistically significant mean difference of 30.42 points (p < 0.0001) and a large effect size (d=1.04), these findings confirm that reducing 'navigation tax' through adaptive generative interfaces directly correlates with a substantial increase in perceived usability. We conclude that deterministic interfaces are no longer sufficient to manage the complexity of modern workflows. The future of software lies not in a fixed set of pre-shipped features, but in dynamic capability sets that grow, adapt, and restructure themselves in real-time to meet the specific intent of the user. This paradigm shift necessitates a fundamental transformation in product development, requiring designers to transcend traditional, linear workflows and evolve into 'System Builders'—architects of the design principles and rules that facilitate this new age of self-evolving software. View details
    Preview abstract Artificial intelligence is rapidly evolving, marked by the emergence of Large Language Model (LLM) agents – systems capable of complex reasoning, planning, and interaction with digital and physical environments. These agents, powered by advancements in LLMs, demonstrate remarkable capabilities across diverse domains, including finance, healthcare, web navigation, software development, and daily task assistance. Unlike traditional AI systems, LLM agents can perceive their surroundings, formulate multi-step plans, utilize external tools and APIs, access memory or knowledge bases, and execute actions to achieve specified goals. This ability to act upon the world, however, introduces significant safety and security challenges. The safety paradigms developed for traditional LLMs, primarily focused on mitigating harmful textual outputs (e.g., toxicity, bias), are insufficient for safeguarding LLM agents. Agents interacting with dynamic environments and executing actions present a broader attack surface and new categories of risk. These include performing unsafe operations, violating privacy constraints through improper data handling or access control failures, deviating from user objectives (task misalignment), and susceptibility to novel manipulation techniques like indirect prompt injection and memory poisoning. Ensuring the trustworthy operation of these powerful agents is paramount, especially as they are integrated into high-stakes applications. To address this critical challenge, we introduce VeriGuard, a novel framework designed to enhance the safety and reliability of LLM agents by interactively verifying their policies and the actions. VeriGuard integrates a verification module that intercepts code-based actions proposed by the agent. In the first step, VeriGuard will generates and verifies the policies. The policies are rigorously checked against a set of predefined safety and security specifications Then each action will be verified to make sure it will align with the agent specification. This interactive verification loop ensures that the agent's behavior remains within safe operational bounds, effectively preventing the execution of harmful or unintended operations. By verifying each step, VeriGuard provides a robust safeguard, substantially improving the trustworthiness of LLM agents in complex, real-world environments. View details
    Exponential quantum advantage in processing massive classical data
    Haimeng Zhao
    Alexander Zlokapa
    John Preskill
    Hsin-Yuan (Robert) Huang
    arXiv:2604.07639 (2026)
    Preview abstract Broadly applicable quantum advantage, particularly in classical data processing and machine learning, has been a fundamental open problem. In this work, we prove that a small quantum computer of polylogarithmic size can perform large-scale classification and dimension reduction on massive classical data by processing samples on the fly, whereas any classical machine achieving the same prediction performance requires exponentially larger size. Furthermore, classical machines that are exponentially larger yet below the required size need superpolynomially more samples and time. We validate these quantum advantages in real-world applications, including single-cell RNA sequencing and movie review sentiment analysis, demonstrating four to six orders of magnitude reduction in size with fewer than 60 logical qubits. These quantum advantages are enabled by quantum oracle sketching, an algorithm for accessing the classical world in quantum superposition using only random classical data samples. Combined with classical shadows, our algorithm circumvents the data loading and readout bottleneck to construct succinct classical models from massive classical data, a task provably impossible for any classical machine that is not exponentially larger than the quantum machine. These quantum advantages persist even when classical machines are granted unlimited time or if BPP=BQP, and rely only on the correctness of quantum mechanics. Together, our results establish machine learning on classical data as a broad and natural domain of quantum advantage and a fundamental test of quantum mechanics at the complexity frontier. View details
    Preview abstract Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization(APO) frameworks target task accuracy exclusively at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines. View details
    ×