Machine Intelligence

Google is at the forefront of innovation in Machine Intelligence, with active research exploring virtually all aspects of machine learning, including deep learning and more classical algorithms. Exploring theory as well as application, we rely on Machine Intelligence for much of our work on language, speech, translation, visual processing, ranking and prediction. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, applying learning algorithms to understand and generalize.

Machine Intelligence at Google raises deep scientific and engineering challenges, allowing us to contribute to the broader academic research community through technical talks and publications in major conferences and journals. Contrary to much of current theory and practice, the statistics of the data we observe shift rapidly, the features of interest change as well, and the volume of data often requires enormous computation capacity. When learning systems are placed at the core of interactive services in a fast-changing and sometimes adversarial environment, techniques including deep learning and statistical models need to be combined with ideas from control and game theory.

Recent Publications

Preview abstract The field of Human-Computer Interaction is approaching a critical inflection point, moving beyond the era of static, deterministic systems into a new age of self-evolving systems. We introduce the concept of Adaptive generative interfaces that move beyond static artifacts to autonomously expand their own feature sets at runtime. Rather than relying on fixed layouts, these systems utilize generative methods to morph and grow in real time based on a user’s immediate intent. The system operates through three core mechanisms: Directed synthesis (generating new features from direct commands), Inferred synthesis (generating new features for unmet needs via inferred commands), and Real-time adaptation (dynamically restructuring the interface's visual and functional properties at runtime). To empirically validate this paradigm, we executed a within-subject (repeated measures) comparative study (N=72) utilizing 'Penny,' a digital banking prototype. The experimental design employed a counterbalanced Latin Square approach to mitigate order effects, such as learning bias and fatigue, while comparing a Deterministic interface baseline against an Adaptive generative interface. Participant performance was verified through objective screen-capture evidence, with perceived usability quantified using the industry-standard System Usability Scale (SUS). The results demonstrated a profound shift in user experience: the Adaptive generative version achieved a SUS score of 84.38 ('Excellent'), significantly outperforming the Deterministic version’s score of 53.96 ('Poor'). With a statistically significant mean difference of 30.42 points (p < 0.0001) and a large effect size (d=1.04), these findings confirm that reducing 'navigation tax' through adaptive generative interfaces directly correlates with a substantial increase in perceived usability.
We conclude that deterministic interfaces are no longer sufficient to manage the complexity of modern workflows. The future of software lies not in a fixed set of pre-shipped features, but in dynamic capability sets that grow, adapt, and restructure themselves in real-time to meet the specific intent of the user. This paradigm shift necessitates a fundamental transformation in product development, requiring designers to transcend traditional, linear workflows and evolve into 'System Builders'—architects of the design principles and rules that facilitate this new age of self-evolving software. View details
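For readers unfamiliar with how a System Usability Scale value such as 84.38 is produced, the following is a minimal sketch of the standard SUS scoring rule, assuming the usual ten-item questionnaire with responses from 1 to 5. The function name `sus_score` is illustrative and is not taken from the study; a study-level score like those quoted above would be the mean of such per-participant scores.

```python
def sus_score(responses):
    """Score one ten-item SUS questionnaire on a 0-100 scale.

    `responses` holds the ten answers in questionnaire order, each 1-5.
    (Illustrative helper; not code from the study.)
    """
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("SUS needs ten responses, each from 1 to 5")
    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: contribute response - 1.
            total += response - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: contribute 5 - response.
            total += 5 - response
    # The summed contributions (0-40) are scaled to 0-100.
    return total * 2.5


if __name__ == "__main__":
    print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))
```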
MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR
Sieun Kim
Qianhui Zheng
Ruoyu Xu
Ravi Tejasvi
Anuva Kulkarni
Junyi Zhu
2026
Preview abstract In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g. faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to five concurrent sources (e.g. two voices + three instruments) with ca. 2 second processing latency. We validate MoXaRt through a technical evaluation on a new, complex dataset we collected, and a 22-participant user study. Our results demonstrate that MoXaRt significantly improves communication clarity—boosting listening comprehension in noisy conditions by 33.2% (p=0.0058)—and significantly reduces cognitive load (M=7.50 vs. M=3.36, p<0.001), paving the way for more perceptive and socially adept XR experiences. View details
On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Yehonathan Refael
Amit Aides
Aviad Barzilai
Vered Silverman
Bolous Jaber
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops (2026), pp. 886-894
Preview abstract Open-vocabulary object detection (OVD) models offer remarkable flexibility by enabling object detection from arbitrary text queries. Still, the zero-shot performance of the pre-trained models is hampered by the inherent semantic ambiguity of natural language, resulting in low precision that is insufficient for crucial downstream applications. For instance, in the remote sensing (RS) domain, a query for "ship" can yield varied and contextually irrelevant results. To address this for real-time applications, we propose a novel cascaded architecture that synergizes the broad capabilities of a large, pre-trained OVD model with a lightweight, few-shot classifier. Our approach utilizes the frozen weights of the zero-shot model to generate initial, high-recall object-embedding proposals, which are then refined by a compact classifier trained in real time on a handful of user-annotated examples. The core of our contribution is an efficient one-step active learning strategy for selecting the most informative samples for user annotation. Our method identifies an (extremely) small number of uncertain candidates near the theoretical decision boundary using density estimation and then applies clustering to ensure a diverse training set. This targeted sampling enables our cascaded system to elevate performance on standard remote sensing benchmarks. Our work thus presents a practical and resource-efficient framework for adapting foundational models to specific user needs, drastically reducing annotation overhead while achieving high accuracy without costly full-model fine-tuning. View details
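The sampling idea in this abstract (pick a few uncertain candidates near the decision boundary, then diversify them) can be illustrated with a rough sketch. This is not the FLAME implementation: the abstract does not give the algorithm's details, so the sketch substitutes a simple score margin (distance from 0.5) for the paper's density estimation and greedy farthest-point selection for its clustering step. The function name, parameters, and the 0.5 boundary are all assumptions made for illustration.

```python
import math


def select_marginal_samples(embeddings, scores, n_candidates, n_select):
    """Return indices of a small, diverse set of uncertain proposals.

    Hypothetical stand-in for the abstract's strategy:
      1. uncertainty: keep the n_candidates proposals whose score is closest
         to an assumed decision boundary of 0.5 (swapped in for density
         estimation);
      2. diversity: greedily pick points that are far apart in embedding
         space (swapped in for clustering).
    """
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    pool = order[:n_candidates]
    if not pool or n_select <= 0:
        return []
    chosen = [pool[0]]  # start from the most uncertain proposal
    while len(chosen) < min(n_select, len(pool)):
        remaining = [i for i in pool if i not in chosen]
        # Pick the candidate whose nearest already-chosen point is farthest away.
        farthest = max(
            remaining,
            key=lambda i: min(math.dist(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(farthest)
    return chosen


if __name__ == "__main__":
    demo_embeddings = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 9.0), (1.0, 1.0)]
    demo_scores = [0.50, 0.52, 0.48, 0.99, 0.01]
    print(select_marginal_samples(demo_embeddings, demo_scores, 3, 2))
```

The returned indices would be the proposals shown to the user for annotation; confident proposals (scores near 0 or 1) are never selected.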
VISTA: A Test-Time Self-Improving Video Generation Agent
Xuan Long Do
Hootan Nakhost
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (to appear) (2026)
Preview abstract Despite rapid advances in text-to-video (T2V) synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. To address this, we introduce VISTA, a novel multi-agent system that autonomously refines prompts to improve video generation. VISTA operates in an iterative loop, first decomposing a user's idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. To rigorously evaluate our proposed approach, we introduce MovieGen-Bench, a new benchmark of diverse single- and multi-scene video generation tasks. Experiments show that while prior methods yield inconsistent gains, VISTA consistently improves video quality, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA's outputs in 68% of comparisons. View details
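As an illustration of the selection step mentioned in this abstract, here is a minimal sketch of a round-robin pairwise tournament. The abstract does not say how VISTA's tournament is organized or judged, so the round-robin format, the `prefer` comparator, and the tie-breaking rule are all assumptions; in practice the comparator would be a judge model rather than the toy lambda in the demo.

```python
from itertools import combinations


def pairwise_tournament(candidates, prefer):
    """Return the candidate that wins the most head-to-head comparisons.

    `prefer(a, b)` is a hypothetical judge that returns True when `a` is
    preferred over `b`. Ties in win count go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if prefer(candidates[i], candidates[j]) else j
        wins[winner] += 1
    best_index = max(range(len(candidates)), key=lambda k: wins[k])
    return candidates[best_index]


if __name__ == "__main__":
    # Toy judge: prefer the longer string.
    print(pairwise_tournament(["a", "bbb", "cc"], lambda x, y: len(x) >= len(y)))
```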
Preview abstract Validating conversational artificial intelligence (AI) for regulated medical software applications may present challenges, as static test datasets and manual review may be limited in identifying emergent, conversational anomalies. A multi-agent AI system may be configured in a closed loop for automated validation. The system can, for example, utilize an end-user persona simulator agent to generate prompts for a target model and a domain/regulatory expert adjudicator agent to evaluate the target model’s responses against a configurable rubric. A meta-analysis agent can analyze anomalies to identify underlying vulnerabilities, which may then be used to programmatically synthesize new adversarial personas. This adaptive process can generate evidence to support regulatory compliance and continuous performance monitoring for medical software algorithms. View details
A Framework for Interactive Machine Learning and Enhanced Conversational Systems
Jerry Young
Richard Abisla
Sanjay Batra
Mikki Phan
Nature, Springer-Verlag (2026)
Preview abstract Conversational systems are increasingly prevalent, yet current versions often fail to support the full range of human speech, including variations in speed, rhythm, syntax, grammar, articulation, and resonance. This reduces their utility for individuals with dysarthria, apraxia, dysphonia, and other language and speech-related disabilities. Building on research that emphasizes the need for specialized datasets and model training tools, our study uses a scaffolded approach to understand the ideal model training and voice recording process. Our findings highlight two distinct user flows for improving model training and provide six guidelines for future conversational system-related co-design frameworks. This study offers important insights on creating more effective conversational systems by emphasizing the need to integrate interactive machine learning into training strategies. View details