Cloud AI

We conduct groundbreaking research with the goal of infusing AI into Google Cloud products and infrastructure.

About the team

The Cloud AI Research team, a dynamic group of scientists and engineers, is dedicated to conducting transformative, high-impact research and achieving fundamental breakthroughs in artificial intelligence and AI systems. We explore novel, high-potential directions, pioneering a new class of "Co-X" agents designed to automate and augment complex human tasks.

Our projects range from developing AI agents that can generate and validate award-worthy research papers to those that manage complex data center networks and power consumption. We are also developing state-of-the-art agents for ML engineering and data science, and exploring creative frontiers with agents that can direct long-form video content. Foundational to this is our work on core agent capabilities, such as long-term memory, automated multi-agent system design, and verifiable safety. We collaborate closely with partners to ship these innovations, ensuring our breakthroughs advance both Google's products and the state of science.

Team focus summaries

Expert & vertical agents

We develop "Co-X" agents designed to automate and augment complex professional workflows. This includes pioneering agents for AI research (Co-AI Researcher), ML engineering (Co-ML Engineer), Data Science (Co-Data Scientist), network management (Co-Network Engineer), AI scientist (Co-Scientist) and creative content generation (Co-Director).

Foundational agent capabilities

Our research builds the core technologies that enable more powerful and scalable agents. Key areas include developing robust long-term memory (Reflective Memory Management), automating the design of effective multi-agent systems (Agent Co-Designer), and establishing verifiable agent safety guardrails.

Deep reasoning and research

We focus on advancing the deep reasoning capabilities of AI agents. This work aims to push the state of the art for systems that can conduct complex, multi-step research and analysis, with the goal of significantly outperforming existing systems on industry benchmarks.

Featured publications

Deep Researcher with Test-time Diffusion
Guan Sun, Zoey CuiZhu, Yuanjun (Sophia) Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Burak Gokturk
2025
Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design guides the report writing process to be more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.
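
To make the draft-as-diffusion loop concrete, the sketch below shows one way the retrieve-and-revise cycle around an evolving draft could look. It is an illustration under stated assumptions, not the paper's implementation: call_llm and search are hypothetical placeholders, and the self-evolutionary optimization of individual workflow components is omitted.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat/completion model call."""
    raise NotImplementedError

def search(query: str) -> str:
    """Hypothetical placeholder for a retrieval step (web search or a RAG index)."""
    raise NotImplementedError

def ttd_dr_report(question: str, steps: int = 5) -> str:
    # Start from a preliminary draft that acts as an updatable skeleton.
    draft = call_llm(f"Write a rough, incomplete draft answering: {question}")
    for _ in range(steps):
        # Let the current draft decide what to retrieve next.
        query = call_llm(
            f"Question: {question}\nDraft so far:\n{draft}\n"
            "Write one search query for the most important missing information."
        )
        evidence = search(query)
        # "Denoise" the draft by revising it with the retrieved evidence.
        draft = call_llm(
            f"Question: {question}\nCurrent draft:\n{draft}\n"
            f"New evidence:\n{evidence}\n"
            "Revise the draft, keeping its structure and filling gaps or fixing errors."
        )
    return draft
```
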
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities—utterances, turns, and sessions—into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs' cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
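
A deliberately simplified sketch of the two reflection steps follows. It assumes hypothetical summarize and similarity helpers, drops utterance-level summaries, and reduces the online RL retrieval update to a scalar weight per memory entry, so it illustrates the structure rather than the paper's method.

```python
from dataclasses import dataclass, field

def summarize(text: str) -> str:
    """Hypothetical placeholder for an LLM summarization call."""
    return text

def similarity(query: str, memory: str) -> float:
    """Hypothetical placeholder for an embedding-based relevance score."""
    return float(len(set(query.split()) & set(memory.split())))

@dataclass
class MemoryEntry:
    text: str             # summary at turn or session granularity
    granularity: str
    weight: float = 1.0   # adjusted by retrospective reflection

@dataclass
class MemoryBank:
    entries: list = field(default_factory=list)

    def prospective_reflection(self, turns: list) -> None:
        # Summarize the finished session at multiple granularities for future retrieval.
        for turn in turns:
            self.entries.append(MemoryEntry(summarize(turn), "turn"))
        self.entries.append(MemoryEntry(summarize(" ".join(turns)), "session"))

    def retrieve(self, query: str, k: int = 3) -> list:
        ranked = sorted(self.entries,
                        key=lambda e: e.weight * similarity(query, e.text),
                        reverse=True)
        return ranked[:k]

    def retrospective_reflection(self, retrieved: list, cited: list) -> None:
        # Reinforce entries the LLM actually cited as evidence; decay the others.
        for entry in retrieved:
            entry.weight *= 1.1 if entry in cited else 0.9
```
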
Agents based on large language models (LLMs) for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to building such agents often rely heavily on inherent LLM knowledge and employ coarse exploration strategies that modify the entire code structure at once. This limits their ability to select effective task-specific models and perform deep exploration within specific components, such as experimenting extensively with feature engineering options. To overcome these limitations, we propose MLE-STAR, a novel approach to building MLE agents. MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web, forming an initial solution, then iteratively refines it by exploring various strategies targeting specific ML components. This exploration is guided by ablation studies analyzing the impact of individual code blocks. Furthermore, we introduce a novel ensembling method using an effective strategy suggested by MLE-STAR. Our experimental results show that MLE-STAR achieves medals in 64% of the Kaggle competitions on MLE-bench Lite, significantly outperforming the best alternative.
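
The ablation-guided refinement loop can be sketched roughly as below. The run_pipeline, refine_block, and baseline_block callbacks are hypothetical stand-ins for the agent's code execution and LLM rewriting steps, and the web-search initialization and ensembling stage are not shown.

```python
def mle_star_refine(solution: dict, run_pipeline, refine_block, baseline_block,
                    rounds: int = 10) -> dict:
    """Ablation-guided targeted refinement over named code blocks.

    solution: maps block name -> code string (the initial, web-grounded solution).
    run_pipeline(solution) -> float: hypothetical callback executing the ML
        pipeline and returning a validation score.
    refine_block(name, code) -> str: hypothetical LLM callback rewriting one block.
    baseline_block(name) -> str: trivial stand-in code used for the ablation.
    """
    best = run_pipeline(solution)
    for _ in range(rounds):
        # Ablation study: estimate how much each block contributes to the score.
        impact = {name: best - run_pipeline({**solution, name: baseline_block(name)})
                  for name in solution}
        target = max(impact, key=impact.get)      # most influential block
        # Deep exploration: rewrite only the targeted block, keep it if it helps.
        candidate = {**solution, target: refine_block(target, solution[target])}
        score = run_pipeline(candidate)
        if score > best:
            solution, best = candidate, score
    return solution
```
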
We propose Heterogeneous Swarms, an algorithm to discover and adapt multi-LLM systems by jointly optimizing model roles and weights. Given a pool of LLM experts and a utility function, Heterogeneous Swarms employs two iterative steps: role-step and weight-step. For role-step, we interpret model roles as input-output relationships and optimize the directed acyclic graph (DAG) of LLMs representing a multi-LLM system. Starting from a swarm of randomly initialized continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs in topological order with message passing, evaluate on the utility function, and optimize the adjacency matrices with swarm intelligence based on the utility score. For weight-step, we define JFK-score to evaluate the contribution of individual LLMs in the best-found DAG of the role-step, then optimize model weights with swarm intelligence based on the JFK-score. Extensive experiments demonstrate that Heterogeneous Swarms outperforms 15 baselines spanning role-based and weight-based approaches by 18.5% on average across 12 tasks and contexts. Further analysis reveals that Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles and substantial collaborative gains, and benefits from the diversity of initial LLMs.
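
As a rough illustration of the role-step only, the sketch below decodes a continuous adjacency matrix into a DAG and calls the experts in topological order with message passing. The swarm-intelligence updates of the matrices and the weight-step are omitted, call_expert is a hypothetical callback, and restricting edges to the upper triangle is a simplification that guarantees acyclicity.

```python
import numpy as np

def decode_and_run(adj: np.ndarray, task: str, experts: list, call_expert,
                   threshold: float = 0.5) -> str:
    """Decode a continuous adjacency matrix into a DAG of LLM roles and run it.

    adj[i, j] > threshold means expert i's output feeds expert j. Keeping only the
    upper triangle makes the decoded graph acyclic by construction (a simplification
    of the paper's decoding). call_expert(expert, task, inputs) is a hypothetical
    callback that queries one LLM with the task and its upstream messages.
    """
    dag = np.triu((adj > threshold).astype(int), k=1)
    n = len(experts)
    messages = [""] * n
    for j in range(n):                                        # topological order 0..n-1
        upstream = [messages[i] for i in range(n) if dag[i, j]]
        messages[j] = call_expert(experts[j], task, upstream)  # message passing
    return messages[-1]                                        # answer from the final node
```
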
Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD are adversely impacted by the knowledge gap between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training on a static dataset and inference over the student's own generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student's inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
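
A minimal sketch of the interleaved sampling idea is shown below, assuming hypothetical student and teacher objects that expose next-token distributions. The acceptance rule here (keep the student's token only if it falls in the teacher's top-k) is one simple instantiation; the resulting sequences would then serve as on-the-fly training data for distillation.

```python
import random

def sample(probs: dict) -> int:
    """Draw one token id from a {token_id: probability} distribution."""
    ids, weights = zip(*probs.items())
    return random.choices(ids, weights=weights, k=1)[0]

def skd_generate(student, teacher, prompt_ids: list, max_len: int = 128,
                 top_k: int = 25) -> list:
    """Interleaved sampling: the student proposes tokens, the teacher keeps or replaces them.

    student and teacher are hypothetical objects exposing
    next_token_probs(tokens) -> {token_id: probability}.
    """
    tokens = list(prompt_ids)
    for _ in range(max_len):
        proposal = sample(student.next_token_probs(tokens))   # student proposal
        t_probs = teacher.next_token_probs(tokens)            # teacher check
        top_teacher = sorted(t_probs, key=t_probs.get, reverse=True)[:top_k]
        # Keep the student's token only if the teacher also ranks it highly;
        # otherwise replace it with a token drawn from the teacher's distribution.
        tokens.append(proposal if proposal in top_teacher else sample(t_probs))
    return tokens
```
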
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
Zilong Wang, Steven Zheng, Swaroop Mishra, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang
ICLR 2025
Retrieval augmented generation (RAG) has attracted a lot of attention across both academia and industry due to its ability to insert timely and accurate evidence into the generation of large language models. However, the retrieved evidence makes the input prompt substantially longer, which can degrade the understanding quality of large language models and slow them down in actual usage scenarios. To address these issues, we propose Speculative RAG, which leverages a smaller LLM to conduct retrieval augmented generation for a larger LLM. The smaller LLM digests a few pieces of evidence at a time and rapidly generates multiple drafts in parallel; these drafts are then verified by the larger LLM to ensure quality. This yields both higher speed and better quality in the RAG results.
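
The drafter-verifier split can be sketched as follows, assuming hypothetical retrieve, draft_llm, and verify_llm helpers; the paper's specific document partitioning and verification scoring are not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve(question: str) -> list:
    """Hypothetical placeholder returning retrieved documents."""
    raise NotImplementedError

def draft_llm(question: str, docs: list) -> str:
    """Hypothetical placeholder: a small, fast LLM writing one draft answer."""
    raise NotImplementedError

def verify_llm(question: str, draft: str) -> float:
    """Hypothetical placeholder: a large LLM scoring a draft."""
    raise NotImplementedError

def speculative_rag(question: str, num_drafts: int = 4) -> str:
    docs = retrieve(question)
    # Split the evidence so each draft is grounded in a small document subset.
    subsets = [docs[i::num_drafts] for i in range(num_drafts)]

    # The small drafter produces candidate answers in parallel.
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(lambda s: draft_llm(question, s), subsets))

    # The large model only verifies short drafts instead of reading all evidence.
    scores = [verify_llm(question, d) for d in drafts]
    return drafts[scores.index(max(scores))]
```
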
Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In response, researchers train smaller task-specific models by either finetuning with human labels or distilling with LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve performance comparable to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales as additional supervision for small models within a multi-task training framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our 770M T5 model outperforms the 540B PaLM model using only 80% of available data on a benchmark task.
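
The core mechanism is a multi-task objective in which the small model both predicts the label and generates the LLM-extracted rationale. Below is a minimal sketch assuming a Hugging Face-style seq2seq model and tokenizer (e.g. T5); the task prefixes and the loss weight lam are illustrative choices rather than the paper's exact setup.

```python
import torch

def distilling_step_by_step_loss(model, tokenizer, example, lam: float = 0.5):
    """Multi-task loss sketch: predict the label and generate the teacher's rationale.

    Assumes a seq2seq model/tokenizer pair from Hugging Face Transformers; the
    prefixes "[label]" and "[rationale]" are illustrative stand-ins.
    """
    x = example["input"]        # task input
    y = example["label"]        # target label
    r = example["rationale"]    # rationale extracted from the teacher LLM

    def seq2seq_loss(prefix: str, target: str) -> torch.Tensor:
        batch = tokenizer(prefix + x, return_tensors="pt")
        labels = tokenizer(target, return_tensors="pt").input_ids
        return model(**batch, labels=labels).loss

    label_loss = seq2seq_loss("[label] ", y)          # learn to answer
    rationale_loss = seq2seq_loss("[rationale] ", r)  # learn to explain
    return label_loss + lam * rationale_loss
```
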