Cloud AI

Our mission is to spread useful AI effectively around the world.


About the team

The Google Cloud AI Research team tackles AI research challenges motivated by Google Cloud’s mission of bringing AI to tech, healthcare, finance, retail and many other industries. We work on a range of unique high-impact problems with the goal of maximizing both scientific and real-world impact – both pushing the state-of-the-art in AI (>60 papers published at top research venues over the past four years) and collaborating across teams to bring innovations to production (e.g., 1, 2, 3).

Some recent directions for Cloud AI Research include:

  • Developing improved foundation models to solve challenges like hallucinations, data efficiency and generalization.
  • Improved adaptation methods, including distillation, task customization, grounding and multimodality.
  • Developing large language models (LLMs) for data types that are a high priority to enterprises, such as structured data.
  • Building LLMs for tool use.
  • Retrieval-augmented LLMs and LLM-assisted search.
  • Improved LLM usability through automated prompting, explainability and reliability.

Team focus summaries

Large language models for enterprise

Cloud AI researchers develop new large language models for problems that are critical to enterprise customers. These include innovative ways to distill large models while maintaining high performance; improving embeddings of large language models; translating natural language queries to business domain-specific languages like SQL; inventing new large multimodal models that learn from multiple modalities like text, image and structured data; scaling LLM tool usage to a large number of tools; and automatically designing prompts for language models.

Explainable AI

Explainability is required to effectively use AI in real-world applications such as finance, healthcare, retail, manufacturing and others. Data scientists, business decision makers, regulators and others all need to know why AI models make certain decisions. Our researchers are working on a wide range of approaches to increase model explainability, including sample-based, feature-based and concept-based methods. These methods draw on reinforcement learning, attention-based architectures, prototypical learning and surrogate model optimization, and they target the data types and high-impact tasks that deployments require.

Data-efficient learning

Data-efficient learning is important because many AI deployments must train models with only hundreds of training examples. To this end, Cloud AI researchers conduct research into active learning, self-supervised representation learning, transfer learning, domain adaptation and meta-learning.

High-impact enterprise data types

Cloud AI researchers are looking at ways to advance the state of the art for specific data types such as time series and tabular data (two of the most common data types in AI deployments), which have received significantly less focus in the research community compared to other data types. In time series, we are actively developing new deep learning models with complex inputs – for example, the team’s novel Temporal Fusion Transformer architecture is state-of-the-art in terms of performance across a wide range of datasets. In tabular data, we developed TabNet, a new deep learning method for tabular data that achieves state-of-the-art performance on many datasets and yields interpretable insights.

Specific important enterprise use cases

Cloud AI researchers also conduct research targeting specific enterprise use cases, such as recommendation systems, which play a key role in the retail industry and face challenges in personalization, contextualization, trending, and diversification. We develop recommendation models that support event time-aware features, which capture user history events effectively for homepage recommendations. We also work on end-to-end document understanding, which requires a holistic comprehension of structured information across a variety of documents. In addition, we recently contributed to society by providing a novel approach to forecasting the progression of COVID-19 that integrates machine learning into compartmental disease modeling.

Featured publications

Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales as additional supervision for small models within a multi-task training framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our 770M T5 model outperforms the 540B PaLM model using only 80% of available data on a benchmark task.
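The multi-task setup described in this abstract can be pictured with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: the task prefixes "[label]" and "[rationale]", the Example fields, and the rationale_weight parameter are all hypothetical names chosen here. The idea shown is that each training example yields two input/target pairs, one asking the small model for the label and one asking it for the LLM-extracted rationale, and the two task losses are combined into a single objective.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    label: str      # human or LLM-generated label
    rationale: str  # rationale extracted from an LLM


def build_multitask_pairs(example):
    """Turn one example into two (input, target) pairs, one per task.

    The "[label]" / "[rationale]" prefixes are hypothetical task markers.
    """
    return [
        ("[label] " + example.question, example.label),
        ("[rationale] " + example.question, example.rationale),
    ]


def combined_loss(label_loss, rationale_loss, rationale_weight=1.0):
    """Weighted sum of the label-prediction and rationale-generation losses.

    rationale_weight is an assumed knob for balancing the two tasks.
    """
    return label_loss + rationale_weight * rationale_loss
```

In a real training loop the two losses would come from the same small model evaluated on the two pairs; here they are plain numbers so the combination rule stays visible.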
Multimodal large-scale pretraining has shown impressive performance gains for unstructured data including language, image, audio, and video. Yet the scenario most prominent in real-world applications is a combination of structured (including tabular and time-series) and unstructured data, and it has been understudied. Towards this end, we propose LANISTR, a novel attention-based framework to learn from LANguage, Image, and STRuctured data. We introduce a new multimodal fusion module with a similarity-based multimodal masking loss that enables LANISTR to learn cross-modal relations from large-scale multimodal data with missing modalities during training and test time. On two publicly available MIMIC-IV and Amazon Product Review datasets, LANISTR achieves absolute improvements of 6.47% (AUROC) and 8.35% (accuracy), respectively, compared to the state-of-the-art multimodal models, while showing superior generalization capabilities.
SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch
Chun-Liang Li
Kihyuk Sohn
Transactions on Machine Learning Research (TMLR)(2023)
Semi-supervised anomaly detection is a common problem, as often the datasets containing anomalies are partially labeled. We propose a canonical framework: Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling (SPADE) that isn't limited by the assumption that labeled and unlabeled data come from the same distribution. Indeed, the assumption is often violated in many applications -- for example, the labeled data may contain only anomalies unlike unlabeled data, or unlabeled data may contain different types of anomalies, or labeled data may contain only 'easy-to-label' samples. SPADE utilizes an ensemble of one-class classifiers as the pseudo-labeler to improve the robustness of pseudo-labeling with distribution mismatch. Partial matching is proposed to automatically select the critical hyper-parameters for pseudo-labeling without validation data, which is crucial with limited labeled data. SPADE shows state-of-the-art semi-supervised anomaly detection performance across a wide range of scenarios with distribution mismatch in both tabular and image domains. In some common real-world settings, such as a model facing new types of unlabeled anomalies, SPADE outperforms the state-of-the-art alternatives by 5% AUC on average.
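To make the ensemble pseudo-labeler idea concrete, here is a minimal sketch. It assumes a unanimous-agreement rule and a single shared threshold, both of which are simplifications chosen for illustration; the abstract does not spell out the voting rule, and the paper's partial matching procedure for choosing thresholds is not reproduced. The function name pseudo_label is hypothetical.

```python
def pseudo_label(scores, threshold):
    """Assign a pseudo-label from an ensemble of one-class classifier scores.

    scores: anomaly scores for one sample, one per ensemble member.
    Returns 1 (anomalous) if every member scores above the threshold,
    0 (normal) if none does, and None (left unlabeled) when members disagree.
    The unanimous rule and single threshold are assumptions of this sketch.
    """
    votes = [score > threshold for score in scores]
    if all(votes):
        return 1
    if not any(votes):
        return 0
    return None
```

Leaving disagreements unlabeled is one way to keep noisy pseudo-labels out of the training set when labeled and unlabeled data follow different distributions.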
Modern large language models (LLMs) have demonstrated impressive capabilities at sophisticated tasks, often through step-by-step reasoning similar to humans. This is made possible by their strong few-shot and zero-shot abilities: they either learn from a handful of handcrafted, completed responses (“in context examples”), or are prompted to reason spontaneously through specially designed triggers. Nonetheless, few-shot performance is sensitive to the choice of the examples, for which artisanal hand-crafted selection would require extensive effort, and in some cases, it might not even be possible to obtain relevant examples a priori without expertise about the downstream tasks. On the other hand, while most general and handcrafting-free, zero-shot performance is limited by the lack of guidance to the LLM. To address this, we propose Consistency-based Self-adaptive Prompting (COSP), a novel prompt design method for LLMs. Requiring neither handcrafted responses nor ground-truth labels, COSP selects and builds the set of examples from the LLM’s own zero-shot outputs via carefully designed criteria combining consistency, diversity and repetition. In the zero-shot setting, with only LLM predictions, COSP significantly improves performance (up to 2× compared to zero-shot baselines and matching or exceeding few-shot baselines) in a range of reasoning tasks in 3 LLMs. Moreover, COSP can be generalized to the few-shot setting and can take advantage of a few labeled examples in an efficient way.
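The consistency criterion mentioned in this abstract can be sketched as follows. This is a simplified illustration, not the COSP method itself: it covers only consistency (measured here as the normalized entropy of repeated zero-shot answers) and leaves out the diversity and repetition criteria. The function names and the input format are assumptions made for the example.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Normalized entropy of sampled answers; 0.0 means all samples agree."""
    counts = Counter(answers)
    n = len(answers)
    if n <= 1 or len(counts) <= 1:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)  # log(n) is the maximum possible entropy


def select_demonstrations(samples, k):
    """Pick the k questions whose zero-shot answers are most self-consistent.

    samples: dict mapping each question to a list of sampled answers.
    Returns (question, majority_answer) pairs, lowest entropy first.
    """
    ranked = sorted(samples, key=lambda q: answer_entropy(samples[q]))
    return [(q, Counter(samples[q]).most_common(1)[0][0]) for q in ranked[:k]]
```

The selected pairs could then be placed in the prompt as in-context examples, which requires no handcrafted responses or ground-truth labels.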
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
Cheng-Yu Hsieh
Si-An Chen
Chun-Liang Li
Alexander Ratner
Ranjay Krishna
arXiv preprint arXiv:2308.00675(2023)
Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. Unfortunately, demonstrations are hard to acquire, and can result in undesirable biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones to provide. As tasks grow more complex, the selection search grows combinatorially and invariably becomes intractable. Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation, descriptions for the individual tool usage, over demonstrations. We substantiate our claim through three main empirical findings on 6 tasks across both vision and language modalities. First, on existing benchmarks, zero-shot prompts with only tool documentation are sufficient for eliciting proper tool usage, achieving performance on par with few-shot prompts. Second, on a newly collected realistic tool-use dataset with hundreds of available tool APIs, we show that tool documentation is significantly more valuable than demonstrations, with zero-shot documentation significantly outperforming few-shot without documentation. Third, we highlight the benefits of tool documentations by tackling image generation and video tracking using just-released unseen state-of-the-art models as tools. Finally, we highlight the possibility of using tool documentation to automatically enable new applications: by using nothing more than the documentation of GroundingDino, Stable Diffusion, XMem, and SAM, LLMs can re-invent the functionalities of the just-released Grounded-SAM and Track Anything models.
We propose a novel high-performance and interpretable canonical deep tabular data learning architecture, TabNet. TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning as the learning capacity is used for the most salient features. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of non-performance-saturated tabular datasets and yields interpretable feature attributions plus insights into the global model behavior. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, significantly improving performance with unsupervised representation learning when unlabeled data is abundant.
Real-world time-series datasets are often multivariate with complex dynamics. To capture this complexity, high capacity architectures like recurrent- or attention-based sequential deep learning models have become popular. However, recent work demonstrates that simple univariate linear models can outperform such deep learning models on several commonly used academic benchmarks. Extending them, in this paper, we investigate the capabilities of linear models for time-series forecasting and present Time-Series Mixer (TSMixer), a novel architecture designed by stacking multi-layer perceptrons (MLPs). TSMixer is based on mixing operations along both the time and feature dimensions to extract information efficiently. On popular academic benchmarks, the simple-to-implement TSMixer is comparable to specialized state-of-the-art models that leverage the inductive biases of specific benchmarks. On the challenging and large scale M5 benchmark, a real-world retail dataset, TSMixer demonstrates superior performance compared to the state-of-the-art alternatives. Our results underline the importance of efficiently utilizing cross-variate and auxiliary information for improving the performance of time series forecasting. We present various analyses to shed light on the capabilities of TSMixer. The design paradigms utilized in TSMixer are expected to open new horizons for deep learning-based time series forecasting. The implementation is available at: tsmixer.
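The phrase "mixing operations along both the time and feature dimensions" can be illustrated with a small pure-Python sketch. It is a rough approximation under stated assumptions rather than the published architecture: each mixing step is a single linear map followed by ReLU with a residual connection, and normalization, dropout and the output projection are omitted. The function mixer_block and its weight arguments are hypothetical names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mixer_block(x, w_time, w_feat):
    """One simplified mixing block.

    x: input as [time][feature]
    w_time: [time][time] weights shared across features
    w_feat: [feature][feature] weights shared across time steps
    """
    # Time mixing: transpose so each row holds one feature's history,
    # mix across time steps, then transpose back.
    mixed_t = transpose(relu(matmul(transpose(x), w_time)))
    x = add(x, mixed_t)  # residual connection
    # Feature mixing: each row is one time step; mix across features.
    mixed_f = relu(matmul(x, w_feat))
    return add(x, mixed_f)  # residual connection
```

With all-zero weights the block returns its input unchanged, which is a quick way to check that the residual paths and shapes are wired correctly.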
Multi-horizon prediction problems often contain a complex mix of inputs -- including static covariates, known future inputs, and other exogenous time series -- without any prior information on how they interact with the target. While several deep learning models have been proposed for multi-step prediction, they typically comprise black-box models which do not account for the full range of inputs present in common scenarios. In this paper, we introduce the Temporal Fusion Transformer (TFT) -- a novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. To learn temporal relationships at different scales, the TFT utilizes recurrent layers for local processing and interpretable self-attention layers for learning long-term dependencies. The TFT also utilizes specialized components for judicious selection of the relevant features, and a series of gating layers to suppress unnecessary components -- enabling high performance in a wide range of regimes. On a variety of real-world datasets, we demonstrate performance improvements over existing benchmarks, and showcase three practical interpretability use-cases of our model.
The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.
