Athena

Solving fundamental computational problems that deliver meaningful impact for Google’s products, society, and scientific progress.

About the team

Athena is an international team of research scientists and engineers who tackle product-inspired problems with novel solutions to assist, complement, empower, and inspire people — from the everyday to the imaginative. Our work spans algorithms, artificial intelligence (AI), language understanding, and many other fields, and yields state-of-the-art breakthroughs in areas like efficiency, privacy, and user engagement.

We collaborate closely with partners across Google to take discoveries from publication to implementation for the Company's largest and most trusted products. Beyond Google's portfolio of products and services, our contributions to AI, computer science and machine learning power scientific advances for climate science, journalism, microeconomics and other data-driven disciplines.

We recognize that AI is a foundational and transformational technology and are proud to contribute to a long history of responsible innovation. Our commitment to Responsible AI principles ensures that we develop and use technologies in ways that are socially beneficial, avoid bias, are built and tested for safety, are accountable to people, and are aligned with our values.

Team focus summaries

Graph-based learning

We extend machine learning approaches to better model the relationships contained in information networks. These models (e.g., semi-supervised similarity ranking & clustering, neural graph embedding, and graph convolutional approaches) are useful in a wide range of machine learning applications.

Market algorithms

We apply auction theory, mechanism design, and advanced algorithms to improve Ads and other market-based products.

Operations research

We apply integer programming, linear programming, constraint programming, and graph algorithms to solve problems at scale for transportation, search, natural language understanding, computer vision, robotics, and more.

Language

We advance the state of the art in natural language technologies and build systems that learn to understand and generate language in context.

Large-scale machine learning

We focus on large-scale machine learning, including supervised learning (e.g., deep learning and kernel-based learning) and semi-supervised and unsupervised learning (e.g., streaming clustering and efficient similarity search). Our research areas include distributed optimization, personalization and privacy-preserving learning, on-device learning and inference, recommendation systems, data-dependent hashing, and learning-based vision. We develop principled approaches and apply them to Google’s products. Our team regularly publishes in top-tier learning conferences and journals, and its work has been applied across Google, powering Search and Display Ads, YouTube, Android, Play, Gmail, Assistant, and Google Shopping.

Online clustering

We provide fast clustering that scales to datasets with billions of data points and sustains a streaming throughput of hundreds of thousands of points per second. The goal is to provide scalable nonparametric clustering without making simplistic generative assumptions, such as convexity of clusters, that are rarely true in practice. The team develops techniques that can handle drift in data distributions over time. These techniques are used in a large number of applications, including dynamic spam detection in multiple products and semantic expansion in NLP.

Modeling and data science

We sift through data to discover, understand, and model implicit signals in user behavior. We partner with Product Areas such as Ads, YouTube, and Android to add machine learning functionality to products across Google. Due to the open-ended nature of data mining, ongoing projects vary; they currently include smart notifications on Android, Ads pricing optimizations, differential privacy work, and more.

Structured data

The goals of the Structured Data group are to: 1) work closely with various product teams and leverage our expertise in structured data to solve challenging technical problems and initiate new product features; 2) provide scientific expertise in computational journalism across Google in the fight against digital misinformation; and 3) drive a long-term agenda that will advance state-of-the-art research in structured data with real-world impact.

Scalable matching

We develop techniques for large-scale similarity search in massive databases with arbitrary data types (sparse or dense high-dimensional data) and similarity measures (metric or non-metric, potentially learned from data). The focus has been on developing data-dependent, ML-based hashing techniques and tree-hash hybrids that drive a multitude of applications at Google. This team also develops techniques for fast inference in machine learning models, including neural networks, often improving speed by more than 50x while maintaining near-exact accuracy.

Speech and language algorithms

Our mission is to accurately and efficiently represent, combine, optimize, and search models of speech and text. In particular, we devise automata, grammars, neural models, and other models that represent word histories; context-dependent lexicons for speech and keyboard; written-to-spoken transductions and extractions of dates, times, currency, measures, etc.; and transliteration and contextual models of language. These can be combined and optimized to give high-accuracy, efficient speech recognition and synthesis, text normalization, and more. We provide efficient decoding algorithms to search these models. This work is used extensively in Google's speech and text processing infrastructure.

Sensitive content detection

Our mission is to create a comprehensive set of classifiers for detecting offensive, inappropriate & controversial content in images and video. We accomplish this using a variety of techniques, including ensembles of ML models that are trained on images and text from the web. We also apply transfer learning on deep vision models for domain-specific classifier creation.

Semi-supervised and unsupervised machine learning

Semi-supervised learning is increasingly critical to solving many real-world product problems where data is sparse, sparsely labeled, or noisy. We develop semi-supervised and unsupervised machine learning systems that operate at Google scale. We apply our research to a broad range of problems, including query understanding, conversation understanding, and media understanding.

ML model compression for mobile devices

We develop systems for transforming cloud-resident ML models to highly efficient models that run on resource-constrained mobile devices.

Media understanding in conversations

We enrich electronic conversations by understanding media using multi-modal signals from images, video, text, and the web. We accomplish this by marrying machine vision models with ML-enabled natural language understanding and generation systems.

Combinatorial machine learning

Many fundamental learning problems we solve at Google have non-trivial combinatorial structure that prevents the application of general purpose ML algorithms. They exhibit complex and discontinuous loss functions (e.g., in pricing) or combinatorial explosions (such as contextual bandits, feature selection, or integer programming) and may require solutions that are robust against strategic behavior. Our team pushes the boundaries in these areas through research that blends techniques from learning theory, game theory, and discrete/continuous optimization.

Glassbox

Glassbox Learning does R&D into making ML more controllable and interpretable, without sacrificing accuracy. An important line of research is how to translate policy goals about metrics and fairness into machine learning training. For interpretability, Glassbox provides end-to-end guarantees on the relationship of inputs to outputs, such as monotonicity and other shape constraints. To achieve these goals, Glassbox researches and utilizes new algorithms for constrained optimization.

Dataset search

Dataset Search, also known as Science Search, is a project to index all datasets on the web and to make the metadata (and, where possible, the data itself) searchable and useful. Datasets and related data tend to be spread across multiple data repositories on the web. In most cases, data is neither linked nor indexed, which makes searching tedious or, in some cases, impossible.

Featured publications

LaMDA: Language Models for Dialog Applications
Aaron Daniel Cohen
Alena Butryna
Alicia Jin
Apoorv Kulshreshtha
Ben Zevenbergen
Chung-ching Chang
Cosmo Du
Daniel De Freitas Adiwardana
Dehao Chen
Dmitry (Dima) Lepikhin
Erin Hoffman-John
Igor Krivokon
James Qin
Jamie Hall
Joe Fenton
Johnny Soraker
Kathy Meier-Hellstern
Maarten Paul Bosma
Marc Joseph Pickett
Marcelo Amorim Menegali
Marian Croak
Maxim Krikun
Noam Shazeer
Rachel Bernstein
Ravi Rajakumar
Ray Kurzweil
Romal Thoppilan
Steven Zheng
Taylor Bos
Toju Duke
Tulsee Doshi
Vincent Y. Zhao
Will Rusch
Yuanzhong Xu
arXiv (2022)
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.
Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
Liam Li
Kevin Jamieson
Ameet Talwalkar
Journal of Machine Learning Research, 18-185 (2018), pp. 1-52
Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce a novel algorithm, Hyperband, for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that Hyperband can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems.
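The "adaptive resource allocation and early-stopping" idea in the Hyperband abstract can be illustrated with a minimal sketch of successive halving, the early-stopping routine that Hyperband runs repeatedly with different budgets. This is not the full Hyperband algorithm, and the names successive_halving and toy_loss, the toy objective, and the parameter eta=3 are assumptions chosen for illustration only.

```python
import math
import random


def successive_halving(configs, evaluate, min_resource=1, eta=3):
    """Keep the best 1/eta of the configurations each round and give
    the survivors eta times more resource (illustrative sketch)."""
    survivors = list(configs)
    resource = min_resource
    while len(survivors) > 1:
        # Score every surviving configuration at the current budget
        # (lower loss is better).
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource))
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]  # early-stop the rest
        resource *= eta            # survivors get a larger budget
    return survivors[0]


def toy_loss(learning_rate, resource):
    """Hypothetical objective: best near learning_rate = 1e-2, and the
    loss shrinks as more resource (e.g. iterations) is spent."""
    return (math.log10(learning_rate) + 2) ** 2 + 1.0 / resource


rng = random.Random(0)
# Randomly sampled configurations, as in random search.
candidates = [10 ** rng.uniform(-4, 0) for _ in range(27)]
best = successive_halving(candidates, toy_loss)
print(best)
```

With 27 candidates and eta=3, the sketch evaluates 27, then 9, then 3 configurations, so most of the budget goes to the configurations that looked promising early.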
FNet: Mixing Tokens with Fourier Transforms
Ilya Eckstein
James Patrick Lee-Thorp
Joshua Ainslie
NAACL 2022 (Association for Computational Linguistics)
We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear transformations, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains nearly seven times faster on GPUs and twice as fast on TPUs. The resulting model, FNet, also scales very efficiently to long inputs. Specifically, when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, but is faster than the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
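As a rough illustration of what an "unparameterized Fourier Transform" mixing step could look like, the sketch below applies a discrete Fourier transform along the hidden dimension and then along the sequence dimension, and keeps the real part. It assumes that this 2D transform is the mixing operation; the function names dft and fourier_mix are hypothetical, and a naive O(n^2) DFT from the standard library is used for readability where a real model would use an FFT.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1D sequence."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fourier_mix(x):
    """Mix a (seq_len x hidden_dim) matrix of token embeddings with no
    learned parameters: DFT over hidden, then over sequence, real part."""
    seq_len, hidden = len(x), len(x[0])
    # Transform each token's embedding (hidden dimension).
    rows = [dft(row) for row in x]
    # Transform each hidden channel across tokens (sequence dimension).
    cols = [dft([rows[i][j] for i in range(seq_len)]) for j in range(hidden)]
    # Transpose back to (seq_len x hidden_dim) and drop the imaginary part.
    return [[cols[j][i].real for j in range(hidden)] for i in range(seq_len)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, hidden size 2
mixed = fourier_mix(tokens)
print(mixed)
```

Every output entry depends on every input entry, which is how the step mixes information across tokens without attention weights.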
Understanding Robustness of Transformers for Image Classification
Daliang Li
Thomas Unterthiner
Proceedings of the IEEE/CVF International Conference on Computer Vision (2021) (to appear)
Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture such as the use of non-overlapping patches lead one to wonder whether these networks are as robust. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.
SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer
Tu Vu
Rami Al-Rfou
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (2022)
There has been growing interest in parameter-efficient methods to apply pre-trained language models to downstream tasks. Building on the Prompt Tuning approach of Lester et al. (2021), which learns task-specific soft prompts to condition a frozen pre-trained model to perform different tasks, we propose a novel prompt-based transfer learning approach called SPoT: Soft Prompt Transfer. SPoT first learns a prompt on one or more source tasks and then uses it to initialize the prompt for a target task. We show that SPoT significantly boosts the performance of Prompt Tuning across many tasks. More remarkably, across all model sizes, SPoT matches or outperforms standard Model Tuning (which fine-tunes all model parameters) on the SuperGLUE benchmark, while using up to 27,000x fewer task-specific parameters. To understand where SPoT is most effective, we conduct a large-scale study on task transferability with 26 NLP tasks in 160 combinations, and demonstrate that many tasks can benefit each other via prompt transfer. Finally, we propose an efficient retrieval approach that interprets task prompts as task embeddings to identify similar tasks and predict the most transferable source tasks for a novel target task.
Deep Learning has revolutionized the fields of Computer Vision, Natural Language, Speech, Information Retrieval and more. However, with the growth of Deep Learning models, the number of parameters, the latency, and the resources required to train have all increased significantly. Consequently, it has become important to focus on the footprint of the model, not just its quality. We present and motivate the problem of efficiency in Deep Learning, followed by a thorough survey of the five core areas of model efficiency and the seminal work there. We also present an experiment-based guide for practitioners to optimize their models. We believe this is the first comprehensive survey in the Efficient Deep Learning space. Our hope is that this survey will provide the reader with both the mental model and the necessary understanding of the field, first to apply generic efficiency techniques and immediately get sizeable improvements, and second to find ideas for experimentation that achieve additional gains.
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients, which gives rise to the client drift phenomenon. In this work, we propose a general algorithmic framework, Mime, which i) mitigates client drift and ii) adapts arbitrary centralized optimization algorithms such as SGD and Adam to the federated learning setting. Mime uses a combination of control-variates and server-level statistics (e.g. momentum) at every client-update step to ensure that each local update mimics that of the centralized method run on iid data. We prove a reduction result showing that Mime can translate the convergence of a generic algorithm in the centralized setting into convergence in the federated setting. Further, we show for the first time that multiple local steps can lead to faster convergence in the cross-device FL setting. Our thorough theoretical and empirical analyses establish Mime's superiority over other baselines.
Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make transformers forget specific old facts and memorize new ones. In this paper, we propose a new task of explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts. This task is useful in many scenarios, such as updating stale knowledge, protecting privacy, and eliminating unintended biases stored in the models. We benchmarked several approaches that provide natural baseline performances on this task. This leads to the discovery of key components of a Transformer model that are especially effective for knowledge modifications. The work also provides insights into the role that different training phases (such as pretraining and fine-tuning) play towards memorization and knowledge modification.
We present AIST++, a new multi-modal dataset of 3D dance motion and music, along with FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion conditioned on music. The proposed AIST++ dataset contains 1.1M frames of 3D dance motion in 1408 sequences, covering 10 dance genres with multi-view videos with known camera poses; it is the largest dataset of this kind to our knowledge. We show that naively applying sequence models such as transformers to this dataset for the task of music conditioned 3D motion generation does not produce satisfactory 3D motion that is well correlated with the input music. We overcome these shortcomings by introducing key changes in its architecture design and supervision: FACT model involves a deep cross-modal transformer block with full-attention that is trained to predict N future motions. We empirically show that these changes are key factors in generating long sequences of realistic dance motion that is well-attuned to the input music. We conduct extensive experiments on AIST++ with user studies, where our method outperforms recent state-of-the-art methods both qualitatively and quantitatively.
