Machine Intelligence

Google is at the forefront of innovation in Machine Intelligence, with active research exploring virtually all aspects of machine learning, including deep learning and more classical algorithms. Exploring theory as well as application, much of our work on language, speech, translation, visual processing, ranking and prediction relies on Machine Intelligence. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, applying learning algorithms to understand and generalize.

Machine Intelligence at Google raises deep scientific and engineering challenges, allowing us to contribute to the broader academic research community through technical talks and publications in major conferences and journals. Contrary to much of current theory and practice, the statistics of the data we observe shifts rapidly, the features of interest change as well, and the volume of data often requires enormous computation capacity. When learning systems are placed at the core of interactive services in a fast changing and sometimes adversarial environment, combinations of techniques including deep learning and statistical models need to be combined with ideas from control and game theory.

Recent Publications

USM-SCD: USM-Based Multilingual Speaker Change Detection
Yongqiang Wang
Jason Pelecanos
Yu Zhang
Yiling Huang
Han Lu
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11801-11805
Preview abstract We introduce a multilingual speaker change detection model (USM- SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages. This model is adapted from a speech foundation model trained on a large quantity of supervised and unsupervised data, demonstrating the utility of fine-tuning from a large generic foundation model for a downstream task. We analyze the performance of this multilingual speaker change detection model through a series of ablation studies. We show that the USM-SCD model can achieve more than 75% average speaker change detection F1 score across a test set that consists of data from 96 languages. On American English, the USM-SCD model can achieve an 85.8% speaker change detection F1 score across various public and internal test sets, beating the previous monolingual baseline model by 21% relative. We also show that we only need to fine-tune one-quarter of the trainable model parameters to achieve the best model performance. The USM-SCD model exhibits state-of-the-art ASR quality compared with a strong public ASR baseline, making it suitable to handle both tasks with negligible additional computational cost. View details
Preview abstract Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing large language models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs. View details
Believing Anthropomorphism: Examining the Role of Anthropomorphic Cues on User Trust in Large Language Models
Michelle Cohn
Femi Olanubi
Zion Mengesha
Daniel Padgett
CM (Association of Computing Machinery) CHI conference on Human Factors in Computing Systems 2024 (2024)
Preview abstract People now regularly interface with Large Language Models (LLMs) via speech and text (e.g., Bard) interfaces. However, little is known about the relationship between how users anthropomorphize an LLM system (i.e., ascribe human-like characteristics to a system) and how they trust the information the system provides. Participants (n=2,165; ranging in age from 18-90 from the United States) completed an online experiment, where they interacted with a pseudo-LLM that varied in modality (text only, speech + text) and grammatical person (“I” vs. “the system”) in its responses. Results showed that the “speech + text” condition led to higher anthropomorphism of the system overall, as well as higher ratings of accuracy of the information the system provides. Additionally, the first-person pronoun (“I”) led to higher information accuracy and reduced risk ratings, but only in one context. We discuss these findings for their implications for the design of responsible, human–generative AI experiences. View details
Practical Performance Guarantees for Pipelined DNN Inference
Kuikui Liu
Proceedings of the 41st International Conference on Machine Learning (2024)
Preview abstract This work optimizes pipeline parallelism of machine learning model inference by partitioning computation graphs into $k$ stages and minimizing the running time of the bottleneck stage. We design practical algorithms for this NP-complete problem and prove they are nearly optimal in practice by comparing against lower bounds obtained from solving novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound techniques to production models to achieve substantial improvements in the approximation guarantees, compared to simple combinatorial lower bounds. For example, our new MIP formulations improve the lower bounds enough to drop the geometric mean approximation ratio from $2.175$ to $1.082$ across production data with $k=16$ pipeline stages. This work shows that while bottleneck partitioning is theoretically hard, in practice we have a handle on the algorithmic side of the problem and much of the remaining challenge is in developing more accurate cost models to give to the partitioning algorithms. View details
Conformal Risk Control
Anastasios N. Angelopoulos
Stephen Bates
Adam Fisch
Lihua Lei
ICLR (2024)
Preview abstract We extend conformal prediction to control the expected value of any monotone loss function. The algorithm generalizes split conformal prediction together with its coverage guarantee. Like conformal prediction, the conformal risk control procedure is tight up to an O(1/n) factor. Worked examples from computer vision and natural language processing demonstrate the usage of our algorithm to bound the false negative rate, graph distance, and token-level F1-score. View details
Visual Program Tuning: Training Large Multimodal Models to Reason like Programs
Yushi Hu
Krishna Viswanathan
Kenji Hata
Enming Luo
Ranjay Krishna
Ariel Fuxman
Conference on Computer Vision and Pattern Recognition (2024)
Preview abstract Solving complex visual tasks (e.g., “Who invented the musical instrument on the right?”) involves back-and-forth between visual processing and reasoning. Visual programming is a recent multimodal framework that has shown promise in conducting visual reasoning in an interpretable and compositional manner. However, this framework is error-prone—it can lead to a wrong answer whenever the program itself is wrong, or when any of the steps of the program are solved incorrectly, thus leading to worse overall performance than end-to-end systems trained with labeled data. Moreover, it is inefficient to involve multiple steps (i.e., generating and then running programs) during inference. Ideally, a single large multimodal model (LMM) should directly conduct similar reasoning and yield the correct answer. In this work, we propose Visual Program Tuning (VPT), which leverages visual programs for teaching LLMs to reason via instruction tuning. VPT rewrites the execution traces of visual programs as chain-of-thought reasoning steps, and tunes an LMM to output not only the label but its reasoning as well. Extensive experiments on complex vision tasks show that models trained with VPT achieve state-of-the-art accuracy while being able to produce interpretable and faithful reasoning steps. PaLI-X + VPT outperforms all existing LMMs on a wide range of visual tasks, improving performance on counting, spatial relations, and compositional reasoning tasks. VPT is also helpful for quick adaptation on new tasks. Our experiments on content moderation show that fine-tuning LMMs with program-augmented examples is more sample efficient than traditional supervised training. View details