Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
1 - 15 of 10133 publications
Enhancing Trust and Safety in Digital Payments: An LLM-Powered Approach
Anant Modwal
Govind Kaushal
Ramanan Balakrishnan
Shanay Shah
Monu Agrawal
Justin Lin
Prakash Hariramani
Priya Mandawat
Rutvik Karve
Naveen Madiraju
Preview abstract
Digital payment systems have revolutionized financial transactions, offering unparalleled convenience and accessibility to users worldwide. However, the increasing popularity of these platforms has also attracted malicious actors seeking to exploit their vulnerabilities for financial gain. To address this challenge, robust and adaptable scam detection mechanisms are crucial for maintaining the trust and safety of digital payment ecosystems. This paper presents a comprehensive approach to scam detection, focusing on the Unified Payments Interface (UPI) in India, with Google Pay (GPay) as a specific use case. The approach leverages Large Language Models (LLMs) to enhance scam classification accuracy and designs a digital assistant to aid human reviewers in identifying and mitigating fraudulent activities. The results demonstrate the potential of LLMs in augmenting existing machine learning models and improving the efficiency, accuracy, quality, and consistency of scam reviews, ultimately contributing to a safer and more secure digital payment landscape. Our evaluation of the Gemini Ultra model on curated transaction data showed 93.33% accuracy in scam classification. Furthermore, the model demonstrated 89% accuracy in generating reasoning for these classifications. Promisingly, the model identified 32% new accurate reasons for suspected scams that human reviewers had not included in the review notes.
View details
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
Sadeep Jayasumana
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Preview abstract
As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, does not reflect gradual improvement of iterative text-to-image models, does not capture distortion levels, and produces inconsistent results when varying the sample size. We also propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
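To make the maximum mean discrepancy idea concrete, here is a minimal sketch of the standard unbiased MMD² estimator with a Gaussian RBF kernel. It is not the authors' CMMD implementation: the function names, the default bandwidth `sigma`, and the toy one-dimensional "embeddings" are assumptions for illustration, and the abstract specifies CLIP embeddings as the input.

```python
import math


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, sigma=1.0):
    """Unbiased estimate of squared MMD between two sets of vectors.

    xs, ys: lists of equal-length float vectors (hypothetical embeddings).
    sigma:  kernel bandwidth (an assumed default, not a value from the paper).
    """
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample set needs at least two vectors")
    # Within-set terms exclude i == j, which is what makes the estimate unbiased.
    k_xx = sum(rbf_kernel(xs[i], xs[j], sigma)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(rbf_kernel(ys[i], ys[j], sigma)
               for i in range(n) for j in range(n) if i != j)
    # The cross term uses every (x, y) pair.
    k_xy = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


# Toy data: a reference set, a nearby set, and a distant set.
reference = [[0.0], [0.1], [0.2]]
nearby = [[0.05], [0.15], [0.25]]
distant = [[5.0], [5.1], [5.2]]

mmd_near = mmd2_unbiased(reference, nearby)
mmd_far = mmd2_unbiased(reference, distant)
print(mmd_near, mmd_far)
```

Because the estimator is unbiased, it can come out slightly negative when the two sets are very similar. The distant set should score much higher than the nearby one.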
View details
Generalized Power Attacks against Crypto Hardware using Long-Range Deep Learning
Karel Král
Marina Zhang
Transactions on Cryptographic Hardware and Embedded Systems (TCHES), IACR (2024)
Preview abstract
To make cryptographic processors more resilient against side-channel attacks, engineers have developed various countermeasures. However, the effectiveness of these countermeasures is often uncertain, as it depends on the complex interplay between software and hardware. Assessing a countermeasure’s effectiveness using profiling techniques or machine learning has so far required significant expertise and effort to adapt to new targets, which makes those assessments expensive. We argue that including cost-effective automated attacks will help chip design teams to quickly evaluate their countermeasures during the development phase, paving the way to more secure chips. In this paper, we lay the foundations toward such an automated system by proposing GPAM, the first deep-learning system for power side-channel analysis that generalizes across multiple cryptographic algorithms, implementations, and side-channel countermeasures without the need for manual tuning or trace preprocessing. We demonstrate GPAM’s capability by successfully attacking four hardened hardware-accelerated elliptic-curve digital-signature implementations. We showcase GPAM’s ability to generalize across multiple algorithms by attacking a protected AES implementation and achieving comparable performance to state-of-the-art attacks, but without manual trace curation and within a limited budget. We release our data and models as an open-source contribution to allow the community to independently replicate our results and build on them.
View details
Specifying BGP using TLA+
Aman Shaikh
(2024)
Preview abstract
This presentation is about the TLA+ specification we have written for BGP, the routing protocol underpinning the Internet. The specification also serves as a crucial first step towards the use of TLA+ for verification of network designs.
View details
Preview abstract
In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain generalisation shortcoming exhibited by state-of-the-art no-reference speech quality metrics. In the paper we: (i) illustrate the problem of L2 loss training failing to capture the continuous nature of the mean opinion score (MOS) labels; (ii) demonstrate the lack of generalisation through a benchmarking evaluation across several speech domains; (iii) outline our approach and explore the impact of the architectural design decisions through incremental evaluation; (iv) evaluate the final model against state-of-the-art models for a wide variety of data and domains. The results show that the lack of generalisation observed in state-of-the-art speech quality metrics is addressed by SCOREQ. We conclude that using a triplet loss function
View details
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Hiroki Furuta
Ofir Nachum
Yutaka Matsuo
Shane Gu
Izzeddin Gur
International Conference on Learning Representations (ICLR) (2024)
Preview abstract
This paper discusses a method to inject text when training an ASR system without the need for up sampling the text sequence to match the length of the speech sequence.
View details
Geographical accessibility to emergency obstetric care in urban Nigeria using closer-to-reality travel time estimates
Aduragbemi Banke-Thomas
Kerry L. M. Wong
Tope Olubodun
Peter M. Macharia
Narayanan Sundararajan
Yash Shah
Mansi Kansal
Swapnil Vispute
Olakunmi Ogunyemi
Uchenna Gwacham-Anisiobi
Jia Wang
Ibukun-Oluwa Omolade Abejirinde
Prestige Tatenda Makanga
Ngozi Azodoh
Charles Nzelu, PhD
Charlotte Stanton
Bosede B. Afolabi
Lenka Beňová
Lancet Global Health (2024)
Preview abstract
Background
Better accessibility of comprehensive emergency obstetric care (CEmOC) facilities can significantly reduce maternal and perinatal deaths. However, pregnant women living in urban settings face additional complex challenges travelling to facilities. We estimated geographical accessibility and coverage to the nearest, second nearest, and third nearest public and private CEmOC facilities in the 15 largest Nigerian cities.
Methods
We mapped city boundaries, verified and geocoded functional CEmOC facilities, and assembled population distribution for women of childbearing age (WoCBA). We used Google Maps Platform’s internal Directions Application Programming Interface (API) to derive driving times to public, private, or either facility-type. Median travel time (MTT) and percentage of WoCBA able to reach care were summarised for eight traffic scenarios (peak and non-peak hours on weekdays and weekends) by city and within-city (wards) under different travel time thresholds (<15, <30, <60 min).
Findings
City-level MTT to the nearest CEmOC facility ranged from 18 min (Maiduguri) to 46 min (Kaduna). Within cities, MTT varied by location, with informal settlements and peripheral areas being the worst off. The percentage of WoCBA within 60 min of their nearest public CEmOC was nearly universal, whilst the percentage of WoCBA within 30 min of their nearest public CEmOC ranged from 33% in Aba to over 95% in Ilorin and Maiduguri. During peak traffic times, the median number of public CEmOC facilities reachable by WoCBA in under 30 min was zero in eight of 15 cities.
Interpretation
This approach provides more context-specific, finer, and policy-relevant evidence to support improving CEmOC service accessibility in urban Africa.
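The two summary measures named in the Methods, median travel time and the percentage of WoCBA able to reach care under a threshold, can be sketched as follows. This is an illustrative sketch only: the population-weighted median, the function names, and the sample figures are assumptions, not the study's method or data.

```python
def weighted_median(values, weights):
    """Smallest value at which cumulative weight reaches half the total."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= total / 2.0:
            return value


def coverage_percent(travel_minutes, population, threshold):
    """Percentage of the population whose travel time is below the threshold."""
    total = sum(population)
    if total <= 0:
        raise ValueError("total population must be positive")
    reached = sum(p for t, p in zip(travel_minutes, population) if t < threshold)
    return 100.0 * reached / total


# Hypothetical wards: driving time to the nearest facility (minutes) and WoCBA count.
travel_minutes = [10, 25, 40, 70]
wocba = [100, 300, 400, 200]

mtt = weighted_median(travel_minutes, wocba)
coverage = {limit: coverage_percent(travel_minutes, wocba, limit)
            for limit in (15, 30, 60)}
print(mtt, coverage)  # 40 {15: 10.0, 30: 40.0, 60: 80.0}
```

Running the same two functions once per traffic scenario would give the per-scenario summaries described in the abstract.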
View details
Towards Generalist Biomedical AI
Danny Driess
Andrew Carroll
Chuck Lau
Ryutaro Tanno
Ira Ktena
Anil Palepu
Basil Mustafa
Aakanksha Chowdhery
Simon Kornblith
Philip Mansfield
Sushant Prakash
Renee Wong
Sunny Virmani
Sara Mahdavi
Bradley Green
Ewa Dominowska
Joelle Barral
Karan Singhal
Pete Florence
NEJM AI (2024)
Preview abstract
BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery.
METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports.
RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility.
CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems.
View details
Stable quantum-correlated many-body states through engineered dissipation
Xiao Mi
Alexios Michailidis
Sara Shabani
Jerome Lloyd
Rajeev Acharya
Igor Aleiner
Trond Andersen
Markus Ansmann
Frank Arute
Kunal Arya
Juan Atalaya
Gina Bortoli
Alexandre Bourassa
Leon Brill
Michael Broughton
Bob Buckley
Tim Burger
Nicholas Bushnell
Jimmy Chen
Benjamin Chiaro
Desmond Chik
Charina Chou
Josh Cogan
Roberto Collins
Paul Conner
William Courtney
Alex Crook
Ben Curtin
Alejo Grajales Dau
Dripto Debroy
Agustin Di Paolo
Ilya Drozdov
Andrew Dunsworth
Lara Faoro
Edward Farhi
Reza Fatemi
Vinicius Ferreira
Ebrahim Forati
Brooks Foxen
Élie Genois
William Giang
Dar Gilboa
Raja Gosula
Steve Habegger
Michael Hamilton
Monica Hansen
Sean Harrington
Paula Heu
Markus Hoffmann
Trent Huang
Ashley Huff
Bill Huggins
Sergei Isakov
Justin Iveland
Cody Jones
Pavol Juhas
Kostyantyn Kechedzhi
Marika Kieferova
Alexei Kitaev
Andrey Klots
Alexander Korotkov
Fedor Kostritsa
John Mark Kreikebaum
Dave Landhuis
Pavel Laptev
Kim Ming Lau
Lily Laws
Joonho Lee
Kenny Lee
Yuri Lensky
Alexander Lill
Wayne Liu
Orion Martin
Amanda Mieszala
Shirin Montazeri
Alexis Morvan
Ramis Movassagh
Wojtek Mruczkiewicz
Charles Neill
Ani Nersisyan
Michael Newman
JiunHow Ng
Murray Ich Nguyen
Tom O'Brien
Alex Opremcak
Andre Petukhov
Rebecca Potter
Leonid Pryadko
Charles Rocque
Negar Saei
Kannan Sankaragomathi
Henry Schurkus
Christopher Schuster
Mike Shearn
Aaron Shorter
Noah Shutty
Vladimir Shvarts
Jindra Skruzny
Clarke Smith
Rolando Somma
George Sterling
Doug Strain
Marco Szalay
Alfredo Torres
Guifre Vidal
Cheng Xing
Jamie Yao
Ping Yeh
Juhwan Yoo
Grayson Young
Yaxing Zhang
Ningfeng Zhu
Jeremy Hilton
Anthony Megrant
Yu Chen
Vadim Smelyanskiy
Dmitry Abanin
Science, 383 (2024), pp. 1332-1337
Preview abstract
Engineered dissipative reservoirs have the potential to steer many-body quantum systems toward correlated steady states useful for quantum simulation of high-temperature superconductivity or quantum magnetism. Using up to 49 superconducting qubits, we prepared low-energy states of the transverse-field Ising model through coupling to dissipative auxiliary qubits. In one dimension, we observed long-range quantum correlations and a ground-state fidelity of 0.86 for 18 qubits at the critical point. In two dimensions, we found mutual information that extends beyond nearest neighbors. Lastly, by coupling the system to auxiliaries emulating reservoirs with different chemical potentials, we explored transport in the quantum Heisenberg model. Our results establish engineered dissipation as a scalable alternative to unitary evolution for preparing entangled many-body states on noisy quantum processors.
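For reference, the transverse-field Ising model mentioned above is commonly written in the following textbook form. The symbols $J$ (coupling strength) and $g$ (transverse field) and the sign convention are assumptions here; the paper's own parametrization may differ.

```latex
H = -J \sum_{\langle i, j \rangle} Z_i Z_j \; - \; g \sum_{i} X_i
```

Here $Z_i$ and $X_i$ are Pauli operators on qubit $i$, and $\langle i, j \rangle$ runs over nearest-neighbor pairs. In this convention, the one-dimensional model has its quantum critical point at $g = J$.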
View details
Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction
Babak Behsaz
Zachary Ryan McCaw
Davin Hill
Robert Luben
Dongbing Lai
John Bates
Howard Yang
Tae-Hwi Schwantes-An
Yuchen Zhou
Anthony Khawaja
Andrew Carroll
Brian Hobbs
Michael Cho
Nature Genetics (2024)
Preview abstract
Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very few labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD: spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE embeddings are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE embeddings contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.
View details
An intentional approach to managing bias in embedding models
Atilla P. Kiraly
Jungyeon Park
Rory Pilgrim
Charles Lau
Heather Cole-Lewis
Shravya Shetty
Krish Eswaran
Leo Anthony Celi
The Lancet Digital Health, 6 (2024), E126-E130
Preview abstract
Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how to best design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components—GPPEs—from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended.
View details
Visual Program Tuning: Training Large Multimodal Models to Reason like Programs
Yushi Hu
Krishna Viswanathan
Kenji Hata
Enming Luo
Ranjay Krishna
Ariel Fuxman
Conference on Computer Vision and Pattern Recognition (2024)
Preview abstract
Solving complex visual tasks (e.g., “Who invented the musical instrument on the right?”) involves back-and-forth between visual processing and reasoning. Visual programming is a recent multimodal framework that has shown promise in conducting visual reasoning in an interpretable and compositional manner. However, this framework is error-prone—it can lead to a wrong answer whenever the program itself is wrong, or when any of the steps of the program are solved incorrectly, thus leading to worse overall performance than end-to-end systems trained with labeled data. Moreover, it is inefficient to involve multiple steps (i.e., generating and then running programs) during inference. Ideally, a single large multimodal model (LMM) should directly conduct similar reasoning and yield the correct answer.
In this work, we propose Visual Program Tuning (VPT), which leverages visual programs for teaching LMMs to reason via instruction tuning. VPT rewrites the execution traces of visual programs as chain-of-thought reasoning steps, and tunes an LMM to output not only the label but its reasoning as well. Extensive experiments on complex vision tasks show that models trained with VPT achieve state-of-the-art accuracy while being able to produce interpretable and faithful reasoning steps. PaLI-X + VPT outperforms all existing LMMs on a wide range of visual tasks, improving performance on counting, spatial relations, and compositional reasoning tasks. VPT is also helpful for quick adaptation to new tasks. Our experiments on content moderation show that fine-tuning LMMs with program-augmented examples is more sample efficient than traditional supervised training.
View details
Resolving Code Review Comments with Machine Learning
Alexander Frömmgen
Peter Choy
Elena Khrapko
Marcus Revaj
2024 IEEE/ACM 46th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (to appear)
Preview abstract
Code reviews are a critical part of the software development process, taking a significant amount of the code authors’ and the code reviewers’ time. As part of this process, the reviewer inspects the proposed code and asks the author for code changes through comments written in natural language. At Google, we see millions of reviewer comments per year, and authors require an average of ∼60 minutes of active shepherding time between sending changes for review and finally submitting the change. In our measurements, the active work time that the code author must devote to addressing reviewer comments grows almost linearly with the number of comments. However, with machine learning (ML), we have an opportunity to automate and streamline the code-review process, e.g., by proposing code changes based on a comment’s text.
We describe our application of recent advances in large sequence models in a real-world setting to automatically resolve code-review comments in the day-to-day development workflow at Google. We present the evolution of this feature from an asynchronous generation of suggested edits after the reviewer sends feedback, to an interactive experience that suggests code edits to the reviewer at review time. In deployment, code-change authors at Google address 7.5% of all reviewer comments by applying an ML-suggested edit. At Google scale, this will reduce the time spent on code reviews by hundreds of thousands of engineer hours annually. Unsolicited, very positive feedback highlights that ML-suggested code edits increase Googlers’ productivity and allow them to focus on more creative and complex tasks.
View details