Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
1 - 15 of 10093 publications
Taming Self-Training for Open-Vocabulary Object Detection
Shiyu Zhao
Samuel Schulter
Zhixing Zhang
Vijay Kumar B G
Yumin Suh
Manmohan Chandraker
Dimitris N. Metaxas
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Preview abstract
Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det that tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection head into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions, which stabilizes the training process. Extensive experiments demonstrate SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively.
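The periodic update strategy mentioned in this abstract can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the SAS-Det implementation: the function `self_train`, the dictionary "models", and `toy_step` are hypothetical stand-ins, and the teacher is refreshed by copying the student every `update_period` steps.

```python
import copy

def self_train(student, batches, student_step, update_period):
    """Toy self-training loop: the teacher changes only every `update_period` steps."""
    teacher = copy.deepcopy(student)          # frozen copy used to produce pseudo labels
    num_teacher_updates = 0
    for step, batch in enumerate(batches, start=1):
        student_step(student, teacher, batch)  # student learns from teacher pseudo labels
        if step % update_period == 0:
            teacher = copy.deepcopy(student)   # periodic refresh of the teacher
            num_teacher_updates += 1
    return teacher, num_teacher_updates

# Hypothetical stand-in for one training step; it records which teacher
# state produced the pseudo labels for each batch.
history = []

def toy_step(student, teacher, batch):
    history.append(teacher["w"])
    student["w"] += 1.0                        # placeholder for a gradient step

student = {"w": 0.0}
teacher, num_updates = self_train(student, range(6), toy_step, update_period=3)
```

In this toy run the pseudo-label source changes only once during the six steps, which is the kind of reduced distribution change the abstract describes.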
Believing Anthropomorphism: Examining the Role of Anthropomorphic Cues on User Trust in Large Language Models
Michelle Cohn
Femi Olanubi
Zion Mengesha
Daniel Padgett
ACM (Association for Computing Machinery) CHI Conference on Human Factors in Computing Systems 2024 (2024)
Preview abstract
People now regularly interface with Large Language Models (LLMs) via speech and text interfaces (e.g., Bard). However, little is known about the relationship between how users anthropomorphize an LLM system (i.e., ascribe human-like characteristics to a system) and how they trust the information the system provides. Participants (n=2,165; aged 18-90, from the United States) completed an online experiment in which they interacted with a pseudo-LLM that varied in modality (text only, speech + text) and grammatical person (“I” vs. “the system”) in its responses. Results showed that the “speech + text” condition led to higher anthropomorphism of the system overall, as well as higher ratings of accuracy of the information the system provides. Additionally, the first-person pronoun (“I”) led to higher information accuracy and reduced risk ratings, but only in one context. We discuss the implications of these findings for the design of responsible, human–generative AI experiences.
Websites Need Your Permission Too – User Sentiment and Decision Making on Web Permission Prompts in Desktop Chrome
Marian Harbach
CHI 2024, ACM (to appear)
Preview abstract
The web utilizes permission prompts to moderate access to certain capabilities. We present the first investigation of user behavior and sentiment of this security and privacy measure on the web, using 28 days of telemetry data from more than 100M Chrome installations on desktop platforms and experience sampling responses from 25,706 Chrome users. Based on this data, we find that ignoring and dismissing permission prompts are most common for geolocation and notifications. Permission prompts are perceived as more annoying and interrupting when they are not allowed, and most respondents cite a rational reason for the decision they took. Our data also supports that the perceived availability of contextual information from the requesting website is associated with allowing access to a requested capability. More usable permission controls could facilitate adoption of best practices that address several of the identified challenges; and ultimately could lead to better user experiences and a safer web.
Preview abstract
Recent efforts to address hallucinations in Large Language Models (LLMs) have focused on attributed text generation, which supplements generated texts with citations of supporting sources for post-generation fact-checking and corrections. Yet, these citations often point to entire documents or paragraphs, burdening users with extensive verification work. In this paper, we introduce a locally-attributable text generation approach, prioritizing concise attributions. Our method, named "Attribute First, then Generate", breaks down the conventional end-to-end generation process into three intuitive steps: content selection, sentence planning, and sequential sentence generation. By initially identifying relevant source segments ("select first") and then conditioning the generation process on them ("then generate"), we ensure these segments also act as the output's fine-grained attributions ("select" becomes "attribute"). Tested on Multi-document Summarization and Long-form Question-answering, our method not only yields more concise citations than the baselines but also maintains, and in some cases enhances, both generation quality and attribution accuracy. Furthermore, it significantly reduces the time required for fact verification by human assessors.
Multimodal Modeling for Spoken Language Identification
Shikhar Bharadwaj
Sriram (Sri) Ganapathy
Sid Dalmia
Wei Han
Yu Zhang
Proceedings of 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024) (2024)
Preview abstract
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however, in the case of video data, there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which explores the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description, and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.
Preview abstract
2022 marked the 50th anniversary of memory safety vulnerabilities, first reported by Anderson et al. Half a century later, we are still dealing with memory safety bugs despite substantial investments to improve memory-unsafe languages.
Like others', Google’s data and internal vulnerability research show that memory safety bugs are widespread and one of the leading causes of vulnerabilities in memory-unsafe codebases. Those vulnerabilities endanger end users, our industry, and the broader society.
At Google, we have decades of experience addressing, at scale, large classes of vulnerabilities that were once as prevalent as memory safety issues. Based on this experience, we expect that high-assurance memory safety can only be achieved via a Secure-by-Design approach centered around comprehensive adoption of languages with rigorous memory safety guarantees.
We see no realistic path for an evolution of C++ into a language with rigorous memory safety guarantees that include temporal safety. As a consequence, we are considering a gradual transition of C++ code at Google towards other languages that are memory safe.
Given the large volume of pre-existing C++, we believe it is nonetheless necessary to improve the safety of C++ to the extent practicable. We are considering transitioning to a safer C++ subset, augmented with hardware security features like MTE.
Knowledge Distillation with Perturbed Loss: From a Vanilla Teacher to a Proxy Teacher
Rongzhi Zhang
Chao Zhang
Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2024), ACM, pp. 4278 - 4289
Preview abstract
Knowledge distillation is a popular technique to transfer knowledge from a large teacher model to a small student model. Typically, the student learns to imitate the teacher by minimizing the KL divergence of its output distribution with the teacher's output distribution. In this work, we argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's output distribution and the ground truth label distribution. Therefore, forcing the student to blindly imitate the unreliable teacher output distribution leads to inferior performance. To this end, we propose a novel knowledge distillation objective PTLoss by first representing the vanilla KL-based distillation loss function via a Maclaurin series and then perturbing the leading-order terms in this series. This perturbed loss implicitly transforms the original teacher into a proxy teacher with a distribution closer to the ground truth distribution. We establish the theoretical connection between this "distribution closeness" and the student model generalizability, which enables us to select the PTLoss's perturbation coefficients in a principled way. Extensive experiments on six public benchmark datasets demonstrate the effectiveness of PTLoss with teachers of different scales.
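For readers unfamiliar with the baseline objective this abstract starts from, the vanilla KL-based distillation loss can be sketched as follows. This is a minimal illustration, not PTLoss itself: the function names, the `temperature` parameter, and the choice of KL(teacher || student) as the direction are assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Vanilla distillation loss: KL between teacher and student outputs (assumed direction)."""
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, q_student)

# The loss vanishes when the student matches the teacher and grows as they diverge.
same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = distillation_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```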
Large Scale K-Clustering
ACM Transactions on Knowledge Discovery from Data (2024)
Preview abstract
Large-scale learning algorithms are essential for modern data collections that may have billions of data points. Here we study the design of parallel $k$-clustering algorithms, which include the $k$-median, $k$-medoids, and $k$-means clustering problems. We design efficient parallel algorithms for these problems and prove that they still compute constant-factor approximations to the optimal solution for stable clustering instances. In addition to our theoretical results, we present computational experiments showing that our $k$-median and $k$-means algorithms work well in practice: we are able to find better clusterings than state-of-the-art coreset constructions using samples of the same size.
Preview abstract
Given a training data-set $\mathcal{S}$, and a reference data-set $\mathcal{T}$, we design a simple and efficient algorithm to reweigh the loss function such that the limiting distribution of the neural network weights that result from training on $\mathcal{S}$ approaches the limiting distribution that would have resulted by training on $\mathcal{T}$. Such reweighing can be used to correct for Train-Test distribution shift, when we don't have access to the labels of $\mathcal{T}$. It can also be used to perform (soft) multi-criteria optimization on neural nets, when we have access to the labels of $\mathcal{T}$, but $\mathcal{S}$ and $\mathcal{T}$ have few common points.
As a motivating application, we train a graph neural net to recognize small molecule binders to MNK2 (a MAP Kinase, responsible for cell signaling) which are non-binders to MNK1 (a very similar protein), even in the absence of training data common to both data-sets. We are able to tune the reweighing parameters so that overall change in holdout loss is negligible, but the selectivity, i.e., the fraction of top 100 MNK2 binders that are MNK1 non-binders, increases from 54\% to 95\%, as a result of our reweighing.
We expect the algorithm to be applicable in other settings as well, since we prove that when the metric entropy of the input data-sets is bounded, our random sampling based greedy algorithm outputs a close to optimal reweighing, i.e., the two invariant distributions of network weights will be provably close in total variation distance.
Beyond dashboards: LLM-powered insights for next generation of business intelligence
AIM Research (2024)
Preview abstract
The article delves into the promise of AI in business intelligence. It briefly reviews the evolution of BI and various Cloud tools, followed by the paradigm shift in how data is consumed. While AI brings huge potential, the article covers areas where enterprises must exercise caution when building intelligent agents to answer data questions.
Preview abstract
Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: stochastic trees and deterministic decision processes with a finite horizon.
General Identifiability and Achievability for Causal Representation Learning
Burak Varici
Emre Acarturk
Ali Tajer
AISTATS 2024 (Oral), Oral Talk at NeurIPS Causal Representation Learning Workshop 2023. (2024)
Preview abstract
This paper focuses on causal representation learning (CRL) under a general nonparametric latent causal model and a general transformation model that maps the latent data to the observational data. It establishes identifiability and achievability results using two hard uncoupled interventions per node in the latent causal graph. Notably, one does not know which pair of intervention environments have the same node intervened (hence, uncoupled). For identifiability, the paper establishes that perfect recovery of the latent causal model and variables is guaranteed under uncoupled interventions. For achievability, an algorithm is designed that uses observational and interventional data and recovers the latent causal model and variables with provable guarantees. This algorithm leverages score variations across different environments to estimate the inverse of the transformer and, subsequently, the latent variables. The analysis, additionally, recovers the identifiability result for two hard coupled interventions, that is, when metadata about the pair of environments that have the same node intervened is known. This paper also shows that when observational data is available, additional faithfulness assumptions that are adopted by the existing literature are unnecessary.
Towards Conversational Diagnostic AI
Anil Palepu
Khaled Saab
Jan Freyberg
Ryutaro Tanno
Amy Wang
Brenna Li
Nenad Tomašev
Karan Singhal
Le Hou
Albert Webson
Kavita Kulkarni
Sara Mahdavi
Juro Gottweis
Joelle Barral
Kat Chou
arXiv (2024) (to appear)
Preview abstract
At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
Preview abstract
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.
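The idea of treating images that share a caption as positive pairs can be sketched with a simplified multi-positive contrastive loss. This is an illustrative assumption, not the SynCLR training code: the function names, the `temperature` value, and the uniform target over positives are choices made for the example, and embeddings are plain Python lists.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Cross-entropy between a uniform target over same-caption images and a
    softmax over similarities to all other images (a simplified sketch)."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if caption_ids[j] == caption_ids[i]]
        if not positives:
            continue  # an image with no same-caption partner contributes nothing
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature for j in others}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Four toy "image embeddings": the first two and the last two point the same way.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = multi_positive_contrastive_loss(embeddings, [0, 0, 1, 1])
misaligned = multi_positive_contrastive_loss(embeddings, [0, 1, 0, 1])
```

The loss is low when images generated from the same caption have similar embeddings and high when the caption grouping disagrees with the embedding geometry.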
Preview abstract
Cloud computing architectures are scalable and economical, which is the main reason for their popularity. However, they bring their own challenges for workload scheduling and resource utilization, because virtual machines (VMs) and applications have to share different types of resources such as servers and storage. Historically, workload balancing and resource management have relied on manual configuration or simplistic heuristics that do not optimize resource usage and performance effectively. In this technical brief, we propose an approach that uses unsupervised learning techniques to detect usage patterns and improve resource utilization, aiming for both better performance and an automatically balanced workload among VMs. We use clustering algorithms to group similar workloads and then allocate resources to each group based on demand. The goal of this step is to use resources more effectively and avoid resource exhaustion. We also integrate anomaly detection methods into our system to identify and handle abnormal behavior in both resource monitoring and resource placement. We experiment with region traces from production workloads to demonstrate the benefits of our approach, showing marked improvements in workload balancing and resource utilization over current practices.