Badih Ghazi

I am a Research Scientist in the Algorithms & Optimization Team at Google. Here's a link to my personal webpage.

Authored Publications
  
  
  
    
    
  
      
        
        
    
    
        
          
            
How Unique is Whose Web Browser? The role of demographics in browser fingerprinting
Pritish Kamath, Robin Lassonde
2025
          
          
        
        
        
          
          
          
Web browser fingerprinting can be used to identify and track users across the Web, even without cookies, by collecting attributes from users' devices to create unique "fingerprints". This technique and the resulting privacy risks have been studied for over a decade. Yet further research is limited because prior studies did not openly publish their data; in addition, the data in prior studies was biased and lacked user demographics.

Here we publish a first-of-its-kind open dataset that includes browser attributes along with users' demographics, collected from 8,400 US study participants with their informed consent. As part of our data collection, we also ran an experiment to study what affects users' likelihood of sharing browser data for open research, in order to inform future data collection efforts, with survey responses from a total of 12,461 participants. Female participants were significantly less likely to share their browser data, as were participants who were shown the browser data we asked to collect.

In addition, we demonstrate how fingerprinting risks differ across demographic groups. For example, we find that lower-income users are more at risk, and that as users' age increases, they are both more likely to be concerned about fingerprinting and more likely to be at real risk of it. Furthermore, we demonstrate an overlooked risk: user demographics, such as gender, age, income level, ethnicity, and race, can be inferred from browser attributes commonly used for fingerprinting, and we identify which browser attributes contribute most to this risk.

Overall, we show the important role of user demographics in ongoing work to assess fingerprinting risks and improve user privacy, with findings that inform future privacy-enhancing browser development. The dataset and data collection tool we openly publish can be used to study further research questions not addressed in this work.
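As an illustration of how fingerprint uniqueness is commonly quantified, the sketch below groups users by their combination of browser attributes and reports each user's anonymity-set size (1 = uniquely fingerprintable). This is my own minimal sketch, not the paper's analysis code, and the attribute names and records are hypothetical.

```python
from collections import Counter

def anonymity_set_sizes(records, attrs):
    """For each user, count how many users share their exact combination
    of the selected browser attributes (1 = uniquely fingerprintable)."""
    fingerprints = [tuple(r[a] for a in attrs) for r in records]
    counts = Counter(fingerprints)
    return [counts[fp] for fp in fingerprints]

# Hypothetical records with a few typical fingerprinting attributes.
users = [
    {"user_agent": "ua1", "timezone": "UTC-5", "fonts": 120},
    {"user_agent": "ua1", "timezone": "UTC-5", "fonts": 120},
    {"user_agent": "ua2", "timezone": "UTC-8", "fonts": 87},
]
sizes = anonymity_set_sizes(users, ["user_agent", "timezone", "fonts"])
unique_fraction = sum(s == 1 for s in sizes) / len(sizes)
```

On this toy data the first two users share a fingerprint (anonymity set of 2) while the third is unique, so a third of users are uniquely fingerprintable.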
              
  
          
        
      
    
        
          
            
VaultGemma
Lynn Chua, Prem Eruvbetine, Chiyuan Zhang, Thomas Mesnard, Borja De Balle Pigem, Daogao Liu, Amer Sinha, Pritish Kamath, Yangsibo Huang, Christopher A. Choquette-Choo, George Kaissis, Armand Joulin, Da Yu, Ryan McKenna
arXiv (2025)
          
          
        
        
        
          
          
          
In this work, we present VaultGemma 1B, a model from the Gemma family fully trained with differential privacy. VaultGemma 1B is a 1-billion-parameter pretrained model based on the Gemma 2 series of models and uses the same training dataset. We will be releasing a tech report and the weights of this model.
              
  
          
        
      
    
        
          
            
Scaling Embedding Layers in Language Models
Da Yu, Yangsibo Huang, Pritish Kamath, Daogao Liu, Chiyuan Zhang
2025
          
          
        
        
          
            
On the Differential Privacy and Interactivity of Privacy Sandbox Reports
Charlie Harrison, Pritish Kamath, Alexander Knop, Ethan Leeman, Vikas Sahu
PETS (2025)
          
          
        
        
        
          
          
          
              The Privacy Sandbox initiative from Google includes APIs for enabling privacy-preserving advertising functionalities as part of the effort to limit third-party cookies. In particular, the Private Aggregation API (PAA) and the Attribution Reporting API (ARA) can be used for ad measurement while providing different guardrails for safeguarding user privacy, including a framework for satisfying differential privacy (DP). In this work, we provide an abstract model for analyzing the privacy of these APIs and show that they satisfy a formal DP guarantee under certain assumptions. Our analysis handles the case where both the queries and database can change interactively based on previous responses from the API.
              
  
          
        
      
    
        
          
            
Quantifying Cross-Modality Memorization in Vision-Language Models
Chiyuan Zhang, Tom Goldstein, Yuxin Wen, Yangsibo Huang
Advances in Neural Information Processing Systems (2025)
          
          
        
        
        
          
          
          
Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. Finally, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.
              
  
          
        
      
    
        
        
          
          
          
              We study differential privacy (DP) in a multi-party setting where each party only trusts a (known) subset of the other parties with its data. Specifically, given a trust graph where vertices correspond to parties and neighbors are mutually trusting, we give a DP algorithm for aggregation with a much better privacy-utility trade-off than in the well-studied local model of DP (where each party trusts no other party). We further study a robust variant where each party trusts all but an unknown subset of at most t of its neighbors (where t is a given parameter), and give an algorithm for this setting. We complement our algorithms with lower bounds, and discuss implications of our work to other tasks in private learning and analytics.
              
  
          
        
      
    
        
          
            
Balls-and-Bins Sampling for DP-SGD
Lynn Chua, Charlie Harrison, Pritish Kamath, Ethan Leeman, Amer Sinha, Chiyuan Zhang
AISTATS (2025)
          
          
        
        
        
          
          
          
We introduce Balls-and-Bins sampling for differentially private (DP) optimization methods such as DP-SGD. While it has been common practice to use some form of shuffling in DP-SGD implementations, privacy accounting algorithms have typically assumed that Poisson subsampling is used instead. Recent work by Chua et al. (2024), however, pointed out that shuffling-based DP-SGD can have a much larger privacy cost in practical parameter regimes. We show that Balls-and-Bins sampling achieves the best of both samplers: its implementation is similar to that of shuffling, and models trained with Balls-and-Bins-based DP-SGD achieve utility comparable to those trained with shuffle-based DP-SGD at the same noise multiplier; yet Balls-and-Bins sampling enjoys similar-or-better privacy amplification than Poisson subsampling.
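To make the contrast concrete, here is a minimal sketch (my own illustration, not the paper's implementation) of the three batch samplers the abstract compares, for one epoch over n examples:

```python
import random

def shuffle_batches(n, num_steps, rng):
    # Shuffling: permute the data once, then cut it into fixed-size batches.
    order = list(range(n))
    rng.shuffle(order)
    size = n // num_steps
    return [order[i * size:(i + 1) * size] for i in range(num_steps)]

def poisson_batches(n, num_steps, q, rng):
    # Poisson subsampling: each example joins each batch independently
    # with probability q, so batch sizes are random.
    return [[i for i in range(n) if rng.random() < q] for _ in range(num_steps)]

def balls_and_bins_batches(n, num_steps, rng):
    # Balls-and-bins: each example ("ball") lands in exactly one
    # uniformly random batch ("bin") in the epoch.
    bins = [[] for _ in range(num_steps)]
    for i in range(n):
        bins[rng.randrange(num_steps)].append(i)
    return bins
```

Like shuffling (and unlike Poisson subsampling), balls-and-bins touches every example exactly once per epoch; like Poisson subsampling, each example's batch assignment is independent of the others', which is what the amplification analysis exploits.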
              
  
          
        
      
    
        
          
            
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models
Lynn Chua, Yangsibo Huang, Pritish Kamath, Amer Sinha, Chulin Xie, Chiyuan Zhang
COLM (2025)
          
          
        
        
        
          
          
          
              Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz and TOFU benchmark) contexts. Since simple inference-time mitigation methods offer only limited improvement, we propose fine-tuning of LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is available at https://github.com/google-research/crosslingual-knowledge-barriers.
              
  
          
        
      
    
        
          
            
Differentially Private Insights into AI Use
Daogao Liu, Pritish Kamath, Alexander Knop, Adam Sealfon, Da Yu, Chiyuan Zhang
Conference on Language Modeling (COLM) (2025)
          
          
        
        
        
          
          
          
              We introduce Urania, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, Urania provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private method inspired by CLIO (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework’s ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.
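As a toy illustration of the histogram-plus-partition-selection ingredient mentioned above, the sketch below releases noisy keyword counts and drops rare keywords. This is a deliberately simplified sketch under assumed parameters, not Urania's actual mechanism, and the function name and thresholding rule are my own.

```python
import math
from collections import Counter
from random import Random

def dp_keyword_histogram(keywords, epsilon, threshold, rng):
    # Add Laplace(1/epsilon) noise to each keyword's count and release
    # only keywords whose noisy count clears the threshold -- a simple
    # form of differentially private partition selection.
    b = 1.0 / epsilon
    released = {}
    for word, count in Counter(keywords).items():
        # Sample Laplace(b) noise via inverse-CDF from a uniform draw.
        u = rng.random() - 0.5
        noise = -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        if count + noise >= threshold:
            released[word] = count + noise
    return released
```

Thresholding matters because releasing even a noisy count for a keyword that appears once can reveal that some specific user's conversation contained it; the threshold suppresses such rare partitions.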
              
  
          
        
      
    
        
          
            
How Private are DP-SGD Implementations?
Lynn Chua, Pritish Kamath, Amer Sinha, Chiyuan Zhang
ICML (2024)
          
          
        
        
        
          
          
          
We demonstrate a substantial gap between the privacy guarantees of the Adaptive Batch Linear Queries (ABLQ) mechanism under different types of batch sampling: (i) shuffling, and (ii) Poisson subsampling; the typical analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) follows by interpreting it as a post-processing of ABLQ. While shuffling-based DP-SGD is more commonly used in practical implementations, it has not been amenable to easy privacy analysis, either analytically or even numerically. On the other hand, Poisson subsampling-based DP-SGD is challenging to implement scalably, but has a well-understood privacy analysis, with multiple open-source, numerically tight privacy accountants available. This has led to the common practice of using shuffling-based DP-SGD while reporting the privacy analysis of the corresponding Poisson subsampling version. Our result shows that there can be a substantial gap between the privacy analyses under the two types of batch sampling, and thus advises caution in reporting privacy parameters for DP-SGD.
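For context, here is a bare-bones sketch of the DP-SGD step whose batch sampling the paper analyzes: clip per-example gradients, sum, add Gaussian noise, and take an averaged step. This is illustrative only; real implementations operate on model gradients inside a training framework, and the parameter values in the usage below are made up.

```python
import math
import random

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    # Clip each per-example gradient to L2 norm `clip_norm`, sum them,
    # add per-coordinate Gaussian noise with stddev
    # noise_multiplier * clip_norm, then apply an averaged gradient step.
    d = len(params)
    total = [0.0] * d
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(d):
            total[j] += g[j] * scale
    n = len(per_example_grads)
    sigma = noise_multiplier * clip_norm
    return [p - lr * (t + rng.gauss(0.0, sigma)) / n
            for p, t in zip(params, total)]

# Made-up usage: two parameters, two per-example gradients.
rng = random.Random(0)
new_params = dp_sgd_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                         clip_norm=10.0, noise_multiplier=0.0, lr=1.0, rng=rng)
```

The privacy accounting question the paper studies is orthogonal to this step itself: it concerns how `per_example_grads` is chosen at each step (shuffled fixed-size batches vs. independent Poisson subsampling), which changes how much privacy amplification the noise provides.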
              
  