Badih Ghazi

I am a Research Scientist in the Algorithms & Optimization Team at Google. Here's a link to my personal webpage.
Authored Publications
    Differentially Private Insights into AI Use
    Daogao Liu
    Pritish Kamath
    Alexander Knop
    Adam Sealfon
    Da Yu
    Chiyuan Zhang
    Conference on Language Modeling (COLM), 2025
    Abstract: We introduce Urania, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, Urania provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private method inspired by CLIO (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.
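    Several pieces of this pipeline are standard DP primitives. As a rough illustration of the histogram-plus-thresholding pattern used for keyword summaries (a minimal sketch, not the paper's actual mechanism; the function name and parameters are hypothetical), one might write:

```python
import numpy as np
from collections import Counter

def dp_keyword_histogram(keywords, epsilon, threshold, rng=None):
    """Release a noisy keyword histogram, assuming each user
    contributes at most one keyword (sensitivity 1)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = {}
    for word, count in Counter(keywords).items():
        # Laplace noise calibrated to sensitivity 1.
        noisy_count = count + rng.laplace(scale=1.0 / epsilon)
        # Keep only keywords whose noisy count clears a threshold, so
        # rare (potentially identifying) keywords are never released;
        # thresholding over an unknown domain is what costs the delta.
        if noisy_count >= threshold:
            noisy[word] = noisy_count
    return noisy
```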
    Abstract: User-level differentially private stochastic convex optimization (DP-SCO) has garnered significant attention due to the paramount importance of safeguarding user privacy in large-scale machine learning applications. Current methods, such as those based on Differentially Private Stochastic Gradient Descent (DP-SGD), often struggle with high noise accumulation and suboptimal utility due to the need to privatize every intermediate iterate. In this work, we introduce a novel linear-time algorithm that leverages robust statistics, specifically the geometric median and trimmed mean, to overcome these challenges. Our approach uniquely bounds the sensitivity of all intermediate iterates of SGD with gradient estimation based on robust statistics, thereby significantly reducing the gradient estimation noise and enhancing the privacy-utility trade-off. By sidestepping the repeated privatization required by previous methods, our algorithm not only achieves an improved theoretical privacy-utility balance but also maintains computational efficiency. This work sets the stage for more robust and efficient privacy-preserving techniques in machine learning, with implications for future research and application in the field.
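    As intuition for the robust-statistics step, here is a generic coordinate-wise trimmed mean over per-example gradients (a sketch under the stated assumptions, not the paper's exact estimator):

```python
import numpy as np

def trimmed_mean_gradient(per_example_grads, trim_frac=0.1):
    """Coordinate-wise trimmed mean: drop the top and bottom trim_frac
    of values in each coordinate and average the rest. Trimming bounds
    how much any single example can move the estimate, which is the
    property robust statistics contribute to the privacy analysis."""
    g = np.sort(np.asarray(per_example_grads), axis=0)  # sort each coordinate
    k = int(len(per_example_grads) * trim_frac)         # items trimmed per side
    return g[k:len(per_example_grads) - k].mean(axis=0)
```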
    Quantifying Cross-Modality Memorization in Vision-Language Models
    Chiyuan Zhang
    Tom Goldstein
    Yuxin Wen
    Yangsibo Huang
    Advances in Neural Information Processing Systems (2025)
    Abstract: Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. Finally, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.
    Scaling Embedding Layers in Language Models
    Da Yu
    Yangsibo Huang
    Pritish Kamath
    Daogao Liu
    Chiyuan Zhang
    2025
    Balls-and-Bins Sampling for DP-SGD
    Lynn Chua
    Charlie Harrison
    Pritish Kamath
    Ethan Leeman
    Amer Sinha
    Chiyuan Zhang
    AISTATS (2025)
    Abstract: We introduce Balls-and-Bins sampling for differentially private (DP) optimization methods such as DP-SGD. While it has been common practice to use some form of shuffling in DP-SGD implementations, privacy accounting algorithms have typically assumed that Poisson subsampling is used instead. Recent work by Chua et al. (2024), however, pointed out that shuffling-based DP-SGD can have a much larger privacy cost in practical parameter regimes. We show that Balls-and-Bins sampling achieves the best of both samplers: its implementation is similar to that of shuffling, and models trained with Balls-and-Bins-based DP-SGD achieve utility comparable to those trained with shuffle-based DP-SGD at the same noise multiplier, yet Balls-and-Bins sampling enjoys similar or better privacy amplification than Poisson subsampling.
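    The sampler itself is simple to state: each example ("ball") is independently assigned to a uniformly random batch ("bin") per epoch. A minimal sketch (the helper name and interface are illustrative):

```python
import numpy as np

def balls_and_bins_batches(num_examples, num_batches, rng=None):
    """Balls-and-Bins sampling: every example is used exactly once per
    epoch, as with shuffling, but batch membership is independent
    across examples; batch sizes are random, concentrated around
    num_examples / num_batches."""
    rng = np.random.default_rng() if rng is None else rng
    bins = rng.integers(num_batches, size=num_examples)  # ball -> bin
    return [np.flatnonzero(bins == b) for b in range(num_batches)]

# Example: one epoch over 10,000 examples in 100 batches; each batch
# would feed a standard DP-SGD step (clip, add noise, update).
for batch_indices in balls_and_bins_batches(10_000, 100):
    pass  # dp_sgd_step(batch_indices) -- hypothetical training step
```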
    Abstract: The conventional approach in the differential privacy (DP) literature formulates the privacy-utility tradeoff with a "privacy-first" perspective: for a predetermined level of privacy, a certain utility is achievable. However, practitioners often operate under a "utility-first" paradigm, prioritizing a desired level of utility and then determining the corresponding privacy cost. Wu et al. [2019] initiated a formal study of this "utility-first" perspective by introducing ex-post DP. They demonstrated that by adding correlated Laplace noise and progressively reducing it on demand, a sequence of increasingly accurate estimates of a private parameter can be generated, with the privacy cost attributed only to the least noisy iterate released. This led to a Laplace mechanism variant that achieves a specified utility with minimal privacy loss. However, their work, and similar findings by Whitehouse et al. [2023], are primarily limited to simple mechanisms based on Laplace or Gaussian noise. In this paper, we significantly generalize these results. In particular, we extend the findings of Wu et al. [2019] and Liu and Talwar [2019] to support any sequence of private estimators, incurring at most a doubling of the original privacy budget. Furthermore, we demonstrate that hyperparameter tuning for these estimators, including the selection of an optimal privacy budget, can be performed without additional privacy cost. Finally, we extend our results to ex-post Rényi DP, further broadening the applicability of utility-first privacy mechanisms.
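    The correlated-noise idea can be seen in a Gaussian analogue, where each coarser estimate is the finest one plus independent extra noise, so releasing a coarsest-first prefix is a post-processing of the finest estimate actually revealed (a sketch in the spirit of this line of work, not the paper's Laplace construction):

```python
import numpy as np

def noise_reduction_estimates(x, sigmas, rng=None):
    """Return correlated noisy estimates of a statistic x at the given
    noise levels, coarsest first. Because coarser estimates add fresh
    independent noise to the finest one, the ex-post privacy cost of
    releasing a prefix is only that of the finest estimate released."""
    rng = np.random.default_rng() if rng is None else rng
    sigmas = sorted(sigmas)                       # smallest sigma first
    est = [x + rng.normal(scale=sigmas[0])]       # finest estimate
    for s_fine, s_coarse in zip(sigmas, sigmas[1:]):
        extra = np.sqrt(s_coarse**2 - s_fine**2)  # variances add up
        est.append(est[-1] + rng.normal(scale=extra))
    return est[::-1]                              # coarsest first
```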
    Abstract: We study differential privacy (DP) in a multi-party setting where each party only trusts a (known) subset of the other parties with its data. Specifically, given a trust graph where vertices correspond to parties and neighbors are mutually trusting, we give a DP algorithm for aggregation with a much better privacy-utility trade-off than in the well-studied local model of DP (where each party trusts no other party). We further study a robust variant where each party trusts all but an unknown subset of at most t of its neighbors (where t is a given parameter), and give an algorithm for this setting. We complement our algorithms with lower bounds, and discuss implications of our work for other tasks in private learning and analytics.
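    As a crude baseline conveying why trust helps (an illustrative construction only, not the paper's algorithm): route each party's value to a trusted neighbor in a dominating set of the trust graph, so noise is added once per aggregator rather than once per party, as it would be in the local model:

```python
import numpy as np
import networkx as nx

def trust_graph_sum(values, trust_graph, epsilon, rng=None):
    """Illustrative baseline: each party hands its value, assumed in
    [0, 1], to a trusted neighbor in a dominating set; each such
    aggregator adds Laplace noise once. Total noise grows with the
    dominating-set size instead of the number of parties."""
    rng = np.random.default_rng() if rng is None else rng
    dom = nx.dominating_set(trust_graph)
    sums = dict.fromkeys(dom, 0.0)
    for party, x in values.items():
        # By the dominating property, every party is in `dom` or has a
        # neighbor there; it only reveals its raw value to that
        # trusted aggregator.
        owner = party if party in dom else next(
            u for u in trust_graph[party] if u in dom)
        sums[owner] += x
    # One party's value affects one bucket sum by at most 1.
    return sum(s + rng.laplace(scale=1.0 / epsilon) for s in sums.values())
```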
    PREM: Privately Answering Statistical Queries with Relative Error
    Sushant Sachdeva
    Cristóbal Guzmán
    Alexander Knop
    Pritish Kamath
    COLT (2025)
    Abstract: We introduce PREM (Private Relative Error Multiplicative weight update), a new (ε, δ)-DP mechanism for privately generating synthetic data that achieves a relative error guarantee for linear queries. Namely, for a domain X, a family F of queries f : X → {0, 1}, and ζ > 0, the mechanism on input dataset D ∈ X^n outputs a synthetic dataset D̃ ∈ X^n such that all linear queries in F on D, namely ∑_{x∈D} f(x) for f ∈ F, are within a 1 ± ζ multiplicative factor of the corresponding value on D̃, up to an additive error that is polynomial in log |F|, log |X|, log n, log(1/δ), 1/ε, and 1/ζ. This is in contrast to the standard additive-error-only setting considered in the literature, which is known to require error polynomial in at least one of n, |F|, or |X|. We complement the algorithm with nearly matching lower bounds.
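    For context, the multiplicative-weights engine that mechanisms in this family (e.g., MWEM) build on maintains a distribution over the domain and nudges it toward agreeing with the data on one query at a time. A generic sketch of that core step, not PREM's exact relative-error update:

```python
import numpy as np

def mw_update(weights, query_vec, target, current, eta=0.5):
    """One multiplicative-weights step on a distribution over the
    domain: increase mass on elements satisfying the query when the
    synthetic answer undershoots the (noisy) real answer, decrease it
    when it overshoots. The private query-selection and noisy-answer
    machinery is omitted."""
    direction = np.sign(target - current)          # under- or over-shoot
    weights = weights * np.exp(eta * direction * query_vec)
    return weights / weights.sum()                 # renormalize
```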
    Abstract: Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz and TOFU benchmark) contexts. Since simple inference-time mitigation methods offer only limited improvement, we propose fine-tuning of LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is available at https://github.com/google-research/crosslingual-knowledge-barriers.
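    The proposed mitigation, fine-tuning on mixed-language data, amounts to constructing a corpus that interleaves languages at the example level. A hypothetical sketch of one such construction (the paper's exact recipe may differ):

```python
import random

def mixed_language_corpus(parallel_pairs, mix_prob=0.5, seed=0):
    """Given (english_text, other_language_text) pairs, keep one side
    of each pair at random, yielding a fine-tuning corpus in which
    languages are mixed across examples."""
    rng = random.Random(seed)
    return [en if rng.random() < mix_prob else xx
            for en, xx in parallel_pairs]
```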
    Abstract: The Privacy Sandbox initiative from Google includes APIs for enabling privacy-preserving advertising functionalities as part of the effort to limit third-party cookies. In particular, the Private Aggregation API (PAA) and the Attribution Reporting API (ARA) can be used for ad measurement while providing different guardrails for safeguarding user privacy, including a framework for satisfying differential privacy (DP). In this work, we provide an abstract model for analyzing the privacy of these APIs and show that they satisfy a formal DP guarantee under certain assumptions. Our analysis handles the case where both the queries and database can change interactively based on previous responses from the API.
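    The DP framework underlying such APIs follows the familiar bounded-contribution histogram pattern. A minimal sketch of that pattern (names and parameters are illustrative, not the actual API surface):

```python
import numpy as np

def noisy_summary_report(bucket_sums, contribution_cap, epsilon, rng=None):
    """Sketch of the aggregate-and-noise pattern: each client's total
    contribution across histogram buckets is capped in advance (L1
    sensitivity = contribution_cap), and Laplace noise calibrated to
    that cap is added to every bucket sum before release."""
    rng = np.random.default_rng() if rng is None else rng
    scale = contribution_cap / epsilon
    return {bucket: total + rng.laplace(scale=scale)
            for bucket, total in bucket_sums.items()}
```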