Ravi Kumar

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract The Privacy Sandbox initiative from Google includes APIs for enabling privacy-preserving advertising functionalities as part of the effort to limit third-party cookies. In particular, the Private Aggregation API (PAA) and the Attribution Reporting API (ARA) can be used for ad measurement while providing different guardrails for safeguarding user privacy, including a framework for satisfying differential privacy (DP). In this work, we provide an abstract model for analyzing the privacy of these APIs and show that they satisfy a formal DP guarantee under certain assumptions. Our analysis handles the case where both the queries and database can change interactively based on previous responses from the API. View details
    Preview abstract Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz and TOFU benchmark) contexts. Since simple inference-time mitigation methods offer only limited improvement, we propose fine-tuning of LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is available at https://github.com/google-research/crosslingual-knowledge-barriers. View details
    Quantifying Cross-Modality Memorization in Vision-Language Models
    Chiyuan Zhang
    Tom Goldstein
    Yuxin Wen
    Yangsibo Huang
    Advances in Neural Information Processing Systems (2025)
    Preview abstract Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. At the end, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability. View details
    Preview abstract We study differential privacy (DP) in a multi-party setting where each party only trusts a (known) subset of the other parties with its data. Specifically, given a trust graph where vertices correspond to parties and neighbors are mutually trusting, we give a DP algorithm for aggregation with a much better privacy-utility trade-off than in the well-studied local model of DP (where each party trusts no other party). We further study a robust variant where each party trusts all but an unknown subset of at most t of its neighbors (where t is a given parameter), and give an algorithm for this setting. We complement our algorithms with lower bounds, and discuss implications of our work to other tasks in private learning and analytics. View details
    Balls-and-Bins Sampling for DP-SGD
    Lynn Chua
    Charlie Harrison
    Pritish Kamath
    Ethan Leeman
    Amer Sinha
    Chiyuan Zhang
    AISTATS (2025)
    Preview abstract We introduce the Balls-and-Bins sampling for differentially private (DP) optimization methods such as DP-SGD. While it has been common practice to use some form of shuffling in DP-SGD implementations, privacy accounting algorithms have typically assumed that Poisson subsampling is used instead. Recent work by Chua et al. (2024) however pointed out that shuffling based DP-SGD can have a much larger privacy cost in practical regime of parameters. We show that the Balls-and-Bins sampling achieves the “best-of-both” samplers, namely, the implementation of Balls-and-Bins sampling is similar to that of Shuffling and models trained with Balls-and-Bins based DP-SGD achieve utility comparable to those trained with Shuffle based DP-SGD at the same noise multiplier, and yet, Balls-and-Bins sampling enjoys similar-or-better privacy amplification as compared to Poisson subsampling. View details
    VaultGemma
    Lynn Chua
    Prem Eruvbetine
    Chiyuan Zhang
    Thomas Mesnard
    Borja De Balle Pigem
    Daogao Liu
    Amer Sinha
    Pritish Kamath
    Yangsibo Huang
    Christopher A. Choquette-Choo
    George Kaissis
    Armand Joulin
    Da Yu
    Ryan McKenna
    arxiv (2025)
    Preview abstract In this work, we present VaultGemma 1B, a model based on the Gemma family of models fully trained with differential privacy. VaultGemma 1B is 1 billion parameter pretrained model based on the Gemma 2 series of models and uses the same dataset for training. We will be releasing a tech report and the weights of this model. View details
    Differentially Private Insights into AI Use
    Daogao Liu
    Pritish Kamath
    Alexander Knop
    Adam Sealfon
    Da Yu
    Chiyuan Zhang
    Conference on Language Modeling (COLM) 2025 (2025)
    Preview abstract We introduce Urania, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, Urania provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private method inspired by CLIO (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework’s ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation. View details
    Linear Elastic Caching via Ski Rental
    Todd Lipcon
    The biennial Conference on Innovative Data Systems Research (2025)
    Preview abstract In this work we study the Linear Elastic Caching problem, where the goal is to minimize the total cost of a cache inclusive of not just its misses, but also its memory footprint integrated over time. We demonstrate a theoretical connection to the classic ski rental problem and propose a practical algorithm that combines online caching algorithms with ski rental policies. We also introduce a lightweight machine learning-based algorithm for ski rental that is optimized for production workloads and is easy to integrate within existing database systems. Evaluations on both production workloads in Google Spanner and publicly available traces show that the proposed elastic caching approach can significantly reduce the total cache cost compared to traditional fixed-size cache policies. View details
    Preview abstract Web browser fingerprinting can be used to identify and track users across the Web, even without cookies, by collecting attributes from users' devices to create unique "fingerprints". This technique and resulting privacy risks have been studied for over a decade. Yet further research is limited because prior studies did not openly publish their data. Additionally, data in prior studies had biases and lacked user demographics. Here we publish a first-of-its-kind open dataset that includes browser attributes with users' demographics, collected from 8,400 US study participants, with their informed consent. Our data collection process also conducted an experiment to study what impacts users' likelihood to share browser data for open research, in order to inform future data collection efforts, with survey responses from a total of 12,461 participants. Female participants were significantly less likely to share their browser data, as were participants who were shown the browser data we asked to collect. In addition we demonstrate how fingerprinting risks differ across demographic groups. For example, we find lower income users are more at risk, and find that as users' age increases, they are both more likely to be concerned about fingerprinting and at real risk of fingerprinting. Furthermore, we demonstrate an overlooked risk: user demographics, such as gender, age, income level, ethnicity and race, can be inferred from browser attributes commonly used for fingerprinting, and we identify which browser attributes most contribute to this risk. Overall, we show the important role of user demographics in the ongoing work that intends to assess fingerprinting risks and improve user privacy, with findings to inform future privacy enhancing browser developments. The dataset and data collection tool we openly publish can be used to further study research questions not addressed in this work. View details
    Training Differentially Private Ad Prediction Models with Semi-Sensitive Features
    Lynn Chua
    Qiliang Cui
    Charlie Harrison
    Pritish Kamath
    Walid Krichene
    Nicolas Mayoraz
    Krishna Giri Narra
    Amer Sinha
    Avinash Varadarajan
    Chiyuan Zhang
    AdKDD@KDD (2024)
    Preview abstract Motivated by problems arising in digital advertising, we study the task of training differentially private (DP) machine learning models with semi-sensitive features. In this setting, a subset of the features is known to the attacker (and thus need not be protected) while the remaining features as well as the label are unknown to the attacker and should be protected by the DP guarantee. This task interpolates between training the model with full DP (where the label and all features should be protected) or with label DP (where all the features are considered known, and only the label should be protected). We present a new algorithm for training DP models with semi-sensitive features. Through an empirical evaluation on ads datasets, we demonstrate that our algorithm surpasses in utility the baselines of (i) DP stochastic gradient descent (DP-SGD) run on all features (known and unknown), and (ii) a label DP algorithm run only on the known features (while discarding the unknown ones). View details