Pasin Manurangsi

Authored Publications
    Balls-and-Bins Sampling for DP-SGD
    Lynn Chua
    Charlie Harrison
    Pritish Kamath
    Ethan Leeman
    Amer Sinha
    Chiyuan Zhang
    AISTATS (2025)
    Abstract: We introduce Balls-and-Bins sampling for differentially private (DP) optimization methods such as DP-SGD. While it has been common practice to use some form of shuffling in DP-SGD implementations, privacy accounting algorithms have typically assumed that Poisson subsampling is used instead. Recent work by Chua et al. (2024), however, pointed out that shuffling-based DP-SGD can have a much larger privacy cost in practical parameter regimes. We show that Balls-and-Bins sampling achieves the best of both samplers: its implementation is similar to that of shuffling, and models trained with Balls-and-Bins-based DP-SGD achieve utility comparable to those trained with shuffle-based DP-SGD at the same noise multiplier, yet Balls-and-Bins sampling enjoys similar-or-better privacy amplification compared to Poisson subsampling.
    Abstract: Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz and TOFU benchmark) contexts. Since simple inference-time mitigation methods offer only limited improvement, we propose fine-tuning of LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is available at https://github.com/google-research/crosslingual-knowledge-barriers.
    Abstract: In the Max k-Weight SAT (aka Max SAT with Cardinality Constraint) problem, we are given a CNF formula with n variables and m clauses together with a positive integer k. The goal is to find an assignment where at most k variables are set to one that satisfies as many constraints as possible. Recently, Jain et al. (SODA 2023) gave an FPT approximation scheme (FPT-AS) with running time 2^O((dk/ε)^d) * (n + m)^O(1) for Max k-Weight SAT when the incidence graph is K_{d,d}-free. They asked whether a polynomial-size approximate kernel exists. In this work, we answer this question positively by giving a (1 − ε)-approximate kernel with (dk/ε)^O(d) variables. This also implies an improved FPT-AS with running time (dk/ε)^O(dk) * (n + m)^O(1) for the problem. Our approximate kernel is based mainly on a couple of greedy strategies together with a sunflower lemma-style reduction rule.
    Abstract: The Privacy Sandbox initiative from Google includes APIs for enabling privacy-preserving advertising functionalities as part of the effort to limit third-party cookies. In particular, the Private Aggregation API (PAA) and the Attribution Reporting API (ARA) can be used for ad measurement while providing different guardrails for safeguarding user privacy, including a framework for satisfying differential privacy (DP). In this work, we provide an abstract model for analyzing the privacy of these APIs and show that they satisfy a formal DP guarantee under certain assumptions. Our analysis handles the case where both the queries and database can change interactively based on previous responses from the API.
    Abstract: When dividing items among agents, two of the most widely studied fairness notions are envy-freeness and proportionality. We consider a setting where m chores are allocated to n agents and the disutility of each chore for each agent is drawn from a probability distribution. We show that an envy-free allocation exists with high probability provided that m ≥ 2n, and moreover, m must be at least n + Θ(n) in order for the existence to hold. On the other hand, we prove that a proportional allocation is likely to exist as long as m = ω(1), and this threshold is asymptotically tight. Our results reveal a clear contrast with the allocation of goods, where a larger number of items is necessary to ensure existence for both notions.
    Abstract: Differential privacy can be achieved in a distributed manner, where multiple parties add independent noise such that their sum protects the overall dataset with differential privacy. A common technique here is for each party to sample their noise from the decomposition of an infinitely divisible distribution. We introduce two novel mechanisms in this setting: 1) the generalized discrete Laplace (GDL) mechanism, whose distribution (which is closed under summation) follows from differences of i.i.d. negative binomial shares, and 2) the multi-scale discrete Laplace (MSDLap) mechanism, which follows the sum of multiple i.i.d. discrete Laplace shares at different scales. The mechanisms can be parameterized to have O(Δ^3 e^{−ε}) and O(min(Δ^3 e^{−ε}, Δ^2 e^{−2ε/3})) MSE, respectively, where the latter bound matches known optimality results. Furthermore, the MSDLap mechanism has the optimal MSE including constants as ε → ∞. We also show a transformation from the discrete setting to the continuous setting, which allows us to transform both mechanisms to the continuous setting and thereby achieve the optimal O(Δ^2 e^{−2ε/3}) MSE. To our knowledge, these are the first infinitely divisible additive noise mechanisms that achieve order-optimal MSE under pure differential privacy for either the discrete or continuous setting, so our work shows formally that there is no separation in utility when query-independent noise-adding mechanisms are restricted to infinitely divisible noise. For the continuous setting, our result improves upon Pagh and Stausholm’s Arete distribution, which gives an MSE of O(Δ^2 e^{−ε/4}) [35]. We apply our results to improve a state-of-the-art multi-message shuffle DP protocol from [3] in the high-ε regime.
    Abstract: We consider the differentially private (DP) facility location problem in the so-called super-set output setting proposed by Gupta et al. [GLM+10]. The current best known expected approximation ratio for an ε-DP algorithm is O(log n / √ε) due to Cohen-Addad et al. [CEF+22], where n denotes the size of the metric space, while the best known lower bound is Ω(1/√ε) [EGLW19]. In this short note, we give a lower bound of Ω(min{log n, √(log n/ε)}) on the expected approximation ratio of any ε-DP algorithm, which is the first evidence that the approximation ratio has to grow with the size of the metric space.
    Abstract: Web browser fingerprinting can be used to identify and track users across the Web, even without cookies, by collecting attributes from users' devices to create unique "fingerprints". This technique and resulting privacy risks have been studied for over a decade. Yet further research is limited because prior studies did not openly publish their data. Additionally, data in prior studies had biases and lacked user demographics. Here we publish a first-of-its-kind open dataset that includes browser attributes with users' demographics, collected from 8,400 US study participants, with their informed consent. Our data collection process also conducted an experiment to study what impacts users' likelihood to share browser data for open research, in order to inform future data collection efforts, with survey responses from a total of 12,461 participants. Female participants were significantly less likely to share their browser data, as were participants who were shown the browser data we asked to collect. In addition, we demonstrate how fingerprinting risks differ across demographic groups. For example, we find lower-income users are more at risk, and find that as users' age increases, they are both more likely to be concerned about fingerprinting and at real risk of fingerprinting. Furthermore, we demonstrate an overlooked risk: user demographics, such as gender, age, income level, ethnicity and race, can be inferred from browser attributes commonly used for fingerprinting, and we identify which browser attributes most contribute to this risk. Overall, we show the important role of user demographics in the ongoing work that intends to assess fingerprinting risks and improve user privacy, with findings to inform future privacy-enhancing browser developments. The dataset and data collection tool we openly publish can be used to further study research questions not addressed in this work.
    Dividing conflicting items fairly
    Ayumi Igarashi
    Hirotaka Yoneda
    IJCAI (2025)
    Abstract: We study the allocation of indivisible goods under conflicting constraints, represented by a graph. In this framework, vertices correspond to goods and edges correspond to conflicts between a pair of goods. Each agent is allocated an independent set in the graph. In recent work [Kumar et al., 2024], it was shown that a maximal EF1 allocation exists for interval graphs and two agents with monotone valuations. We significantly extend this result by establishing that a maximal EF1 allocation exists for any graph when the two agents have monotone valuations. To compute such an allocation, we present a polynomial-time algorithm for additive valuations as well as a pseudo-polynomial-time algorithm for monotone valuations. Moreover, we complement our findings by providing a counterexample demonstrating that a maximal EF1 allocation may not exist for three agents with monotone valuations. Additionally, we establish NP-hardness of determining the existence of such allocations for every fixed number n of agents.
    Abstract: We study differential privacy (DP) in a multi-party setting where each party only trusts a (known) subset of the other parties with its data. Specifically, given a trust graph where vertices correspond to parties and neighbors are mutually trusting, we give a DP algorithm for aggregation with a much better privacy-utility trade-off than in the well-studied local model of DP (where each party trusts no other party). We further study a robust variant where each party trusts all but an unknown subset of at most t of its neighbors (where t is a given parameter), and give an algorithm for this setting. We complement our algorithms with lower bounds, and discuss implications of our work for other tasks in private learning and analytics.