June 10, 2026
Mónica Ribero, Research Scientist, Google Research
We introduce a method designed to confidently determine whether there is statistically significant evidence that two sets of data observations come from entirely different underlying distributions.
Machine unlearning allows AI systems to "forget" specific parts of their training data without the massive cost of retraining a model from scratch. This is essential for regulatory compliance (like GDPR’s "Right to be Forgotten"), AI safety, and model quality.
As models process increasingly massive and highly sensitive datasets, verifying machine unlearning has moved from a theoretical ideal to a strict requirement: developers must now back privacy claims with mathematical evidence. However, because auditors often don’t have access to the model's internal workings or original training data, they must verify the system strictly by querying it and analyzing the output samples.
One method data scientists and researchers rely on for verification is two-sample testing, a statistical method that determines whether two sets of data observations come from different underlying distributions. For example, to verify unlearning, auditors might compare outputs from a model that never saw a specific record against a model that supposedly "forgot" it. If the outputs are statistically different beyond a defined threshold, the test concludes that the unlearning failed.
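As a rough illustration of two-sample testing in general, the sketch below runs a permutation test with a standard kernel statistic, the maximum mean discrepancy (MMD) with a Gaussian kernel. It is a textbook baseline, not the test proposed in this work; the function names, the fixed bandwidth, and the synthetic "model outputs" are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

def permutation_two_sample_test(x, y, n_permutations=200, bandwidth=1.0, seed=0):
    """Return a permutation p-value for the null hypothesis 'x and y share a distribution'."""
    rng = np.random.default_rng(seed)
    observed = mmd_squared(x, y, bandwidth)
    pooled = np.concatenate([x, y], axis=0)
    n = len(x)
    exceed = 0
    for _ in range(n_permutations):
        # Shuffle the pooled sample and split it back into two groups.
        perm = rng.permutation(len(pooled))
        stat = mmd_squared(pooled[perm[:n]], pooled[perm[n:]], bandwidth)
        if stat >= observed:
            exceed += 1
    # The +1 terms count the observed statistic itself, keeping the p-value valid.
    return (1 + exceed) / (1 + n_permutations)

# Synthetic stand-ins for outputs of two models (hypothetical data).
rng = np.random.default_rng(1)
reference_outputs = rng.normal(0.0, 1.0, size=(100, 2))
audited_outputs = rng.normal(3.0, 1.0, size=(100, 2))
p_value = permutation_two_sample_test(reference_outputs, audited_outputs)
print(f"p-value: {p_value:.4f}")
```

A small p-value is evidence that the two sets of outputs come from different distributions. The fixed bandwidth is exactly the kind of manual choice that can make such a baseline miss a shift.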
As models grow in size and complexity, two-sample testing and other statistical tools used for machine unlearning auditing become challenging to implement and lose statistical power. To distinguish a real violation from the random noise inherent in large-scale models, and to do so with enough statistical significance, an auditor needs to extract a large number of samples. This makes real-world testing computationally very expensive.
To address this growing challenge, we introduce Regularized f-Divergence Kernel Tests, presented at AISTATS 2026, a new framework designed to make auditing ML models much more sensitive, flexible, and accurate. We theoretically prove that our tests naturally control for false positives for any sample size, and that the risk of false negatives reliably converges to zero as the number of available data samples increases.
Evaluating model safety often requires measuring the distance, or divergence, between two complex data sets. Different applications naturally require different notions of “distance”. While popular standard tools like maximum mean discrepancy (MMD) excel at detecting broad, global shifts across data (such as a model systematically generating brighter images than its counterpart), they often lack the specificity needed to capture localized anomalies. For instance, if the addition of a specific person's data causes a model to generate a highly specific outlier output only when prompted in a very exact way, while leaving the distribution of all other samples unchanged, traditional MMD tests might completely overlook this local shift.
Also, most existing testing frameworks force researchers to make error-prone manual choices, such as picking the specific statistic best suited for either global or local shifts or tuning complex settings like kernel bandwidths and regularization parameters.
In a simple two-sample test between two two-dimensional distributions (shown above in blue and red), MMD excels at detecting global shifts like differences in mean (left) but can miss localized differences such as outliers (middle) or non-smooth differences that require hyperparameter tuning, such as setting a bandwidth parameter (right).
In addition to being hard in practice, two-sample testing as a verification method is flawed when verifying unlearning of ML models. Consider the example below showing how two models trained from scratch on the exact same data can produce different distributions. The blue distribution is the distribution of a model retrained without compromised data. However, its distribution is different from the standard (green) due to retraining with different batch sizes. This results in a false positive, indicating that the tested model is unsafe.
Using a two-sample test to verify unlearning yields false positives when the tested model has a different distribution than the standard the auditor is comparing to.
Furthermore, recent work shows that an AI model can never perfectly “forget” data just by tweaking its current settings; unless it re-traces every step of its original training, it will always leave behind a permanent footprint of the information it was supposed to delete. Accordingly, achieving perfect “retrain equivalence” is fundamentally impossible for standard, local unlearning algorithms and a traditional two-sample test can always find a dependence on the “forget set”.
We resolve this challenge by proposing a relative distance test that measures whether an unlearned model is distributionally closer to a safely retrained model or to the original, compromised one.
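To make the idea of a relative comparison concrete, here is a minimal sketch that computes the gap between two distances: tested-to-retrained minus tested-to-original. It uses a plain Gaussian-kernel MMD as the distance and synthetic data, both of which are assumptions for illustration; it is not the regularized f-divergence statistic from this work, and it omits the calibration step a real test needs to turn the gap into a decision with controlled error rates.

```python
import numpy as np

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD with a Gaussian kernel."""
    def kernel(a, b):
        sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def relative_distance_gap(tested, retrained, original, bandwidth=1.0):
    """Distance(tested, retrained) minus distance(tested, original).

    A negative gap means the tested model's outputs sit closer to the
    retrained model; a positive gap means they sit closer to the original.
    """
    to_retrained = mmd_squared(tested, retrained, bandwidth)
    to_original = mmd_squared(tested, original, bandwidth)
    return to_retrained - to_original

# Hypothetical output samples from the three models.
rng = np.random.default_rng(0)
tested = rng.normal(0.0, 1.0, size=(200, 2))
retrained = rng.normal(0.1, 1.0, size=(200, 2))  # differs slightly, e.g. training noise
original = rng.normal(2.0, 1.0, size=(200, 2))   # the compromised model

gap = relative_distance_gap(tested, retrained, original)
print("closer to retrained" if gap < 0 else "closer to original")
```

Note that the tested and retrained samples are not identically distributed here, so a plain two-sample test could still reject, while the relative comparison only asks which reference is nearer.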
Our test acts as a highly adaptable statistical toolkit that leverages f-divergences to allow auditors to pinpoint highly specific types of data shifts, including:
Calculating these divergences on high-dimensional, real-world data is notoriously difficult. To make these complex optimization problems tractable without requiring massive amounts of compute, we use kernel regularization methods to estimate the differences efficiently.
Our adaptive testing approach automatically selects the best divergence and the optimal hyperparameter configurations to maximize the reliability of the test, entirely eliminating the need for sample splitting.
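One simple and conservative way to combine several tests on the same data without sample splitting is a Bonferroni correction, sketched below. This is only an illustration of the general idea of aggregating tests; the aggregation procedure used in this work may differ, and the function name and p-values are hypothetical.

```python
def bonferroni_aggregate(p_values, alpha=0.05):
    """Reject the null if any individual test rejects at level alpha / K.

    With K tests, comparing each p-value to alpha / K keeps the overall
    false-positive rate at or below alpha, whatever the dependence between tests.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    threshold = alpha / len(p_values)
    return min(p_values) <= threshold

# Hypothetical p-values from tests with different divergences or bandwidths.
candidate_p_values = [0.20, 0.001, 0.50]
reject = bonferroni_aggregate(candidate_p_values, alpha=0.05)
print("reject null" if reject else "fail to reject")
```

The price of this simplicity is lost power when many candidate configurations are combined, which is why less conservative aggregation schemes are often preferred.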
Because our proposed tests are general, we experimented across a wide variety of problems. We evaluated our framework on perturbed uniforms (synthetic two-sample benchmarks), as well as the Expo1D outlier detection task within physics datasets, a specialized area that uses ML to search for new physical phenomena outside the Standard Model of particle physics. We used high-energy physics data because that field requires the world’s most precise “difference detectors”: the idea is that if the framework can spot a rare particle that defies the laws of physics, it can spot a tiny privacy leak in an AI model.
We then shifted our primary focus to the critical, real-world applications of auditing differential privacy and evaluating machine unlearning:
Proposed framework for relative distance. If the tested model is closer to the compromised model than the retrained golden standard, the test flags an unlearning failure. If the tested model is closer to the golden standard, then the test doesn’t flag any failures.
Our framework matched or outperformed all previous baseline methods with significantly less manual tuning.
The experimental results demonstrated that no single test consistently outperforms the others across every possible scenario. Instead, different f-divergences act as specialized sensors that "light up" for different types of localized data shifts. By using an aggregated approach across diverse statistics, our framework successfully caught subtle errors and anomalies that standard tests completely missed.
For privacy auditing, the hockey-stick divergence test proved to be a powerful and effective tool. Because it directly aligns with the mathematical foundations of pure differential privacy, it allows auditors to tightly control the acceptable degree of data shift. Our adaptive testing framework successfully caught privacy violations using significantly fewer data samples and requiring far less hyperparameter tuning than previous baseline testers.
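For reference, the hockey-stick divergence of order γ between distributions P and Q sums (or integrates) max(p − γq, 0) over outcomes. A mechanism satisfies (ε, δ)-differential privacy exactly when this divergence, taken at γ = e^ε, is at most δ for every pair of neighboring inputs, and δ = 0 corresponds to pure differential privacy. The sketch below computes the exact divergence for small discrete distributions; it is not the kernel-based estimator from this work, and the function name is an assumption.

```python
import numpy as np

def hockey_stick_divergence(p, q, gamma):
    """Exact hockey-stick divergence sum_i max(p_i - gamma * q_i, 0) for discrete p, q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.maximum(p - gamma * q, 0.0)))

# Randomized response with epsilon = ln(3): the true bit is reported
# with probability e^eps / (1 + e^eps) = 0.75.
output_dist_input_0 = [0.75, 0.25]
output_dist_input_1 = [0.25, 0.75]

# At gamma = e^eps = 3 the divergence is 0, consistent with pure eps-DP.
at_claimed_eps = hockey_stick_divergence(output_dist_input_0, output_dist_input_1, 3.0)

# At a smaller gamma (a stricter privacy claim) the divergence is positive,
# so the mechanism would violate that stricter claim with delta = 0.
at_stricter_eps = hockey_stick_divergence(output_dist_input_0, output_dist_input_1, 2.0)

print(at_claimed_eps, at_stricter_eps)
```

In an audit the output distributions are unknown and the divergence has to be estimated from samples, which is the harder problem.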
Detection rate of non-private mechanisms (from standard auditing benchmarks). Our hockey-stick-based tester outperforms previously studied techniques (DP-Auditorium) while using fewer samples.
In one notable instance, our framework detected violations in a specific sparse vector technique mechanism (SVT3) using only a few thousand samples, while previously studied techniques like DP-Auditorium required millions of samples to approximate the same violation detection rate.
Our findings also suggest a redefinition of how to evaluate machine unlearning. As shown in the table below, we observed that none of the approximate unlearning methods we evaluated were compliant with the strict, standard two-sample unlearning definition. Because two-sample tests simply look for any distributional difference, they incorrectly flagged perfectly safe, retrained models as unlearning failures.
In contrast, our proposed relative three-sample test successfully overcame this flaw. It correctly and consistently identified the safely retrained models as "safe". When evaluating the approximate unlearning algorithms, only the random label technique passed the evaluation.
Other popular methods, such as finetuning, pruning, and Selective Synaptic Dampening, were found to be ineffective at truly forgetting the targeted data. We emphasize that our primary goal in these experiments was the evaluation of the unlearning methodologies, rather than designing the algorithms themselves. Consequently, we used simplified implementations of these unlearning procedures; more rigorous setups will be required to rank unlearning methods in practical production environments.
Audit results for different (simplified) unlearning algorithms. Exact unlearning mechanisms retrain from scratch without access to forget data, and are thus safe by definition. However, two-sample tests incorrectly flag them as unsafe due to distributional differences with the “standard”. The three-sample test overcomes this issue.
Our newly proposed framework provides a much more precise, adaptable, and mathematically sound lens for examining ML behavior. By leveraging regularized f-divergence kernel tests, researchers and auditors can now statistically prove whether a model is behaving unsafely or leaking data across a massive class of problems and complex distributional shifts.
As this field evolves, theoretically grounding our empirical observations to characterize exactly which specific divergence is optimal for other novel tasks remains an exciting direction for future work. Establishing tighter sample complexity bounds will also be a key focus to make these audits even more efficient.
The work described here was done jointly with Antonin Schrab and Arthur Gretton. We thank Nicole Mitchell and Eleni Triantafillou for insightful feedback, Kimberly Schwede for the graphics, and Mark Simborg for helpful edits.