Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract The problem of learning threshold functions is a fundamental one in machine learning. Classical learning theory implies sample complexity of $O(\xi^{-1} \log(1/\beta))$ (for generalization error $\xi$ with confidence $1-\beta$). The private version of the problem, however, is more challenging and in particular, the sample complexity must depend on the size $|X|$ of the domain. Progress on quantifying this dependence, via lower and upper bounds, was made in a line of works over the past decade. In this paper, we finally close the gap for approximate-DP and provide a nearly tight upper bound of $\widetilde{O}(\log^* |X|)$, which matches a lower bound by Alon et al (that applies even with improper learning) and improves over a prior upper bound of $\widetilde{O}((\log^* |X|)^{1.5})$ by Kaplan et al. We also provide matching upper and lower bounds of $\tilde{\Theta}(2^{\log^*|X|})$ for the additive error of private quasi-concave optimization (a related and more general problem). Our improvement is achieved via the novel Reorder-Slice-Compute paradigm for private data analysis which we believe will have further applications. View details
    Preview abstract CountSketch and Feature Hashing (the "hashing trick") are popular randomized dimensionality reduction methods that support recovery of $\ell_2$-heavy hitters (keys $i$ where $v_i^2 > \epsilon \|\boldsymbol{v}\|_2^2$) and approximate inner products. When the inputs are {\em not adaptive} (do not depend on prior outputs), classic estimators applied to a sketch of size $O(\ell/\epsilon)$ are accurate for a number of queries that is exponential in $\ell$. When inputs are adaptive, however, an adversarial input can be constructed after $O(\ell)$ queries with the classic estimator and the best known robust estimator only supports $\tilde{O}(\ell^2)$ queries. In this work we show that this quadratic dependence is in a sense inherent: We design an attack that after $O(\ell^2)$ queries produces an adversarial input vector whose sketch is highly biased. Our attack uses "natural" non-adaptive inputs (only the final adversarial input is chosen adaptively) and universally applies with any correct estimator, including one that is unknown to the attacker. In that, we expose inherent vulnerability of this fundamental method. View details
    Preview abstract Composition theorems are general and powerful tools that facilitate privacy accounting across multiple data accesses from per-access privacy bounds. However they often result in weaker bounds compared with end-to-end analysis. Two popular tools that mitigate that are the exponential mechanism (or report noisy max) and the sparse vector technique. They were generalized in a couple of recent private selection/test frameworks, including the work by Liu and Talwar (STOC 2019), and Papernot and Steinke (ICLR 2022). In this work, we first present an alternative framework for private selection and testing with a simpler privacy proof and equally-good utility guarantee. Second, we observe that the private selection framework (both previous ones and ours) can be applied to improve the accuracy/confidence trade-off for many fundamental privacy-preserving data-analysis tasks, including query releasing, top-k selection, and stable selection. Finally, for online settings, we apply the private testing to design a mechanism for adaptive query releasing, which improves the sample complexity dependence on the confidence parameter for the celebrated private multiplicative weights algorithm of Hardt and Rothblum (FOCS 2010). View details
    Preview abstract \texttt{CountSketch} is a popular dimensionality reduction technique that maps vectors to a lower-dimension using a randomized set of linear measurements. The sketch has the property that the $\ell_2$-heavy hitters of a vector (entries with $v_i^2 \geq \frac{1}{k}\|\boldsymbol{v}\|^2_2$) can be recovered from its sketch. We study the robustness of the sketch in adaptive settings, such as online optimization, where input vectors may depend on the output from prior inputs. We show that the classic estimator can be attacked with a number of queries of the order of the sketch size and propose a robust estimator (for a slightly modified sketch) that allows for quadratic number of queries. We improve robustness by a factor of $\sqrt{k}$ (for $k$ heavy hitters) over prior approaches. View details
    Chefs’ Random Tables: Non-Trigonometric Random Features
    Valerii Likhosherstov
    Avinava Dubey
    Frederick Liu
    Adrian Weller
    NeurIPS 2022(2022) (to appear)
    Preview abstract We present \textit{chefs' random tables} (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive -- a critical requirement for prominent applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the FAVOR++ mechanism, is to our knowledge the first RF method for unbiased softmax kernel estimation with positive \& bounded RFs, resulting in strong concentration results characterized by exponentially small tails, and much lower variance. As we show, orthogonal random features applied in FAVOR++ provide additional variance reduction for any dimensionality $d$ (not only asymptotically for sufficiently large $d$, as for RKS). We exhaustively test CRTs and FAVOR++ on tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new SOTA for low-rank text Transformers, while providing linear space \& time complexity. View details
    Preview abstract Common datasets have the form of elements with keys (e.g., transactions and products) and the goal is to perform analytics on the aggregated form of key and frequency pairs. A weighted sample of keys by (a function of) frequency is a highly versatile summary that provides a sparse set of representative keys and supports approximate evaluations of query statistics. We propose private weighted sampling (PWS): A method that sanitizes a weighted sample as to ensure element-level differential privacy, while retaining its utility to the maximum extent possible. PWS maximizes the reporting probabilities of keys and estimation quality of a broad family of statistics. PWS improves over the state of the art even for the well-studied special case of private histograms, when no sampling is performed. We empirically observe significant performance gains of 20%-300% increase in key reporting for common Zipfian frequency distributions and accurate estimation with X2-8 lower frequencies. PWS is applied as a post-processing of a non-private sample, without requiring the original data. Therefore, it can be a seamless addition to existing implementations, such as those optimizes for distributed or streamed data. We believe that due to practicality and performance, PWS may become a method of choice in applications where privacy is desired. View details
    Rethinking Attention with Performers
    Valerii Likhosherstov
    David Martin Dohan
    Xingyou Song
    Peter Hawkins
    Jared Quincy Davis
    Afroz Mohiuddin
    Lukasz Kaiser
    Adrian Weller
    accepted to ICLR 2021 (oral presentation) (to appear)
    Preview abstract We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers. View details
    Stochastic Flows and Geometric Optimization on the Orthogonal Group
    David Cheikhi
    Jared Davis
    Valerii Likhosherstov
    Achille Nazaret
    Achraf Bahamou
    Xingyou Song
    Mrugank Akarte
    Jack Parker-Holder
    Jacob Bergquist
    Yuan Gao
    Aldo Pacchiano
    Adrian Weller
    Thirty-seventh International Conference on Machine Learning (ICML 2020) (to appear)
    Preview abstract We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group O(d) and naturally reductive homogeneous manifolds obtained from the action of the rotation group SO(d). We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, normalizing flows and metric learning. We show an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory (e.g. matching problem, partition functions over graphs, graph-coloring). We leverage the theory of Lie groups and provide theoretical results for the designed class of algorithms. We demonstrate broad applicability of our methods by showing strong performance on the seemingly unrelated tasks of learning world models to obtain stable policies for the most difficult Humanoid agent from OpenAI Gym and improving convolutional neural networks. View details
    Orthogonal Estimation of Wasserstein Distances
    Mark Rowland
    Jiri Hron
    Yunhao Tang
    Adrian Weller
    The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019)
    Preview abstract Wasserstein distances are increasingly used in a wide variety of applications in machine learning. Sliced Wasserstein distances form an important subclass which may be estimated efficiently through one-dimensional sorting operations. In this paper, we propose a new variant of sliced Wasserstein distance, study the use of orthogonal coupling in Monte Carlo estimation of Wasserstein distances and draw connections with stratified sampling, and evaluate our approaches experimentally in a range of large-scale experiments in generative modelling and reinforcement learning. View details
    Preview abstract We study the problem of automatically and efficiently generating itineraries for users who are on vacation. We focus on the common case, wherein the trip duration is more than a single day. Previous efficient algorithms based on greedy heuristics suffer from two problems. First, the itineraries are often unbalanced, with excellent days visiting top attractions followed by days of exclusively lower-quality alternatives. Second, the trips often re-visit neighborhoods repeatedly in order to cover increasingly low-tier points of interest. Our primary technical contribution is an algorithm that addresses both these problems by maximizing the quality of the worst day. We give theoretical results showing that this algorithm's competitive factor is within a factor two of the guarantee of the best available algorithm for a single day, across many variations of the problem. We also give detailed empirical evaluations using two distinct datasets: (a) anonymized Google historical visit data and (b) Foursquare public check-in data. We show first that the overall utility of our itineraries is almost identical to that of algorithms specifically designed to maximize total utility, while the utility of the worst day of our itineraries is roughly twice that obtained from other approaches. We then turn to evaluation based on human raters who score our itineraries only slightly below the itineraries created by human travel experts with deep knowledge of the area. View details
    Geometrically Coupled Monte Carlo Sampling
    Mark Rowland
    François Chalus
    Aldo Pacchiano
    Richard E. Turner
    Adrian Weller
    Advances in Neural Information Processing Systems 31 (NIPS 2018)
    Preview abstract Monte Carlo sampling in high-dimensional, low-sample settings is important in many machine learning tasks. We improve current methods for sampling in Euclidean spaces by avoiding independence, and instead consider ways to couple samples. We show fundamental connections to optimal transport theory, leading to novel sampling algorithms, and providing new theoretical grounding for existing strategies. We compare our new strategies against prior methods for improving sample efficiency, including QMC, by studying discrepancy. We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling. View details
    The Geometry of Random Features
    Mark Rowland
    Richard Turner
    Adrian Weller
    International Conference on Artificial Intelligence and Statistics (AISTATS)(2018)
    Preview abstract We present an in-depth examination of the effectiveness of radial basis function kernel (beyond Gaussian) estimators based on orthogonal random feature maps. We show that orthogonal estimators outperform state-of-the-art mechanisms that use iid sampling under weak conditions for tails of the associated Fourier distributions. We prove that for the case of many dimensions, the superiority of the orthogonal transform over iid methods can be accurately measured by a property we define called the charm of the kernel, and that orthogonal random features provide optimal kernel estimators. Furthermore, we provide the first theoretical results which explain why orthogonal random features outperform unstructured on downstream tasks such as kernel ridge regression by showing that orthogonal random features provide kernel algorithms with better spectral properties than the previous state-of-the-art. Our results enable practitioners more generally to estimate the benefits from applying orthogonal transforms. View details
    Caching with Dual Costs
    Anirban Dasgupta
    Ravi Kumar
    Proceedings of the 26th International Conference on World Wide Web Companion(2017), pp. 643-652
    Preview abstract Caching mechanisms in distributed and social settings face the issue that the items can frequently change, requiring the cached ver- sions to be updated to maintain coherence. There is thus a trade-off between incurring cache misses on read requests and cache hits on update requests. Motivated by this we consider the following dual cost variant of the classical caching problem: each request for an item can be either a read or a write. If the request is read and the item is not in the cache, then a read-miss cost is incurred and if the request is write and the item is in the cache, then a write-hit cost is incurred. The goal is to design a caching algorithm that minimizes the sum of read-miss and write-hit costs. We study online and offline algorithms for this problem. For the online version of the problem, we obtain an efficient algorithm whose cost is provably close to near-optimal cost. This algorithm builds on online algorithms for classical caching and metrical task systems, using them as black boxes. For the offline ver- sion, we obtain an optimal deterministic algorithm that is based on a minimum cost flow. Experiments on real and synthetic data show that our online algorithm incurs much less cost compared to natural baselines, while utilizing cache even better; furthermore, they also show that the online algorithm is close to the offline optimum. View details
    Structured adaptive and random spinners for fast machine learning computations
    Mariusz Bojarski
    Anna Choromanska
    Francois Fagan
    Cedric Gouy-Pailler
    Anne Morvan
    Nourhan Sakr
    Jamal Atif
    Preview abstract We consider an efficient computational framework for speeding up several machine learning algorithms with almost no loss of accuracy. The proposed framework relies on projections via structured matrices that we call Structured Spinners, which are formed as products of three structured matrix-blocks that incorporate rotations. The approach is highly generic, i.e. i) structured matrices under consideration can either be fully-randomized or learned, ii) our structured family contains as special cases all previously considered structured schemes, iii) the setting extends to the non-linear case where the projections are followed by non-linear functions, and iv) the method finds numerous applications including kernel approximations via random feature maps, dimensionality reduction algorithms, new fast cross-polytope LSH techniques, deep learning, convex optimization algorithms via Newton sketches, quantization with random projection trees, and more. The proposed framework comes with theoretical guarantees characterizing the capacity of the structured model in reference to its unstructured counterpart and is based on a general theoretical principle that we describe in the paper. As a consequence of our theoretical analysis, we provide the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the HD3HD2HD1 structured matrix [Andoni et al., 2015]. The exhaustive experimental evaluation confirms the accuracy and efficiency of structured spinners for a variety of different applications. View details
    On Sampling Nodes in a Network
    Flavio Chierichetti
    Anirban Dasgupta
    Ravi Kumar
    WWW(2016) (to appear)
    On Estimating the Average Degree
    Anirban Dasgupta
    Ravi Kumar
    23rd International World Wide Web Conference, WWW '14, ACM(2014) (to appear)
    Preview abstract Networks are characterized by nodes and edges. While there has been a spate of recent work on estimating the number of nodes in a network, the edge-estimation question appears to be largely unaddressed. In this work we consider the problem of estimating the average degree of a large network using efficient random sampling, where the number of nodes is not known to the algorithm. We propose a new estimator for this problem that relies on access to edge samples under a prescribed distribution. Next, we show how to efficiently realize this ideal estimator in a random walk setting. Our estimator has a natural and simple implementation using random walks; we bound its performance in terms of the mixing time of the underlying graph. We then show that our estimators are both provably and practically better than many natural estimators for the problem. Our work contrasts with existing theoretical work on estimating average degree, which assume a uniform random sample of nodes is available and the number of nodes is known. View details
    Optimal Hashing Schemes for Entity Matching
    Nilesh Dalvi
    Vibhor Rastogi
    Anirban Dasgupta
    Anish Das Sarma
    22nd International World Wide Web Conference, WWW '13, ACM, Rio de Janeiro, Brazil(2013), pp. 295-306
    Preview abstract In this paper, we consider the problem of devising blocking schemes for entity matching. There is a lot of work on blocking techniques for supporting various kinds of predicates, e.g. exact matches, fuzzy string-similarity matches, and spatial matches. However, given a complex entity matching function in the form of a Boolean expression over several such predicates, we show that it is an important and non-trivial problem to combine the individual blocking techniques into an efficient blocking scheme for the entity matching function, a problem that has not been studied previously. In this paper, we make fundamental contributions to this problem. We consider an abstraction for modeling complex entity matching functions as well as blocking schemes. We present several results of theoretical and practical interest for the problem. We show that in general, the problem of computing the optimal blocking strategy is NP-hard in the size of the DNF formula describing the matching function. We also present several algorithms for computing the exact optimal strategies (with exponential complexity, but often feasible in practice) as well as fast approximation algorithms. We experimentally demonstrate over commercially used rule-based matching systems over real datasets at Yahoo!, as well as synthetic datasets, that our blocking strategies can be an order of magnitude faster than the baseline methods, and our algorithms can efficiently find good blocking strategies. View details
    Fastfood - Approximating Kernel Expansions in Loglinear Time
    Alex Smola
    30th International Conference on Machine Learning (ICML), Omnipress(2013)
    Preview abstract Fast nonlinear function classes are crucial for nonparametric estimation, such as in kernel methods. This paper proposes an improvement to random kitchen sinks that offers significantly faster computation in log-linear time without sacrificing accuracy. Furthermore, we show how one may adjust the regularization properties of the kernel simply by changing the spectral distribution of the projection matrix. We provide experimental results which show that even for for moderately small problems we already achieve two orders of magnitude faster computation and three orders of magnitude lower memory footprint. View details
    Permutation Indexing: Fast Approximate Retrieval from Large Corpora
    Maxim Gurevich
    22nd International Conference on Information and Knowledge Management (CIKM), ACM(2013)
    Preview abstract Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries. View details
    Fastfood-computing hilbert space expansions in loglinear time
    Alex Smola
    International Conference on Machine Learning(2013) (to appear)
    Preview abstract Despite their successes, what makes kernel methods difficult to use in many large scale problems is the fact that computing the decision function is typically expensive, especially at prediction time. In this paper, we overcome this difficulty by proposing Fastfood, an approximation that accelerates such computation significantly. Key to Fastfood is the observation that Hadamard matrices when combined with diagonal Gaussian matrices exhibit properties similar to dense Gaussian random matrices. Yet unlike the latter, Hadamard and diagonal matrices are inexpensive to multiply and store. These two matrices can be used in lieu of Gaussian matrices in Random Kitchen Sinks (Rahimi & Recht, 2007) and thereby speeding up the computation for a large range of kernel functions. Specifically, Fastfood requires O(n log d) time and O(n) storage to compute n non-linear basis functions in d dimensions, a significant improvement from O(nd) computation and storage, without sacrificing accuracy. We prove that the approximation is unbiased and has low variance. Extensive experiments show that we achieve similar accuracy to full kernel expansions and Random Kitchen Sinks while being 100x faster and using 1000x less memory. These improvements, especially in terms of memory usage, make kernel methods more practical for applications that have large training sets and/or require real-time prediction. View details
    No Results Found