# Tamas Sarlos

### Research Areas

Authored Publications

Google Publications

Other Publications

Sort By

Chefs’ Random Tables: Non-Trigonometric Random Features

Valerii Likhosherstov

Avinava Dubey

Frederick Liu

Adrian Weller

NeurIPS 2022 (2022) (to appear)

Preview abstract
We present \textit{chefs' random tables} (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive -- a critical requirement for prominent applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the FAVOR++ mechanism, is to our knowledge the first RF method for unbiased softmax kernel estimation with positive \& bounded RFs, resulting in strong concentration results characterized by exponentially small tails, and much lower variance. As we show, orthogonal random features applied in FAVOR++ provide additional variance reduction for any dimensionality $d$ (not only asymptotically for sufficiently large $d$, as for RKS). We exhaustively test CRTs and FAVOR++ on tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new SOTA for low-rank text Transformers, while providing linear space \& time complexity.
View details

Rethinking Attention with Performers

Valerii Likhosherstov

David Martin Dohan

Peter Hawkins

Jared Quincy Davis

Lukasz Kaiser

Adrian Weller

accepted to ICLR 2021 (oral presentation) (to appear)

Preview abstract
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
View details

Stochastic Flows and Geometric Optimization on the Orthogonal Group

David Cheikhi

Jared Davis

Valerii Likhosherstov

Achille Nazaret

Achraf Bahamou

Xingyou Song

Mrugank Akarte

Jack Parker-Holder

Jacob Bergquist

Yuan Gao

Aldo Pacchiano

Adrian Weller

Thirty-seventh International Conference on Machine Learning (ICML 2020) (to appear)

Preview abstract
We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group O(d) and naturally reductive homogeneous manifolds obtained from the action of the rotation group SO(d). We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, normalizing flows and metric learning. We show an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory (e.g. matching problem, partition functions over graphs, graph-coloring). We leverage the theory of Lie groups and provide theoretical results for the designed class of algorithms. We demonstrate broad applicability of our methods by showing strong performance on the seemingly unrelated tasks of learning world models to obtain stable policies for the most difficult Humanoid agent from OpenAI Gym and improving convolutional neural networks.
View details

Orthogonal Estimation of Wasserstein Distances

Mark Rowland

Jiri Hron

Yunhao Tang

Adrian Weller

The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019)

Preview abstract
Wasserstein distances are increasingly used in a wide variety of applications in machine learning. Sliced Wasserstein distances form an important subclass which may be estimated efficiently through one-dimensional sorting operations. In this paper, we propose a new variant of sliced Wasserstein distance, study the use of orthogonal coupling in Monte Carlo estimation of Wasserstein distances and draw connections with stratified sampling, and evaluate our approaches experimentally in a range of large-scale experiments in generative modelling and reinforcement learning.
View details

Orienteering Algorithms for Generating Travel Itineraries

Zachary Friggstad

Chaitanya Swamy

WSDM (2018)

Preview abstract
We study the problem of automatically and efficiently generating itineraries
for users who are on vacation. We focus on the common case, wherein the trip
duration is more than a single day. Previous efficient algorithms based on
greedy heuristics suffer from two problems. First, the itineraries are often
unbalanced, with excellent days visiting top attractions followed by days of
exclusively lower-quality alternatives. Second, the trips often re-visit
neighborhoods repeatedly in order to cover increasingly low-tier points of
interest. Our primary technical contribution is an algorithm that addresses
both these problems by maximizing the quality of the worst day. We give
theoretical results showing that this algorithm's competitive factor is within
a factor two of the guarantee of the best available algorithm for a single day,
across many variations of the problem. We also give detailed empirical
evaluations using two distinct datasets: (a) anonymized Google historical visit
data and (b) Foursquare public check-in data. We show first that the overall
utility of our itineraries is almost identical to that of algorithms
specifically designed to maximize total utility, while the utility of the worst
day of our itineraries is roughly twice that obtained from other approaches.
We then turn to evaluation based on human raters who score our itineraries only
slightly below the itineraries created by human travel experts with deep
knowledge of the area.
View details

Geometrically Coupled Monte Carlo Sampling

Mark Rowland

François Chalus

Aldo Pacchiano

Richard E. Turner

Adrian Weller

Advances in Neural Information Processing Systems 31 (NIPS 2018)

Preview abstract
Monte Carlo sampling in high-dimensional, low-sample settings is important in many machine learning tasks. We improve current methods for sampling in Euclidean spaces by avoiding independence, and instead consider ways to couple samples. We show fundamental connections to optimal transport theory, leading to novel sampling algorithms, and providing new theoretical grounding for existing strategies. We compare our new strategies against prior methods for improving sample efficiency, including QMC, by studying discrepancy. We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling.
View details

The Geometry of Random Features

Mark Rowland

Richard Turner

Adrian Weller

International Conference on Artificial Intelligence and Statistics (AISTATS) (2018)

Preview abstract
We present an in-depth examination of the effectiveness of radial basis function kernel (beyond
Gaussian) estimators based on orthogonal random feature maps. We show that orthogonal estimators
outperform state-of-the-art mechanisms that use iid sampling under weak conditions for tails of the
associated Fourier distributions. We prove that for the case of many dimensions, the superiority of the orthogonal transform over iid methods can be accurately measured by a property we define called the charm of the kernel, and that orthogonal random features provide optimal kernel estimators.
Furthermore, we provide the first theoretical results which explain why orthogonal random features outperform unstructured on downstream tasks such as kernel ridge regression by showing that orthogonal random features provide kernel algorithms with better spectral properties than the previous state-of-the-art. Our results enable practitioners more generally to estimate the benefits from applying orthogonal transforms.
View details

Caching with Dual Costs

Anirban Dasgupta

Proceedings of the 26th International Conference on World Wide Web Companion (2017), pp. 643-652

Preview abstract
Caching mechanisms in distributed and social settings face the issue that the items can frequently change, requiring the cached ver- sions to be updated to maintain coherence. There is thus a trade-off between incurring cache misses on read requests and cache hits on update requests. Motivated by this we consider the following dual cost variant of the classical caching problem: each request for an item can be either a read or a write. If the request is read and the item is not in the cache, then a read-miss cost is incurred and if the request is write and the item is in the cache, then a write-hit cost is incurred. The goal is to design a caching algorithm that minimizes the sum of read-miss and write-hit costs. We study online and offline algorithms for this problem.
For the online version of the problem, we obtain an efficient algorithm whose cost is provably close to near-optimal cost. This algorithm builds on online algorithms for classical caching and metrical task systems, using them as black boxes. For the offline ver- sion, we obtain an optimal deterministic algorithm that is based on a minimum cost flow. Experiments on real and synthetic data show that our online algorithm incurs much less cost compared to natural baselines, while utilizing cache even better; furthermore, they also show that the online algorithm is close to the offline optimum.
View details

Structured adaptive and random spinners for fast machine learning computations

Mariusz Bojarski

Anna Choromanska

Francois Fagan

Cedric Gouy-Pailler

Anne Morvan

Nourhan Sakr

Jamal Atif

AISTATS (2017)

Preview abstract
We consider an efficient computational framework for speeding up several machine learning algorithms with almost no loss of accuracy. The proposed framework relies on projections via structured matrices that we call Structured Spinners, which are formed as products of three structured matrix-blocks that incorporate rotations. The approach is highly generic, i.e. i) structured matrices under consideration can either be fully-randomized or learned, ii) our structured family contains as special cases all previously considered structured schemes, iii) the setting extends to the non-linear case where the projections are followed by non-linear functions, and iv) the method finds numerous applications including kernel approximations via random feature maps, dimensionality reduction algorithms, new fast cross-polytope LSH techniques, deep learning, convex optimization algorithms via Newton sketches, quantization with random projection trees, and more. The proposed framework comes with theoretical guarantees characterizing the capacity of the structured model in reference to its unstructured counterpart and is based on a general theoretical principle that we describe in the paper. As a consequence of our theoretical analysis, we provide the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the HD3HD2HD1 structured matrix [Andoni et al., 2015]. The exhaustive experimental evaluation confirms the accuracy and efficiency of structured spinners for a variety of different applications.
View details

On Estimating the Average Degree

Anirban Dasgupta

Ravi Kumar

23rd International World Wide Web Conference, WWW '14, ACM (2014) (to appear)

Preview abstract
Networks are characterized by nodes and edges. While there has been a spate of recent work on estimating the number of nodes in a network, the edge-estimation question appears to be largely unaddressed. In this work we consider the problem of estimating the average degree of a large network using efficient random sampling, where the number of nodes is not known to the algorithm. We propose a new estimator for
this problem that relies on access to edge samples under a prescribed distribution. Next, we show how to efficiently realize this ideal estimator in a random walk setting. Our estimator has a natural and simple implementation using random walks; we bound its performance in terms of the mixing time of the underlying graph. We then show that our estimators are both provably and practically better than many natural estimators for the problem. Our work contrasts with existing theoretical work on estimating average degree, which assume a uniform random sample of nodes is available and the number of nodes is known.
View details

Permutation Indexing: Fast Approximate Retrieval from Large Corpora

Maxim Gurevich

22nd International Conference on Information and Knowledge Management (CIKM), ACM (2013)

Preview abstract
Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques:
partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and
a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids.
Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.
View details

Fastfood - Approximating Kernel Expansions in Loglinear Time

Alex Smola

30th International Conference on Machine Learning (ICML), Omnipress (2013)

Preview abstract
Fast nonlinear function classes are crucial for nonparametric estimation, such as in kernel methods. This paper proposes an improvement to random kitchen sinks that offers significantly faster computation in log-linear time without sacrificing accuracy. Furthermore, we show how one may adjust the regularization properties of the kernel simply by changing the spectral distribution of the projection matrix. We provide experimental results which show that even for for moderately small problems we already achieve two orders of magnitude faster computation and three orders of magnitude lower memory footprint.
View details

Optimal Hashing Schemes for Entity Matching

Nilesh Dalvi

Vibhor Rastogi

Anirban Dasgupta

Anish Das Sarma

22nd International World Wide Web Conference, WWW '13, ACM, Rio de Janeiro, Brazil (2013), pp. 295-306

Preview abstract
In this paper, we consider the problem of devising blocking schemes for entity matching. There is a lot of work on blocking techniques for supporting various kinds of predicates, e.g. exact matches, fuzzy string-similarity matches, and spatial matches. However, given a complex entity matching function in the form of a Boolean expression over several such predicates, we show that it is an important and non-trivial problem to combine the individual blocking techniques into an efficient blocking scheme for the entity matching function, a problem that has not been studied previously.
In this paper, we make fundamental contributions to this problem. We consider an abstraction for modeling complex entity matching functions as well as blocking schemes. We present several results of theoretical and practical interest for the problem. We show that in general, the problem of computing the optimal blocking strategy is NP-hard in the size of the DNF formula describing the matching function. We also present several algorithms for computing the exact optimal strategies (with exponential complexity, but often feasible in practice) as well as fast approximation algorithms. We experimentally demonstrate over commercially used rule-based matching systems over real datasets at Yahoo!, as well as synthetic datasets, that our blocking strategies can be an order of magnitude faster than the baseline methods, and our algorithms can efficiently find good blocking strategies.
View details

No Results Found