Avinava Dubey
Research Areas
Authored Publications
Sort By
A Fourier Approach to Mixture Learning
Mingda Qiao
Guru Prashanth Guruganesh
Conference on Neural Information Processing Systems (2022) (to appear)
Preview abstract
We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\N(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \R^d$ up to a small error. The hardness of this learning problem can be measured by the \emph{separation} $\Delta$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\poly(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$.
In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. This was previously open even in one dimension. More generally, our algorithm runs in time $\poly(k)\cdot f(d, \Delta, \eps)$, and is thus fixed-parameter tractable in parameters $d$, $\Delta$ and $\eps$.
Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.
View details
Chefs’ Random Tables: Non-Trigonometric Random Features
Valerii Likhosherstov
Frederick Liu
Adrian Weller
NeurIPS 2022 (2022) (to appear)
Preview abstract
We present \textit{chefs' random tables} (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive -- a critical requirement for prominent applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the FAVOR++ mechanism, is to our knowledge the first RF method for unbiased softmax kernel estimation with positive \& bounded RFs, resulting in strong concentration results characterized by exponentially small tails, and much lower variance. As we show, orthogonal random features applied in FAVOR++ provide additional variance reduction for any dimensionality $d$ (not only asymptotically for sufficiently large $d$, as for RKS). We exhaustively test CRTs and FAVOR++ on tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new SOTA for low-rank text Transformers, while providing linear space \& time complexity.
View details
LEARNING THE TRANSFORMER KERNEL
Sankalan Chowdhury
Adamos Solomou
Mrinmaya Sachan
Transactions on Machine Learning Research (2022)
Preview abstract
In this work we introduce KL-Transformer, a generic, scalable, data driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only helps in learning a generic kernel end-to-end, but also reduces the time and space complexity of the Transformers from quadratic to linear. We show that KL-Transformers achieve performance comparable to existing efficient Transformer architectures, both in terms of accuracy and computational efficiency. Our study also demonstrates that the choice of the kernel has a substantial impact on performance, and kernel learning variants are competitive alternatives to fixed kernel Transformers, both in long as well as short sequence tasks.
View details
DAG-structured Clustering by Nearest-Neighbors
Nicholas Monath
Andrew McCallum
International Conference on Artificial Intelligence and Statistics (2021)
Preview abstract
Hierarchical clusterings compactly encode multiple granularities of clusters within a tree structure. Hierarchies, by definition, fail to capture different flat partitions that are not subsumed in one another. In this paper, we advocate for an alternative structure for representing multiple alternative clusterings, a directed acyclic graph (DAG). By allowing nodes to have multiple parents, DAG structure is not only more flexible than a tree but also allows for points to be members of multiple clusters. We describe a large-scale, map-reduce-based algorithm to infer these DAGs. Our algorithm works by simply merging nearest neighbor substructures to form a DAG structure. Our algorithm is supported with theoretical guarantees showing its representational capacity over tree-based algorithms. Further, we provide comprehensive empirical experiments on large-scale clustering benchmarks and entity resolution datasets. Our results show that our method is as scalable as and more accurate than state-of-the-art tree-based techniques.
View details
Scalable Hierarchical Agglomerative Clustering
Nick Monath
Guru Prashanth Guruganesh
Andrew McCallum
Gokhan Mergen
Mert Terzihan
Bryon Tjanaka
Yuchen Wu
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2021), 1245–1255
Preview abstract
The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to non-parametric DP-Means objective [32]. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method’s scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters show that our method finds better quality of clusters than the current state-of-the-art.
View details
Exact and Approximate Hierarchical Clustering Using A*
Craig Greenberg
Sebastian Macaluso
Nicholas Monath
Patrick Flaherty
Kyle Cranmer
Andrew McCallum
Uncertainty in Artificial Intelligence (2021)
Preview abstract
Hierarchical clustering is known to be broadly applicable in myriad domains. Despite its extensive use, existing approximate inference methods are insufficient for applications that require either exact or high-quality approximate solutions.For example, in high energy physics, we are interested in discovering high-quality jet structures, which are hierarchical clusterings of particles.In this paper we view inference as a search problem and focus on inferring high-quality hierarchies for a given cost function, rather than using ad hoc (e.g., greedy or beam) methods. This leads naturally to the use of A*, which has seldom been used for clustering (with the notable exception of \citep{daume2007fast}). Unlike ad hoc search methods, A* carries with it optimality guarantees. However, applying A* search naively leads to a large space and time complexity. To address this challenge, we develop a novel augmented trellis data structure and dynamic programming algorithm for A* that result in substantially improved time and space complexity bounds while still computing the globally optimal hierarchical clustering. We demonstrate that our proposed method increases the number of points for which an exact solution can be found by 25$\%$ compared with previous work \cite{greenberg2020compact}. Furthermore, our approach yields a natural approximation that scales to larger datasets and achieves substantially higher quality results than ad hoc search baselines, motivating its use in applications demanding exact or high-quality approximations.
View details
Preview abstract
Modern learning algorithms excel at producing accurate but complex models of the data. However, deploying such models in the real-world requires extra care: we must ensure their reliability, robustness, and absence of undesired biases. This motivates the development of models that are equally accurate but can be also easily inspected and assessed beyond their predictive performance. To this end, we introduce \emph{contextual explanation networks} ({\CENs})---a class of architectures that learn to predict by generating and utilizing intermediate, simplified probabilistic models. Specifically, {\CENs} generate parameters for intermediate graphical models which are further used for prediction and play the role of explanations. Contrary to the existing \emph{post-hoc} model-explanation tools, {\CENs} learn to predict and to explain simultaneously. Our approach offers two major advantages: (i) for each prediction, valid, instance-specific explanation is generated with no computational overhead and (ii) prediction via explanation acts as a regularizer and boosts performance in data-scarce settings. We analyze the proposed framework theoretically and experimentally. Our results on image and text classification and survival analysis tasks demonstrate that {\CENs} are not only competitive with the state-of-the-art methods but also offer additional insights behind each prediction, that can be valuable for decision support. We also show that while post-hoc methods may produce misleading explanations in certain cases, {\CENs} are consistent and allow to detect such cases systematically.
View details
Preview abstract
Bayesian nonparametric (BNP) models provide elegant methods for discovering underlying latent features within a data set, but inference in such models can be slow. We exploit the fact that completely random measures, which commonly used models like the Dirichlet process and the beta-Bernoulli process can be expressed as, are decomposable into independent sub-measures. We use this decomposition to partition the latent measure into a finite measure containing only instantiated components, and an infinite measure containing all other components. We then select different inference algorithms for the two components: uncollapsed samplers mix well on the finite measure, while collapsed samplers mix well on the infinite, sparsely occupied tail. The resulting hybrid algorithm can be applied to a wide class of models, and can be easily distributed to allow scalable inference without sacrificing asymptotic convergence guarantees.
View details
Big Bird: Transformers for Longer Sequences
Guru Prashanth Guruganesh
Joshua Ainslie
Anirudh Ravula
Qifan Wang
Li Yang
NeurIPS (2020)
Preview abstract
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (in terms of memory mainly) on the sequence length due to their full attention mechanism. To remedy this, we propose, \emph{BigBird}, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that \emph{BigBird} is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis demonstrates the need for having an O(1) global tokens, such as CLS, that attend to the entire sequence as part of the sparse attentions. We show that the proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, \emph{BigBird} drastically improves performance on various NLP tasks such as question answering.
View details