Rajesh Jayaram
Rajesh is a Research Scientist at Google NYC, where he is part of the NYC Algorithms and Optimization team. He received his PhD in Computer Science from Carnegie Mellon University in 2021, where he was advised by David P. Woodruff. Prior to that, in 2017 he received his B.S. in Mathematics and Computer Science from Brown University. His research focuses primarily on sublinear algorithms, especially randomized sketching and streaming algorithms for big-data problems. In general, he enjoys thinking about dimensionality reduction: namely, to what extent can we compress the significant components of an enormous, noisy dataset? His work also spans property testing, machine learning, and optimization.
He also has a personal homepage.
Authored Publications
MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings
Neural embedding models have become a fundamental component of modern information retrieval (IR) pipelines. These models produce a single embedding x ∈ R^d per data point, allowing for fast retrieval via highly optimized maximum inner product search (MIPS) algorithms. Recently, beginning with the landmark ColBERT paper, multi-vector models, which produce a set of embeddings per data point, have achieved markedly superior performance on IR tasks. Unfortunately, using these models for IR is computationally expensive due to the increased complexity of multi-vector retrieval and scoring.
In this paper, we introduce MUVERA (Multi-Vector Retrieval Algorithm), a retrieval mechanism which reduces multi-vector similarity search to single-vector similarity search. This enables the use of off-the-shelf MIPS solvers for multi-vector retrieval. MUVERA asymmetrically generates Fixed Dimensional Encodings (FDEs) of queries and documents, which are vectors whose inner product approximates multi-vector similarity. We prove that FDEs give high-quality ε-approximations, thus providing the first single-vector proxy for multi-vector similarity with theoretical guarantees. Empirically, we find that FDEs achieve the same recall as prior state-of-the-art heuristics while retrieving 2-5× fewer candidates. Compared to prior state-of-the-art implementations, MUVERA achieves consistently good end-to-end recall and latency across a diverse set of BEIR retrieval datasets, achieving an average of 10% higher recall with 90% lower latency.
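As a rough illustration of the FDE idea, here is a minimal sketch (my own simplification, not the paper's exact construction; the hyperplane count, the `fde` helper, and the aggregation choices are illustrative assumptions): space is partitioned by random hyperplanes, each set's vectors are aggregated per partition, and the aggregates are concatenated, with queries summed and documents averaged so that a single inner product mimics the multi-vector score.

```python
import numpy as np

def fde(vectors, planes, agg):
    """Map a set of vectors to one fixed-dimensional encoding:
    hash each vector to a bucket by the signs of random hyperplane
    projections (SimHash), combine each bucket with `agg`, and
    concatenate the per-bucket results."""
    d = vectors.shape[1]
    buckets = [[] for _ in range(2 ** planes.shape[0])]
    for v in vectors:
        bits = (planes @ v > 0).astype(int)
        buckets[int("".join(map(str, bits)), 2)].append(v)
    return np.concatenate([agg(b) if b else np.zeros(d) for b in buckets])

rng = np.random.default_rng(0)
d, n_planes = 64, 3
planes = rng.normal(size=(n_planes, d))
query_vecs = rng.normal(size=(8, d))   # multi-vector query
doc_vecs = rng.normal(size=(20, d))    # multi-vector document

# Asymmetric aggregation: sum for queries, mean for documents, so the
# FDE dot product approximates a sum-of-max (Chamfer-style) similarity.
q_fde = fde(query_vecs, planes, lambda b: np.sum(b, axis=0))
d_fde = fde(doc_vecs, planes, lambda b: np.mean(b, axis=0))
print(q_fde @ d_fde)  # a single-vector proxy score, usable with MIPS
```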
HyperAttention: Long-context Attention in Near-Linear Time
Amin Karbasi
Amir Zandieh
Insu Han
David Woodruff
HyperAttention: Long-context Attention in Near-Linear Time (2024) (to appear)
In this paper, we introduce a novel approximate attention mechanism dubbed HyperAttention. Despite the rapidly increasing size and complexity of contexts used with Large Language Models (LLMs), there is still a dire lack of computationally efficient attention mechanisms that scale better than the naive quadratic time. HyperAttention addresses this gap: it delivers provably near-linear time complexity with respect to the size of the context, while incurring only a negligible loss in downstream quality. Distinctively, it integrates the principles of Locality Sensitive Hashing (LSH), for efficient detection of heavy elements, with uniform column sampling, allowing for a good approximation of both the heavy and light components of the attention matrix. HyperAttention provably approximates the attention layer in near-linear time, making it the first practical near-linear time approximate attention mechanism. Crucially, HyperAttention has a highly modular design, allowing seamless integration of other fast low-level implementations, most notably FlashAttention. Empirical evaluations indicate that HyperAttention surpasses existing methods, achieving order-of-magnitude speed-ups over prevalent state-of-the-art solutions such as FlashAttention. These results have significant implications for scaling LLMs to much larger contexts.
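To make the two ingredients concrete, here is a heavily simplified sketch under assumptions of my own (single head, no masking or softmax scaling, SimHash as the LSH family, and arbitrary parameter defaults); the actual algorithm identifies and rescales heavy entries far more carefully.

```python
import numpy as np

def hyper_attention_sketch(Q, K, V, n_planes=4, n_samples=32, rng=None):
    """Per-query attention restricted to (i) keys colliding with the
    query under SimHash (likely-heavy entries) and (ii) a uniform key
    sample whose contribution is rescaled to estimate the light mass."""
    rng = rng or np.random.default_rng(0)
    n, d = K.shape
    planes = rng.normal(size=(n_planes, d))
    weights = 2 ** np.arange(n_planes)
    key_codes = (K @ planes.T > 0).astype(int) @ weights
    q_codes = (Q @ planes.T > 0).astype(int) @ weights
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i, q in enumerate(Q):
        heavy = np.flatnonzero(key_codes == q_codes[i])
        sample = rng.choice(n, size=min(n_samples, n), replace=False)
        light = np.setdiff1d(sample, heavy)
        s_h, s_l = np.exp(K[heavy] @ q), np.exp(K[light] @ q)
        scale = (n - len(heavy)) / max(len(light), 1)  # rescale the sample
        denom = s_h.sum() + scale * s_l.sum()
        out[i] = (s_h @ V[heavy] + scale * (s_l @ V[light])) / denom
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(256, 16)) for _ in range(3))
print(hyper_attention_sketch(Q, K, V).shape)  # (256, 16)
```

Each query touches only O(bucket size + n_samples) keys rather than all n, which is where the subquadratic running time comes from.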
Streaming Euclidean MST to a Constant Factor
Amit Levi
Erik Waingarten
Xi Chen
55th Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
We study streaming algorithms for the fundamental geometric problem of computing the cost of the Euclidean Minimum Spanning Tree (MST) on an $n$-point set $X \subset \mathbb{R}^d$. In the streaming model, the points in $X$ can be added and removed arbitrarily, and the goal is to maintain an approximation in small space. In low dimensions, $(1+\epsilon)$ approximations are possible in sublinear space. However, in high dimensions the best known approximation for this problem was $\tilde{O}(\log n)$, due to [Chen, Jayaram, Levi, Waingarten, STOC'22], improving on the prior $O(\log^2 n)$ bound due to [Andoni, Indyk, Krauthgamer, SODA'08]. In this paper, we break the logarithmic barrier and give the first constant-factor sublinear-space approximation to Euclidean MST. For any $0 < \epsilon \leq 1$, our algorithm achieves an $\tilde{O}(\epsilon^{-2})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space.
We complement this by demonstrating that any single-pass algorithm which obtains a better than $1.10$-approximation must use $\Omega(\sqrt{n})$ space, showing that $(1+\epsilon)$ approximations are not possible in high dimensions and that our algorithm is tight up to a constant. Nevertheless, we show that $(1+\epsilon)$ approximations are possible in sublinear space with $O(1/\epsilon)$ passes over the stream. More generally, for any $\alpha \geq 2$, we give an $\alpha$-pass streaming algorithm which achieves an $O(\frac{1}{\alpha \epsilon})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space.
All our streaming algorithms are linear sketches, and therefore extend to the massively parallel computation (MPC) model. Thus, our results imply the first $(1+\epsilon)$-approximation in a constant number of rounds in the MPC model. Previously, such a result was known only in low dimensions [Andoni, Nikolov, Onak, Yaroslavtsev, STOC'15], while prior high-dimensional algorithms either required $O(\log n)$ rounds or suffered an $O(\log n)$ approximation.
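For intuition, the folklore randomly shifted quadtree estimator below underlies the earlier logarithmic-approximation bounds cited above; it is an offline sketch (with assumed integer coordinates and illustrative constants), not the paper's constant-factor streaming algorithm.

```python
import numpy as np

def quadtree_mst_estimate(points, delta, rng=None):
    """Estimate the MST cost of integer points in [0, delta)^d using a
    randomly shifted grid: at scale 2^l, every nonempty cell beyond the
    first forces an MST edge of length roughly 2^l."""
    rng = rng or np.random.default_rng(0)
    shift = rng.integers(0, delta, size=points.shape[1])
    est, level = 0.0, 0
    while 2 ** level <= 2 * delta:
        cells = {tuple(p // 2 ** level) for p in points + shift}
        est += (len(cells) - 1) * 2 ** level
        level += 1
    return est

rng = np.random.default_rng(0)
pts = rng.integers(0, 1024, size=(500, 2))
print(quadtree_mst_estimate(pts, delta=1024))
```

Roughly speaking, a streaming algorithm maintains small sketches of the per-level nonempty-cell counts rather than the points themselves.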
Optimal Fully Dynamic k-Center Clustering for Adaptive and Oblivious Adversaries
Monika Henzinger
Andreas Wiese
Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)
Stars: Tera-Scale Graph Building for Clustering and Learning
Warren Schudy
NeurIPS 2022 (2022)
A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role in many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs which are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: first, constructing dense graphs is infeasible in practice for large datasets, and second, the runtime of downstream tasks is directly governed by the sparsity of the similarity graph. In this work, we present Stars: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners with significantly fewer similarity comparisons, which are a major bottleneck for learning-based models where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds a graph in nearly linear time in which approximate nearest neighbors are contained within two-hop neighborhoods. To complement these results, we have deployed Stars on multiple datasets, enabling graph building at tera-scale, i.e., for graphs with hundreds of billions of nodes and tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10- to 1000-fold reductions in pairwise similarity comparisons and significant runtime improvements with negligible quality loss.
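A toy sketch of the star construction (my own simplification with illustrative parameters, not the deployed system): hash points with random hyperplanes and connect every bucket's members to a single hub, so that any two colliding (hence likely similar) points are joined by a path of length at most two through the hub.

```python
import numpy as np
from collections import defaultdict

def two_hop_spanner(points, n_planes=6, reps=4, rng=None):
    """Build a sparse "star" graph: in every SimHash bucket, connect
    all members to one hub, so similar (colliding) points are linked
    by a two-hop path. Repetitions boost the collision probability."""
    rng = rng or np.random.default_rng(0)
    edges = set()
    for _ in range(reps):
        planes = rng.normal(size=(n_planes, points.shape[1]))
        buckets = defaultdict(list)
        for i, code in enumerate((points @ planes.T > 0).astype(int)):
            buckets[tuple(code)].append(i)
        for members in buckets.values():
            hub = members[0]  # the "star" center for this bucket
            edges.update((hub, m) for m in members[1:])
    return edges

rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 32))
print(len(two_hop_spanner(pts)))  # ~ n * reps edges, far below n^2
```

Only hub-to-member similarities ever need to be evaluated, which is the source of the savings in pairwise comparisons.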
Truly Perfect Samplers for Data Streams and Sliding Windows
In the $G$-sampling problem, the goal is to output an index $i$ of a vector $f \in \mathbb{R}^n$ such that for all coordinates $j \in [n]$,
$$\Pr[i = j] = (1 \pm \epsilon)\,\frac{G(f_j)}{\sum_{k \in [n]} G(f_k)} + \gamma,$$
where $G : \mathbb{R} \to \mathbb{R}_{\geq 0}$ is some non-negative function. If $\epsilon = 0$ and $\gamma = 1/\mathrm{poly}(n)$, the sampler is called perfect. In the data stream model, $f$ is defined implicitly by a sequence of updates to its coordinates, and the goal is to design such a sampler in small space. Jayaram and Woodruff (FOCS 2018) gave the first perfect $L_p$ samplers in turnstile streams, where $G(x) = |x|^p$, using $\mathrm{polylog}(n)$ space for $p \in (0,2]$. However, to date all known sampling algorithms are not truly perfect, since their output distribution is only point-wise $\gamma = 1/\mathrm{poly}(n)$ close to the true distribution. This small error can be significant when samplers are run many times on successive portions of a stream, and can leak potentially sensitive information about the data stream. In this work, we initiate the study of truly perfect samplers, with $\epsilon = \gamma = 0$, and comprehensively investigate their complexity in the data stream and sliding window models.
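For intuition, exponential scaling already gives a truly perfect ($\epsilon = \gamma = 0$) $L_p$ sample offline: the index minimizing $E_i / |f_i|^p$ with i.i.d. $E_i \sim \mathrm{Exp}(1)$ lands on $j$ with probability exactly $|f_j|^p / \sum_k |f_k|^p$. The sketch below stores $f$ explicitly; the challenge addressed in this line of work is simulating this in small space over a stream of updates.

```python
import numpy as np

def lp_sample(f, p, rng):
    """Truly perfect offline L_p sampler: argmin_i E_i / |f_i|^p with
    E_i ~ Exp(1) hits j with probability exactly |f_j|^p / ||f||_p^p."""
    E = rng.exponential(size=len(f))
    w = np.abs(f) ** p
    ratios = np.where(w > 0, E / np.maximum(w, 1e-300), np.inf)
    return int(np.argmin(ratios))

rng = np.random.default_rng(0)
f = np.array([3.0, -4.0, 0.0, 1.0])
samples = [lp_sample(f, p=2, rng=rng) for _ in range(100_000)]
print(np.bincount(samples, minlength=4) / len(samples))  # ~ [9, 16, 0, 1] / 26
```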
New Streaming Algorithms for High Dimensional EMD and MST
Amit Levi
Erik Waingarten
Xi Chen
54th Annual ACM Symposium on Theory of Computing (STOC'22) (2022)
We study streaming algorithms for two fundamental geometric problems: computing the cost of a Minimum Spanning Tree (MST) of an $n$-point set $X \subset \{1,2,\dots,\Delta\}^d$, and computing the Earth Mover Distance (EMD) between two multi-sets $A, B \subset \{1,2,\dots,\Delta\}^d$ of size $n$. We consider the turnstile model, where points can be added and removed. We give a one-pass streaming algorithm for MST and a two-pass streaming algorithm for EMD, both achieving an approximation factor of $\tilde{O}(\log n)$ and using only $\mathrm{polylog}(n, d, \Delta)$ space. Furthermore, our algorithm for EMD can be compressed to a single pass using $\Delta d \cdot \mathrm{polylog}(n, d)$ space whenever $|A \cap B| / |A \cup B| \leq 1 - \Omega(1)$. Previously, the best known sublinear-space streaming algorithms for either problem achieved an approximation of $O(\min\{\log n, \log(\Delta d)\} \log n)$ [AIK08, BDIRW20]. For MST, we also prove that any constant-space streaming algorithm can only achieve an approximation of $\Omega(\log n)$, analogous to the $\Omega(\log n)$ lower bound for EMD of [AIK08].
Our algorithms are based on an improved analysis of a recursive space partitioning method known generically as the Quadtree. Specifically, we show that the Quadtree achieves an $\tilde{O}(\log n)$ approximation for both EMD and MST, improving on the $O(\min\{\log n, \log(\Delta d)\} \log n)$ approximation of [AIK08, BDIRW20]. The main conceptual contribution is an analytical framework for studying the Quadtree which goes beyond worst-case distortion of randomized tree embeddings.
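As a minimal offline illustration of the Quadtree estimate for EMD (assuming equal-size multisets of integer points; the shift distribution and constants are simplifications of my own): at each scale, mass that cannot be matched within its own grid cell must travel roughly the cell's side length.

```python
import numpy as np
from collections import Counter

def quadtree_emd_estimate(A, B, delta, rng=None):
    """Estimate EMD(A, B) for equal-size multisets in [0, delta)^d:
    at grid scale 2^l, points unmatched within their cell contribute
    about 2^l each to the transport cost."""
    rng = rng or np.random.default_rng(0)
    shift = rng.integers(0, delta, size=A.shape[1])
    est, level = 0.0, 0
    while 2 ** level <= 2 * delta:
        ca = Counter(tuple(p) for p in (A + shift) // 2 ** level)
        cb = Counter(tuple(p) for p in (B + shift) // 2 ** level)
        surplus = sum(abs(ca[c] - cb[c]) for c in ca.keys() | cb.keys())
        est += surplus * 2 ** level / 2  # surplus counted at source and sink
        level += 1
    return est

rng = np.random.default_rng(0)
A = rng.integers(0, 1024, size=(300, 2))
B = np.clip(A + rng.integers(-5, 6, size=A.shape), 0, 1023)  # small EMD
print(quadtree_emd_estimate(A, B, delta=1024))
```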
Span Recovery for Deep Neural Networks with Applications to Input Obfuscation
The tremendous success of deep neural networks has motivated the need to better understand their fundamental properties, but many of the theoretical results proposed so far apply only to shallow networks. In this paper, we study an important primitive for understanding the meaningful input space of a deep network: span recovery. For $k < n$, let $A \in \mathbb{R}^{k \times n}$ be the innermost weight matrix of an arbitrary feed-forward neural network $M : \mathbb{R}^n \to \mathbb{R}$, so that $M(x)$ can be written as $M(x) = \sigma(Ax)$ for some network $\sigma : \mathbb{R}^k \to \mathbb{R}$. The goal is then to recover the row span of $A$ given only oracle access to the value of $M(x)$. We show that if $M$ is a multi-layered network with ReLU activation functions, then partial recovery is possible: namely, we can provably recover $k/2$ linearly independent vectors in the row span of $A$ using $\mathrm{poly}(n)$ non-adaptive queries to $M(x)$. Furthermore, if $M$ has differentiable activation functions, we demonstrate that full span recovery is possible even when the output is first passed through a sign or $0/1$ thresholding function; in this case our algorithm is adaptive. Empirically, we confirm that full span recovery can fail, but only for unrealistically thin layers; for reasonably wide networks, we obtain full span recovery on both random networks and networks trained on MNIST data. Furthermore, we demonstrate the utility of span recovery as an attack by inducing neural networks to misclassify data obfuscated by controlled random noise as sensical inputs.
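As a toy illustration of why recovery is possible (my own sketch, not the paper's query-efficient algorithm): by the chain rule, every gradient of $M(x) = \sigma(Ax)$ has the form $\nabla\sigma(Ax)^\top A$ and hence lies in the row span of $A$, so finite-difference gradients at random inputs span (part of) it.

```python
import numpy as np

def recover_row_span(M, n, n_probes=50, h=1e-5, rng=None):
    """Estimate span(A) from oracle access to M: collect numerical
    gradients at random inputs (each lies in the row span of A) and
    extract an orthonormal basis for their span via SVD."""
    rng = rng or np.random.default_rng(0)
    grads = []
    for _ in range(n_probes):
        x = rng.normal(size=n)
        grads.append([(M(x + h * e) - M(x - h * e)) / (2 * h)
                      for e in np.eye(n)])
    _, s, Vt = np.linalg.svd(np.array(grads))
    rank = int((s > 1e-6 * s.max()).sum())  # numerical rank
    return Vt[:rank]

# Toy one-layer network M(x) = sum(relu(A x)) with a random A.
rng = np.random.default_rng(1)
k, n = 3, 10
A = rng.normal(size=(k, n))
M = lambda x: np.maximum(A @ x, 0.0).sum()
print(recover_row_span(M, n, rng=rng).shape)  # expect (3, 10)
```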