# Ankit Singh Rawat

### Research Areas

Authored Publications

Google Publications

Other Publications

Sort By

Serving Graph Compression for Graph Neural Networks

Cho-Jui Hsieh

International Conference on Learning Representations (2023) (to appear)

Preview abstract
Serving a GNN model in online applications is challenging --- one has to propagate the information from training nodes to testing nodes to achieve the best performance, while storing the whole training set (including training graph and node features) during inference time is prohibitive for most of the real world applications. We tackle this serving space compression problem in the paper, where the goal is to compress the storage requirement for GNN serving. Given a model to be served, the proposed method constructs a small set of virtual representative nodes to replace the original training nodes, so that users just need to replace the original training set by this virtual representative set to reduce the space requirement for serving, without the need of changing the actual GNN model and the forward pass.
We carefully analyze the error in the forward pass and derive simple ways to construct the node features and graph of virtual representative nodes to minimize the approximation error. Experimental results demonstrate that the proposed method can significantly reduce the serving space requirement for GNN inference.
View details

On Emergence of Activation Sparsity in Trained Transformers

Chong You

Daliang Li

Ke Ye

International Conference on Learning Representations (2023) (to appear)

Preview abstract
This paper reveals a curious observation that modern large-scale machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by ``sparse'' we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, etc. Moreover, larger Transformers with more layers and higher MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. To probe why sparsity emerges, we design experiments with random labels, random images, and infinite data, and find that sparsity may be due primarily to optimization while has little to do with the properties of training dataset. We discuss how sparsity immediately implies a means for significantly reducing the FLOP count and improving efficiency for Transformers. Moreover, we demonstrate perhaps surprisingly that explicitly enforcing an even sparser activation via Top-K thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.
View details

Supervision complexity and its role in knowledge distillation

Hrayr Harutyunyan

Aditya Krishna Menon

International Conference on Learning Representations (2023) (to appear)

Preview abstract
Knowledge distillation is a popular method of compressing a large teacher model (or an ensemble of models) to a more compact student model. While empirically effective, there is limited understanding of why distillation helps, and how to improve it to transfer richer knowledge from the teacher to student. In this paper, we propose a new online distillation algorithm that applies distillation using a sequence of teacher models, corresponding to different checkpoints during teacher training. Intuitively, this gradually increases the complexity of the target functions that the student model is asked to mimic. Formally, we establish generalization bounds that explicate how the target label complexity can benefit the student. We empirically demonstrate that online distillation can significantly improve over regular offline distillation, particularly in scenarios where there is a large teacher-student capacity gap.
View details

Teacher Guided Training: An Efficient Framework for Knowledge Transfer

Chong You

Himanshu Jain

Rob Fergus

International Conference on Learning Representations (2023) (to appear)

Preview abstract
The remarkable performance gains realized by large pretrained models, e.g., GPT-3, hinge on the massive amounts of data they are exposed to during training. Analogously, distilling such large models to compact models for efficient deployment also necessitates a large amount of (labeled or unlabeled) training data. In this paper, we devise teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pre-trained \emph{generative} models while obviating the need to go through a large volume of data. TGT exploits the fact that the teacher has acquired a good representation of the underlying data domain, which typically corresponds to a much lower dimensional manifold than the ambient space. Furthermore, we can use the teacher to explore the instance space more efficiently through sampling or gradient-based methods; thus, making TGT especially attractive for limited data or long-tail settings. We formally capture this benefit of proposed data-domain exploration in our generalization bounds. Among our empirical evaluations, we find that TGT can improve accuracy on ImageNet-LT by 10% compared to natural baseline and match accuracy on sentiment analysis on Amazon reviews without the need for pretraining.
View details

A Fourier Approach to Mixture Learning

Mingda Qiao

Guru Prashanth Guruganesh

Avinava Dubey

Conference on Neural Information Processing Systems (2022) (to appear)

Preview abstract
We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\N(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \R^d$ up to a small error. The hardness of this learning problem can be measured by the \emph{separation} $\Delta$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\poly(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$.
In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. This was previously open even in one dimension. More generally, our algorithm runs in time $\poly(k)\cdot f(d, \Delta, \eps)$, and is thus fixed-parameter tractable in parameters $d$, $\Delta$ and $\eps$.
Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.
View details

In defense of dual-encoders for neural ranking

Aditya Krishna Menon

International Conference on Machine Learning (ICML) (2022)

Preview abstract
Transformer-based models such as BERT have proven successful in information retrieval problem, which seek to identify relevant documents for a given query. There are two broad flavours of such models: cross-attention (CA) models, which learn a joint embedding for the query and document, and dual-encoder (DE) models, which learn separate embeddings for the query and document. Empirically, CA models are often found to be more accurate, which has motivated a series of works seeking to bridge this gap. However, a more fundamental question remains less explored: does this performance gap reflect an inherent limitation in the capacity of DE models, or a limitation in the training of such models? And does such an understanding suggest a principled means of improving DE models? In this paper, we study these questions, with three contributions. First, we establish theoretically that with a sufficiently large embedding dimension, DE models have the capacity to model a broad class of score distributions. Second, we show empirically that on real-world problems, DE models may overfit to spurious correlations in the training set, and thus under-perform on test samples. To mitigate this behaviour, we propose a novel distillation strategy that leverages confidence margins, and confirm its practical efficacy on the MSMARCO-Passage benchmark.
View details

Post-hoc estimators for learning to defer to an expert

Aditya Krishna Menon

Advances in Neural Information Processing Systems (2022)

Preview abstract
Many practical settings allow a learner to defer predictions to one or more costly experts. For example, the learning to defer paradigm allows a learner to defer to a human expert, at some monetary cost. Similarly, the adaptive inference paradigm allows a base model to defer to one or more large models, at some computational cost. The goal in these settings is to learn classification and deferral mechanisms to optimise a suitable accuracy-cost tradeoff. To achieve this, a central issue studied in prior work is the design of a coherent loss function for both mechanisms. In this work, we demonstrate that existing losses have two subtle limitations: they can encourage underfitting when there is a high cost of deferring, and the deferral function can have a weak dependence on the base model predictions. To resolve these issues, we propose a post-hoc training scheme: we train a deferral function on top of a base model, with the objective of predicting to defer when the base model's error probability exceeds the cost of the expert model. This may be viewed as applying a partial surrogate to the ideal deferral loss, which can lead to a tighter approximation and thus better performance. Empirically, we verify the efficacy of post-hoc training on benchmarks for learning to defer and adaptive inference.
View details

Disentangling sampling and labeling bias for learning in large-output spaces

Aditya Krishna Menon

International Conference on Machine Learning (ICML) 2021

Preview abstract
Negative sampling is a widely adopted technique to enable efficient training in settings with a large number of classes. Typically, negative sampling approaches aim at approximating the value or gradient of the computationally expensive loss function that takes all the negative labels into account. In this work, we study the connection between negative sampling approaches and loss modification techniques for countering label imbalance. We show that different (bias) correction strategies that accompany negative sampling approaches can have unintended consequences on the model's performance on various data sub-populations. We then propose a unified approach to tackle both sampling bias, arising from working with a subset of all negative classes, and labeling bias, which is inherently present in the data due to label-imbalance. Finally, we verify our analysis and demonstrate the utility of our unified approach through empirical evaluation on standard image classification and retrieval benchmarks.
View details

RankDistil: Distillation for Ranking

Aditya Krishna Menon

AISTATS 2021 (2021)

Preview abstract
Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings. In particular, the use of distillation in top-k ranking settings, where the goal is to rank k most relevant items correctly, remains largely unexplored. In this paper, we study such ranking problems through the lens of distillation. We present a framework for distillation for top-k ranking and establish connections with the existing ranking methods. The core idea of this framework is to preserve the ranking at the top by matching the k largest scores of student and teacher while penalizing large scores for items ranked low by the teacher. Building on our framework, we develop a novel distillation approach, RankDistil, specifically catered towards ranking problems with a large number of items to rank. Finally, we conduct experiments which demonstrate that RankDistil yields benefits over commonly used baselines for ranking problems.
View details

Overparameterisation and worst-case generalisation: friend or foe?

Aditya Krishna Menon

International Conference on Learning Representations (ICLR) 2021

Preview abstract
Overparameterised neural networks have demonstrated the remarkable ability to perfectly fit training samples, while still generalising to unseen test samples.However, several recent works have revealed that such models’ good average performance does not always translate to good worst-case performance: in particular, they may perform poorly on under-represented subgroups in the training set. In this paper, we show that in certain settings, overparameterised models’ bias against under-represented samples may be easily corrected via post-hoc processing. Specifically, we demonstrate such models’ bias can be restricted to their classification layers, and manifests in structured shifts in predictions for rare subgroups. We de-tail two post-hoc correction techniques to eliminate this bias, which operate purely on the original models’ outputs. We empirically verify that with such post-hoc correction, overparameterisation can improve worst-case performance.
View details

Long-tail learning via logit adjustment

Aditya Krishna Menon

Himanshu Jain

International Conference on Learning Representations (ICLR) 2021

Preview abstract
Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naive learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques involve logit adjustment based on the label priors, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a high relative margin between logits of rare versus dominant labels. Our techniques unify and generalise several recent proposals in the literature, while possessing stronger theoretical guarantees and empirical performance.
View details

A statistical perspective on distillation

Aditya Krishna Menon

International Conference on Machine Learning (ICML) 2021 (to appear)

Preview abstract
Knowledge distillation is a technique for improving a ``student'' model by replacing its one-hot training labels with a label distribution obtained from a ``teacher'' model. Despite its broad success, several basic questions --- e.g., Why does distillation help? Why do more accurate teachers not necessarily distill better? --- have received limited formal study. In this paper, we present a statistical perspective on distillation which provides an answer to these questions. Our core observation is that a ``Bayes teacher'' providing the true class-probabilities can lower the variance of the student objective, and thus improve performance. We then establish a bias-variance tradeoff that quantifies the value of teachers that approximate the Bayes class-probabilities. This provides a formal criterion as to what constitutes a ``good'' teacher, namely, the quality of its probability estimates. Finally, we illustrate how our statistical perspective facilitates novel applications of distillation to bipartite ranking and multiclass retrieval.
View details

Can gradient clipping mitigate label noise?

Aditya Krishna Menon

International Conference on Learning Representations (ICLR) (2020)

Preview abstract
Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum. This intuition has been made precise in a line of recent works, which show that suitable clipping can yield significantly faster convergence than vanilla gradient descent. In this paper, we study gradient clipping from an robustness lens: informally, one expects clipping to provide robustness to noise, since one does not overly trust any single sample. Surprisingly, we prove that gradient clipping does not in general provide robustness to label noise. On the other hand, we show that robustness is achieved by a form of loss clipping. This yields a simple, noise-robust alternative to the standard cross-entropy loss which performs well empirically.
View details

$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

Chulhee Yun

Advances in Neural Information Processing Systems (2020)

Preview abstract
Transformer networks use pairwise attention to compute contextual embeddings of their inputs, and have achieved the state of the art performance in many NLP tasks.
However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute attention in each layer. This has prompted recent research into faster attention models, with a predominant approach involving sparsifying the connections in the attention layers. While empirically promising for long sequences, several fundamental questions remain unanswered: Can sparse transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How does the sparsity pattern and the sparsity level affect their performance? In this paper, we provide a \emph{unifying framework} that captures existing sparse attention models. Our analysis proposes sufficient conditions under which we show that a sparse attention model can provably \emph{universally approximate} any sequence-to-sequence functions. Surprisingly, our results show the existence of attention models with only $O(n)$ connections per attention layer that can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns and levels of sparsity on standard NLP tasks.
View details

Modifying Memories in Transformer Models

Chen Zhu

Daliang Li

International Conference on Machine Learning (ICML) 2021 (2020)

Preview abstract
Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make transformers forget specific old facts and memorize new ones. In this paper, we propose a new task of \emph{explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts}. This task is useful in many scenarios, such as updating stale knowledge, protecting privacy, and eliminating unintended biases stored in the models. We benchmarked several approaches that provide natural baseline performances on this task. This leads to the discovery of key components of a Transformer model that are especially effective for knowledge modifications. The work also provides insights into the role that different training phases (such as pretraining and fine-tuning) play towards memorization and knowledge modification.
View details

Are Transformers universal approximators of sequence-to-sequence functions?

Chulhee Yun

International Conference on Learning Representations (ICLR) (2020)

Preview abstract
Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other architectures that can compute contextual mappings and empirically evaluate them.
View details

Low-Rank Bottleneck in Multi-head Attention Models

Chulhee Yun

International Conference on Machine Learning (ICML) 2020

Preview abstract
Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one of the important factors contributing to the large embedding size requirement. In particular, our analysis highlights that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads, causing this limitation, which we further validate with our experiments. As a solution we propose to set the head size of an attention unit to input sequence length, and independent of the number of heads, resulting in multi-head attention layers with provably more expressive power. We empirically show that this allows us to train models with a relatively smaller embedding dimension and with better performance scaling.
View details

Adversarial robustness via robust low rank representations

Aravindan Vijayaraghavan

Himanshu Jain

Pranjal Awasthi

2020 Conference on Neural Information Processing (NeurIPS)

Preview abstract
Adversarial robustness measures the susceptibility of a classifier to imperceptible perturbations made to the inputs at test time. In this work we highlight the benefits of natural low rank representations that often exist for real data such as images, for training neural networks with certified robustness guarantees.
Our first contribution is for certified robustness to perturbations measured in $\ell_2$ norm. We exploit low rank data representations to provide improved guarantees over state-of-the-art randomized smoothing-based approaches on standard benchmark datasets such as CIFAR-10.
Our second contribution is for the more challenging setting of certified robustness to perturbations measured in $\ell_\infty$ norm. We demonstrate empirically that natural low rank representations have inherent robustness properties that can be leveraged to provide significantly better guarantees for certified robustness to $\ell_\infty$ perturbations. Our certificate of $\ell_\infty$ robustness relies on a natural quantity involving the $\infty \to 2$ matrix operator norm associated with the representation, to translate robustness guarantees from $\ell_2$ to $\ell_\infty$ perturbations. A key technical ingredient for our certification guarantees is a fast algorithm based on the multiplicative weights update method to provide sharp upper bounds on the above matrix norm
View details

Learning and Recovery in the ReLU Model

Arya Mazumdar

Proceedings of 57th Annual Allerton Conference on Communication, Control, and Computing, 2019

Preview abstract
Rectified linear units, or ReLUs, have become a preferred activation function for artificial neural networks. In this paper we consider two basic learning problems assuming that the underlying data follow a generative model based on a simple network with ReLU activations. The first problem we study corresponds to learning a generative model in the presence of nonlinearity (modeled by the ReLU functions). Given a set of signal vectors $\mathbf{y}^i \in \mathbb{R}^d, i =1, 2, \dots , n$, we aim to learn the network parameters, i.e., the $d\times k$ matrix $A$, under the model $\mathbf{y}^i = \mathrm{ReLU}(A\mathbf{c}^i +\mathbf{b})$, where $\mathbf{b}\in \mathbb{R}^d$ is a random bias vector. We show that it is possible to recover the column space of $A$ within an error of $O(d)$ (in Frobenius norm) under certain conditions on the distribution of $\mathbf{b}$.
The second problem we consider is that of robust recovery of the signal in the presence of outliers. In this setting, we are interested in recovering the latent vector $\mathbf{c}$ from its noisy nonlinear images of the form $\mathbf{v} = \mathrm{ReLU}(A\mathbf{c}) + \mathbf{e}+\mathbf{w}$, where $\mathbf{e} \in \mathbb{R}^d$ denotes the outliers with sparsity $s$ and $\mathbf{w} \in \mathbb{R}^d$ denote the dense but small noise. We show that the LASSO algorithm recovers $\mathbf{c} \in \mathbb{R}^k$ within an $\ell_2$-error of $O\big(\sqrt{{((k+s)\log d})/{d}}\big)$ when $A$ is a random Gaussian matrix.
View details

Lifting high-dimensional non-linear models with Gaussian regressors

Christos Thrampoulidis

22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019

Preview abstract
We study the problem of recovering a structured signal $\mathbf{x}_0$ from high-dimensional data $\y_i=f(\mathbf{a}_i^T\mathbf{x}_0)$ for some nonlinear (and potentially unknown) link function $f$, when the regressors $\ab_i$ are iid Gaussian. Brillinger (1982) showed that ordinary least-squares estimates $\x_0$ up to a constant of proportionality $\mu_\ell$, which depends on $f$.
Recently, Plan \& Vershynin (2015) extended this result to the high-dimensional setting deriving sharp error bounds for the generalized Lasso. Unfortunately, both least-squares and the Lasso fail to recover $\mathbf{x}_0$ when $\mu_\ell=0$. For example, this includes all even link functions. We resolve this issue by proposing and analyzing an alternative convex recovery method. In a nutshell, our method treats such link functions as if they were linear in a lifted space of higher-dimension. Interestingly, our error analysis captures the effect of both the nonlinearity and the problem's geometry in a few simple summary parameters.
View details

Robust Direction of Arrival Estimation in the Presence of Array Faults using Snapshot Diversity

Gary Lee

Gregory W Wornell

7th IEEE Global Conference on Signal and Information Processing (GlobalSIP) 2019

Preview abstract
Many direction-of-arrival (DOA) estimation algorithms require accurate measurements from all sensing elements on an antenna array. However, in various practical settings, it becomes imperative to perform DOA estimation even in the presence of faulty elements. In this work, we develop an algorithm that can jointly estimate the DOA of sources and the locations of the faulty elements. This is achieved by introducing weights that describe the degree of outlierness of each element. Further, for situations where only single snapshots are available, we propose a new snapshot diversity formulation for which our algorithm can still be applied. Simulation results over four different fault models demonstrate that the proposed algorithm robustly estimates DOAs and accurately identifies the faulty elements.
View details

Sampled softmax with random fourier features

Jiecao (Jack) Chen

Advances in Neural Information Processing Systems (NeurIPS) (2019)

Preview abstract
The computational cost of training with softmax cross entropy loss grows linearly with the number of classes. For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the gradient based on these classes, known as the \emph{sampled softmax} method. However, the sampled softmax provides a biased estimate of the gradient unless the samples are drawn from the exact softmax distribution, which is again expensive to compute. Therefore, a widely employed practical approach (without theoretical justification) involves sampling from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of the role that different sampling distributions play in determining the quality of sampled softmax. Motivated by our analysis and the work on kernel-based sampling, we propose the {\em Random Fourier Softmax} (RF-softmax) method that utilizes the powerful Random Fourier features to enable more efficient and accurate sampling from the (approximate) softmax distribution. We show that RF-softmax leads to low biased estimation in terms of both the full softmax distribution and the full softmax gradient. Furthermore, the cost of RF-softmax scales only logarithmically with the number of classes.
View details

Multilabel reductions: what is my loss optimising?

Aditya Krishna Menon

Advances in Neural Information Processing Systems (NeurIPS) (2019)

Preview abstract
Multilabel classification is a challenging problem arising in applications ranging from information retrieval to image tagging. A popular approach to this problem is to employ a reduction to a suitable series of binary or multiclass problems (e.g., computing a softmax based cross-entropy over the relevant labels). While such methods have seen empirical success, less is understood about how well they approximate two fundamental performance measures: the precision and recall@k. In this paper, we study three commonly used reductions, and two new reductions based on a normalised loss function, wherein the contribution of each instance is normalised by the number of relevant labels. A surprising outcome of our study is that that each reduction is provably consistent with respect to either precision or recall, but not both. Further, we explicate that the probability scores obtained from reductions focussed on precision must be interpreted with caution. We empirically validate our results on real-world datasets, showing in particular that our normalised loss function yields recall gains over existing reductions.
View details

Robust Gradient Descent via Moment Encoding and LDPC Codes

Raj Kumar Maity

Arya Mazumdar

IEEE International Symposium on Information Theory (ISIT) 2019

Preview abstract
This paper considers the problem of implementing large-scale gradient descent algorithms in a distributed computing setting in the presence of straggling processors. To mitigate the effect of the stragglers, it has been previously proposed to encode the data with an erasure-correcting code and decode at the master server at the end of the computation. We, instead, propose to encode the second-moment of the data with a low density parity-check (LDPC) code. The iterative decoding algorithms for LDPC codes have very low computational overhead and the number of decoding iterations can be made to automatically adjust with the number of stragglers in the system. For a random model for stragglers, we obtain the convergence guarantees for the proposed solution by viewing it as the stochastic gradient descent method. Furthermore, the proposed solution outperforms the existing schemes in a real distributed computing setup.
View details

Reliable Distributed Clustering with Redundant Data Assignment

Venkata Gandikota

Arya Mazumdar

ICML Workshop on Coding Theory for Large-scale Machine Learning (2019)

Preview abstract
In this work we present distributed generalized clustering algorithms (with k-means and PCA as
special cases) that can handle large scale data across multiple machines in spite of straggling or
unreliable machines. We propose a novel data assignment scheme that enables us to obtain global information about data even when some machines fail to respond. The assignment scheme leads to
distributed algorithms with good approximation guarantees for a variety of clustering and dimensionality reduction problems.
View details

No Results Found