Ankit Singh Rawat

Authored Publications

    Towards Understanding the Role of Attention in Prompt-tuning
    Christos Thrampoulidis
    Mahdi Soltanolkotabi
    Samet Oymak
    ICML 2023 (to appear)
    Prompt-tuning is an emerging strategy to adapt large language models (LLMs) to downstream tasks by learning a (soft-)prompt parameter from data. Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and the role of the attention mechanism in prompting. In this work, we explore prompt-tuning for one-layer attention architectures and study contextual mixture-models where each input token belongs to a context-relevant or -irrelevant set. We isolate the role of prompt-tuning through a self-contained prompt-attention model. Our contributions are as follows: (1) We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention under our contextual data model. (2) We analyze the initial trajectory of gradient descent and show that it learns the prompt and prediction head with near-optimal sample complexity, and we demonstrate how the prompt can provably attend to sparse context-relevant tokens. (3) Assuming a known prompt but an unknown prediction head, we characterize the exact finite-sample performance of prompt-attention, which reveals the fundamental performance limits and the precise benefit of the context information. We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information.
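    As a concrete illustration, here is a minimal numpy sketch of a self-contained prompt-attention layer of the kind the paper analyzes: a trainable prompt issues the attention query over the input tokens, and the attended summary is scored by a linear prediction head. The shapes and the key map W are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prompt_attention(X, p, W, v):
    """Prompt-attention sketch: the prompt p (rather than each token)
    queries the n input tokens in X; v is a linear prediction head."""
    weights = softmax(X @ W @ p)   # (n,) attention of the prompt over tokens
    context = weights @ X          # convex combination of token embeddings
    return context @ v             # scalar score

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))       # 8 tokens, 16-dim embeddings
p, v = rng.normal(size=16), rng.normal(size=16)
print(prompt_attention(X, p, np.eye(16), v))
```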
    Supervision complexity and its role in knowledge distillation
    Hrayr Harutyunyan
    Aditya Krishna Menon
    International Conference on Learning Representations (2023) (to appear)
    Knowledge distillation is a popular method of compressing a large teacher model (or an ensemble of models) to a more compact student model. While empirically effective, there is limited understanding of why distillation helps, and how to improve it to transfer richer knowledge from the teacher to the student. In this paper, we propose a new online distillation algorithm that applies distillation using a sequence of teacher models, corresponding to different checkpoints during teacher training. Intuitively, this gradually increases the complexity of the target functions that the student model is asked to mimic. Formally, we establish generalization bounds that explicate how the target label complexity can benefit the student. We empirically demonstrate that online distillation can significantly improve over regular offline distillation, particularly in scenarios where there is a large teacher-student capacity gap.
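    A minimal PyTorch sketch of the online distillation loop described above, assuming a list of teacher checkpoints ordered from early to late in teacher training (the model, loader, and hyperparameter names are hypothetical):

```python
import torch
import torch.nn.functional as F

def online_distill(student, teacher_checkpoints, loader, T=2.0, lr=1e-3):
    """Distill against a sequence of teacher checkpoints, so the target
    function the student must mimic grows gradually more complex."""
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    for teacher in teacher_checkpoints:  # ordered early -> late
        teacher.eval()
        for x, _ in loader:
            with torch.no_grad():
                t = F.softmax(teacher(x) / T, dim=-1)
            s = F.log_softmax(student(x) / T, dim=-1)
            loss = F.kl_div(s, t, reduction="batchmean") * T * T
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```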
    This paper reveals a curious observation that modern large-scale machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by ``sparse'' we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, and at layers of all depth levels. Moreover, larger Transformers with more layers and higher MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. To probe why sparsity emerges, we design experiments with random labels, random images, and infinite data, and find that sparsity may be due primarily to optimization, and has little to do with the properties of the training dataset. We discuss how sparsity immediately implies a means for significantly reducing the FLOP count and improving efficiency for Transformers. Moreover, perhaps surprisingly, we demonstrate that explicitly enforcing an even sparser activation via Top-K thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration of their prediction confidence.
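    The Top-K variant mentioned at the end is easy to express; below is an illustrative PyTorch MLP block that keeps only the k largest post-ReLU activations per token (parameter shapes are assumptions for the sketch):

```python
import torch

def topk_mlp_block(x, W1, b1, W2, b2, k=32):
    """Transformer-style MLP whose ReLU activation map is explicitly
    sparsified: all but the k largest entries per token are zeroed."""
    h = torch.relu(x @ W1 + b1)                # (tokens, hidden)
    kth = h.topk(k, dim=-1).values[..., -1:]   # k-th largest per token
    h = torch.where(h >= kth, h, torch.zeros_like(h))
    return h @ W2 + b2

x = torch.randn(4, 64)
W1, b1 = torch.randn(64, 256), torch.zeros(256)
W2, b2 = torch.randn(256, 64), torch.zeros(64)
print(topk_mlp_block(x, W1, b1, W2, b2).shape)  # torch.Size([4, 64])
```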
    Many modern high-performing machine learning models such as GPT-3 primarily rely on scaling up models, e.g., transformer networks. Simultaneously, a parallel line of work aims to improve the model performance by augmenting an input instance with other (labeled) instances during inference. Examples of such augmentations include task-specific prompts and similar examples retrieved from the training data by a nonparametric component. Remarkably, retrieval-based methods have enjoyed success on a wide range of problems, ranging from standard natural language processing and vision tasks to protein folding, as demonstrated by many recent efforts, including WebGPT and AlphaFold. Despite a growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. In this paper, we present a formal treatment of retrieval-based models to characterize their generalization ability. In particular, we focus on two classes of retrieval-based classification approaches: First, we analyze a local learning framework that employs an explicit local empirical risk minimization based on retrieved examples for each input instance. Interestingly, we show that breaking down the underlying learning task into local sub-tasks enables the model to employ a low complexity parametric component to ensure good overall accuracy. The second class of retrieval-based approaches we explore learns a global model using kernel methods to directly map an input instance and retrieved examples to a prediction, without explicitly solving a local learning task.
    Teacher Guided Training: An Efficient Framework for Knowledge Transfer
    Chong You
    Himanshu Jain
    Rob Fergus
    International Conference on Learning Representations (2023) (to appear)
    The remarkable performance gains realized by large pretrained models, e.g., GPT-3, hinge on the massive amounts of data they are exposed to during training. Analogously, distilling such large models to compact models for efficient deployment also necessitates a large amount of (labeled or unlabeled) training data. In this paper, we devise the teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pre-trained \emph{generative} models while obviating the need to go through a large volume of data. TGT exploits the fact that the teacher has acquired a good representation of the underlying data domain, which typically corresponds to a much lower dimensional manifold than the ambient space. Furthermore, we can use the teacher to explore the instance space more efficiently through sampling or gradient-based methods; this makes TGT especially attractive for limited data or long-tail settings. We formally capture this benefit of the proposed data-domain exploration in our generalization bounds. Among our empirical evaluations, we find that TGT can improve accuracy on ImageNet-LT by 10% compared to a natural baseline and match accuracy on sentiment analysis on Amazon reviews without the need for pretraining.
    Serving Graph Compression for Graph Neural Networks
    Cho-Jui Hsieh
    International Conference on Learning Representations (2023) (to appear)
    Serving a GNN model in online applications is challenging --- one has to propagate the information from training nodes to testing nodes to achieve the best performance, while storing the whole training set (including the training graph and node features) during inference time is prohibitive for most real-world applications. In this paper we tackle this serving space compression problem, where the goal is to compress the storage requirement for GNN serving. Given a model to be served, the proposed method constructs a small set of virtual representative nodes to replace the original training nodes, so that users just need to replace the original training set with this virtual representative set to reduce the space requirement for serving, without changing the actual GNN model or the forward pass. We carefully analyze the error in the forward pass and derive simple ways to construct the node features and graph of the virtual representative nodes to minimize the approximation error. Experimental results demonstrate that the proposed method can significantly reduce the serving space requirement for GNN inference.
    Many practical settings allow a learner to defer predictions to one or more costly experts. For example, the learning to defer paradigm allows a learner to defer to a human expert, at some monetary cost. Similarly, the adaptive inference paradigm allows a base model to defer to one or more large models, at some computational cost. The goal in these settings is to learn classification and deferral mechanisms that optimise a suitable accuracy-cost tradeoff. To achieve this, a central issue studied in prior work is the design of a coherent loss function for both mechanisms. In this work, we demonstrate that existing losses have two subtle limitations: they can encourage underfitting when there is a high cost of deferring, and the deferral function can have a weak dependence on the base model's predictions. To resolve these issues, we propose a post-hoc training scheme: we train a deferral function on top of a base model, with the objective of deferring when the base model's error probability exceeds the cost of the expert model. This may be viewed as applying a partial surrogate to the ideal deferral loss, which can lead to a tighter approximation and thus better performance. Empirically, we verify the efficacy of post-hoc training on benchmarks for learning to defer and adaptive inference.
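    A sketch of the post-hoc training targets under the stated rule (defer when the base model's error probability exceeds the expert's cost); the deferral function itself can then be any binary classifier fit on these targets. Names below are hypothetical.

```python
import numpy as np

def deferral_targets(base_probs, labels, expert_cost):
    """Post-hoc deferral sketch: with the base model frozen, mark an
    example 'defer' when the base model's estimated error probability
    on it exceeds the cost of querying the expert.

    base_probs: (n, num_classes) softmax outputs of the base model.
    labels:     (n,) integer class labels for the held-out set.
    """
    p_label = base_probs[np.arange(len(labels)), labels]
    return (1.0 - p_label > expert_cost).astype(np.float32)  # 1 = defer
```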
    Transformer-based models such as BERT have proven successful in information retrieval problems, which seek to identify relevant documents for a given query. There are two broad flavours of such models: cross-attention (CA) models, which learn a joint embedding for the query and document, and dual-encoder (DE) models, which learn separate embeddings for the query and document. Empirically, CA models are often found to be more accurate, which has motivated a series of works seeking to bridge this gap. However, a more fundamental question remains less explored: does this performance gap reflect an inherent limitation in the capacity of DE models, or a limitation in the training of such models? And does such an understanding suggest a principled means of improving DE models? In this paper, we study these questions, with three contributions. First, we establish theoretically that with a sufficiently large embedding dimension, DE models have the capacity to model a broad class of score distributions. Second, we show empirically that on real-world problems, DE models may overfit to spurious correlations in the training set, and thus under-perform on test samples. To mitigate this behaviour, we propose a novel distillation strategy that leverages confidence margins, and confirm its practical efficacy on the MSMARCO-Passage benchmark.
    A Fourier Approach to Mixture Learning
    Mingda Qiao
    Guru Prashanth Guruganesh
    Avinava Dubey
    Conference on Neural Information Processing Systems (2022) (to appear)
    We revisit the problem of learning mixtures of spherical Gaussians. Given samples from the mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the \emph{separation} $\Delta$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$. In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. This was previously open even in one dimension. More generally, our algorithm runs in time $\mathrm{poly}(k)\cdot f(d, \Delta, \epsilon)$, and is thus fixed-parameter tractable in the parameters $d$, $\Delta$ and $\epsilon$. Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.
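    A one-dimensional toy version of the Fourier idea: the empirical characteristic function of the mixture, after dividing out the Gaussian envelope $e^{-\xi^2/2}$, is a superposition of pure tones $e^{i\xi\mu_j}$ whose frequencies are exactly the means. The frequency grid and sample size below are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([-4.0, 0.0, 5.0])                  # unknown mixture means
x = rng.normal(loc=rng.choice(means, size=100_000), scale=1.0)

# Empirical Fourier transform of the mixture at chosen frequencies xi:
# for a uniform mixture of N(mu_j, 1),
#   E[exp(i*xi*x)] = exp(-xi^2/2) * mean_j exp(i*xi*mu_j),
# so dividing out the Gaussian envelope leaves pure tones at the means.
xi = np.linspace(-2.0, 2.0, 81)
phi_hat = np.array([np.exp(1j * f * x).mean() for f in xi]) * np.exp(xi**2 / 2)

exact_tones = np.exp(1j * np.outer(xi, means)).mean(axis=1)
print(np.abs(phi_hat - exact_tones).max())          # small: tones recovered
```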
    Long-tail learning via logit adjustment
    Aditya Krishna Menon
    Himanshu Jain
    International Conference on Learning Representations (ICLR) 2021
    Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naive learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques involve logit adjustment based on the label priors, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a high relative margin between logits of rare versus dominant labels. Our techniques unify and generalise several recent proposals in the literature, while possessing stronger theoretical guarantees and empirical performance.
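    The post-hoc variant is essentially a one-liner; here is a numpy sketch (the scaling knob tau and the toy numbers are illustrative):

```python
import numpy as np

def logit_adjusted_predict(logits, class_priors, tau=1.0):
    """Post-hoc logit adjustment: subtracting tau*log(prior) enforces a
    larger relative margin for rare labels at prediction time. The
    loss-based variant instead adds tau*log(prior) to the logits inside
    the softmax cross-entropy during training."""
    return np.argmax(logits - tau * np.log(class_priors), axis=-1)

logits = np.array([[2.0, 1.5, 0.2]])
priors = np.array([0.90, 0.09, 0.01])         # long-tailed label distribution
print(logit_adjusted_predict(logits, priors))  # the rare class wins: [2]
```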
    A statistical perspective on distillation
    Aditya Krishna Menon
    International Conference on Machine Learning (ICML) 2021 (to appear)
    Knowledge distillation is a technique for improving a ``student'' model by replacing its one-hot training labels with a label distribution obtained from a ``teacher'' model. Despite its broad success, several basic questions --- e.g., Why does distillation help? Why do more accurate teachers not necessarily distill better? --- have received limited formal study. In this paper, we present a statistical perspective on distillation which provides an answer to these questions. Our core observation is that a ``Bayes teacher'' providing the true class-probabilities can lower the variance of the student objective, and thus improve performance. We then establish a bias-variance tradeoff that quantifies the value of teachers that approximate the Bayes class-probabilities. This provides a formal criterion as to what constitutes a ``good'' teacher, namely, the quality of its probability estimates. Finally, we illustrate how our statistical perspective facilitates novel applications of distillation to bipartite ranking and multiclass retrieval.
    Negative sampling is a widely adopted technique to enable efficient training in settings with a large number of classes. Typically, negative sampling approaches aim at approximating the value or gradient of the computationally expensive loss function that takes all the negative labels into account. In this work, we study the connection between negative sampling approaches and loss modification techniques for countering label imbalance. We show that different (bias) correction strategies that accompany negative sampling approaches can have unintended consequences on the model's performance on various data sub-populations. We then propose a unified approach to tackle both sampling bias, arising from working with a subset of all negative classes, and labeling bias, which is inherently present in the data due to label-imbalance. Finally, we verify our analysis and demonstrate the utility of our unified approach through empirical evaluation on standard image classification and retrieval benchmarks.
    Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings. In particular, the use of distillation in top-k ranking settings, where the goal is to rank the k most relevant items correctly, remains largely unexplored. In this paper, we study such ranking problems through the lens of distillation. We present a framework for distillation for top-k ranking and establish connections with the existing ranking methods. The core idea of this framework is to preserve the ranking at the top by matching the k largest scores of student and teacher while penalizing large scores for items ranked low by the teacher. Building on our framework, we develop a novel distillation approach, RankDistil, specifically catered towards ranking problems with a large number of items to rank. Finally, we conduct experiments which demonstrate that RankDistil yields benefits over commonly used baselines for ranking problems.
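    An illustrative loss in this spirit (not the exact RankDistil objective, whose surrogates differ in detail): match the student's scores to the teacher's on the teacher's top-k items, and push down student scores on items the teacher ranks low.

```python
import torch

def rankdistil_style_loss(s_student, s_teacher, k=10, alpha=0.1):
    """Illustrative top-k distillation loss: squared matching on the
    teacher's top-k items plus a hinge penalty discouraging large
    student scores on the remaining (low-ranked) items."""
    topk = s_teacher.topk(k, dim=-1).indices
    in_topk = torch.zeros_like(s_teacher, dtype=torch.bool)
    in_topk.scatter_(-1, topk, True)
    match = (s_student - s_teacher)[in_topk].pow(2).mean()
    push_down = torch.relu(s_student[~in_topk]).mean()
    return match + alpha * push_down

s_teacher = torch.randn(2, 100)   # teacher scores over 100 items
s_student = torch.randn(2, 100, requires_grad=True)
print(rankdistil_style_loss(s_student, s_teacher))
```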
    Overparameterisation and worst-case generalisation: friend or foe?
    Aditya Krishna Menon
    International Conference on Learning Representations (ICLR) 2021
    Overparameterised neural networks have demonstrated the remarkable ability to perfectly fit training samples, while still generalising to unseen test samples. However, several recent works have revealed that such models' good average performance does not always translate to good worst-case performance: in particular, they may perform poorly on under-represented subgroups in the training set. In this paper, we show that in certain settings, overparameterised models' bias against under-represented samples may be easily corrected via post-hoc processing. Specifically, we demonstrate that such models' bias can be restricted to their classification layers, and manifests in structured shifts in predictions for rare subgroups. We detail two post-hoc correction techniques to eliminate this bias, which operate purely on the original models' outputs. We empirically verify that with such post-hoc correction, overparameterisation can improve worst-case performance.
    Transformer networks use pairwise attention to compute contextual embeddings of their inputs, and have achieved state-of-the-art performance in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute attention in each layer. This has prompted recent research into faster attention models, with a predominant approach involving sparsifying the connections in the attention layers. While empirically promising for long sequences, several fundamental questions remain unanswered: Can sparse transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How do the sparsity pattern and the sparsity level affect their performance? In this paper, we provide a \emph{unifying framework} that captures existing sparse attention models. Our analysis proposes sufficient conditions under which we show that a sparse attention model can provably \emph{universally approximate} any sequence-to-sequence function. Surprisingly, our results show the existence of attention models with only $O(n)$ connections per attention layer that can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns and levels of sparsity on standard NLP tasks.
    Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make Transformers forget specific old facts and memorize new ones. In this paper, we propose a new task of \emph{explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts}. This task is useful in many scenarios, such as updating stale knowledge, protecting privacy, and eliminating unintended biases stored in the models. We benchmark several approaches that provide natural baseline performance on this task. This leads to the discovery of key components of a Transformer model that are especially effective for knowledge modifications. The work also provides insights into the role that different training phases (such as pretraining and fine-tuning) play towards memorization and knowledge modification.
    Adversarial robustness via robust low rank representations
    Aravindan Vijayaraghavan
    Himanshu Jain
    Pranjal Awasthi
    2020 Conference on Neural Information Processing (NeurIPS)
    Adversarial robustness measures the susceptibility of a classifier to imperceptible perturbations made to the inputs at test time. In this work we highlight the benefits of natural low rank representations that often exist for real data such as images, for training neural networks with certified robustness guarantees. Our first contribution is for certified robustness to perturbations measured in $\ell_2$ norm. We exploit low rank data representations to provide improved guarantees over state-of-the-art randomized smoothing-based approaches on standard benchmark datasets such as CIFAR-10. Our second contribution is for the more challenging setting of certified robustness to perturbations measured in $\ell_\infty$ norm. We demonstrate empirically that natural low rank representations have inherent robustness properties that can be leveraged to provide significantly better guarantees for certified robustness to $\ell_\infty$ perturbations. Our certificate of $\ell_\infty$ robustness relies on a natural quantity involving the $\infty \to 2$ matrix operator norm associated with the representation, to translate robustness guarantees from $\ell_2$ to $\ell_\infty$ perturbations. A key technical ingredient for our certification guarantees is a fast algorithm based on the multiplicative weights update method to provide sharp upper bounds on the above matrix norm.
    The Generalized Lasso for Sub-gaussian Measurements with Dithered Quantization
    Christos Thrampoulidis
    IEEE Transactions on Information Theory (2020)
    In the problem of structured signal recovery from high-dimensional linear observations, it is commonly assumed that full-precision measurements are available. Under this assumption, the recovery performance of the popular Generalized Lasso (G-Lasso) is by now well-established. In this paper, we extend these types of results to the practically relevant settings with quantized measurements. We study two extremes of the quantization schemes, namely, uniform and one-bit quantization; the former imposes no limit on the number of quantization bits, while the latter only allows for one bit. In the presence of a uniform dithering signal and when measurement vectors are sub-gaussian, we show that the same algorithm (i.e., the G-Lasso) has favorable recovery guarantees for both uniform and one-bit quantization schemes. Our theoretical results shed light on the appropriate choice of the range of values of the dithering signal and accurately capture the error dependence on the problem parameters. For example, our error analysis shows that the G-Lasso with one-bit uniformly dithered measurements leads to only a logarithmic rate loss compared to full-precision measurements.
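    A small simulation of the one-bit dithered scheme: with dither drawn uniformly from $[-\lambda, \lambda]$, the quantity $\lambda\,\mathrm{sign}(\langle a_i, x_0\rangle + \tau_i)$ is an unbiased proxy for $\langle a_i, x_0\rangle$ (whenever $|\langle a_i, x_0\rangle| \le \lambda$), so the usual Lasso applies unchanged. Problem sizes and the regularization weight are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, s, lam = 500, 200, 5, 6.0
x0 = np.zeros(d); x0[:s] = rng.normal(size=s)   # sparse structured signal
A = rng.normal(size=(n, d))                     # (sub-)gaussian measurements
dither = rng.uniform(-lam, lam, size=n)         # uniform dithering signal
y = np.sign(A @ x0 + dither)                    # one-bit quantized observations

# E[lam * y_i | a_i] = <a_i, x0> when |<a_i, x0>| <= lam, so the same
# Lasso used for full-precision data recovers x0 from lam * y.
x_hat = Lasso(alpha=0.1).fit(A, lam * y).coef_
print(np.linalg.norm(x_hat - x0))               # small recovery error
```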
    Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one of the important factors contributing to the large embedding size requirement. In particular, our analysis highlights that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads, causing this limitation, which we further validate with our experiments. As a solution, we propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power. We empirically show that this allows us to train models with a relatively smaller embedding dimension and with better performance scaling.
    We consider learning a multi-class classification model in the federated setting, where each user has access to the positive data associated with only a single class. As a result, during each federated learning round, the users need to locally update the classifier without having access to the features and the model parameters for the negative classes. Thus, naively employing conventional decentralized learning such as distributed SGD or Federated Averaging may lead to trivial or extremely poor classifiers. In particular, for the embedding based classifiers, all the class embeddings might collapse to a single point. To address this problem, we propose a generic framework for training with only positive labels, namely Federated Averaging with Spreadout (FedAwS), where the server imposes a geometric regularizer after each round to encourage classes to be spreadout in the embedding space. We show, both theoretically and empirically, that FedAwS can almost match the performance of conventional learning where users have access to negative labels. We further extend the proposed method to the settings with large output spaces.
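    A sketch of the server-side spreadout step (the margin and squared-hinge form are assumptions in the spirit of the description; FedAwS' exact regularizer may differ in details):

```python
import torch

def spreadout_penalty(class_emb, margin=0.5):
    """Server-side regularizer sketch: penalize pairs of class embeddings
    whose similarity exceeds a margin, so embeddings stay spread out even
    though each client only pulls its own class's embedding toward its
    (positive-only) data."""
    sim = class_emb @ class_emb.t()                      # pairwise similarities
    mask = ~torch.eye(len(class_emb), dtype=torch.bool)  # drop self-similarity
    return torch.relu(sim[mask] - margin).pow(2).sum()

W = torch.nn.functional.normalize(torch.randn(10, 32), dim=-1)  # 10 class embeddings
print(spreadout_penalty(W))
```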
    Can gradient clipping mitigate label noise?
    Aditya Krishna Menon
    International Conference on Learning Representations (ICLR) (2020)
    Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum. This intuition has been made precise in a line of recent works, which show that suitable clipping can yield significantly faster convergence than vanilla gradient descent. In this paper, we study gradient clipping from a robustness lens: informally, one expects clipping to provide robustness to noise, since one does not overly trust any single sample. Surprisingly, we prove that gradient clipping does not in general provide robustness to label noise. On the other hand, we show that robustness is achieved by a form of loss clipping. This yields a simple, noise-robust alternative to the standard cross-entropy loss which performs well empirically.
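    The loss-clipping alternative can be written directly; below is a sketch of a partially linearised cross-entropy in the spirit of the paper's proposal, with value and slope matched at p = 1/tau:

```python
import numpy as np

def clipped_ce(p, tau=2.0):
    """Loss-clipping sketch: standard cross-entropy -log(p) while the
    labeled class's probability p stays above 1/tau, but linearised
    below it, so a mislabeled example's gradient magnitude is capped at
    tau (vs. the unbounded 1/p of plain cross-entropy)."""
    p = np.asarray(p, dtype=float)
    return np.where(p >= 1.0 / tau,
                    -np.log(p),
                    -tau * p + np.log(tau) + 1.0)  # value- and slope-matched tail

print(clipped_ce([0.9, 0.5, 0.01]))  # the 0.01 example no longer dominates
```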
    Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other architectures that can compute contextual mappings and empirically evaluate them.
    Robust Direction of Arrival Estimation in the Presence of Array Faults using Snapshot Diversity
    Gary Lee
    Gregory W Wornell
    7th IEEE Global Conference on Signal and Information Processing (GlobalSIP) 2019
    Many direction-of-arrival (DOA) estimation algorithms require accurate measurements from all sensing elements on an antenna array. However, in various practical settings, it becomes imperative to perform DOA estimation even in the presence of faulty elements. In this work, we develop an algorithm that can jointly estimate the DOA of sources and the locations of the faulty elements. This is achieved by introducing weights that describe the degree of outlierness of each element. Further, for situations where only single snapshots are available, we propose a new snapshot diversity formulation for which our algorithm can still be applied. Simulation results over four different fault models demonstrate that the proposed algorithm robustly estimates DOAs and accurately identifies the faulty elements.
    Reliable Distributed Clustering with Redundant Data Assignment
    Venkata Gandikota
    Arya Mazumdar
    ICML Workshop on Coding Theory for Large-scale Machine Learning (2019)
    In this work we present distributed generalized clustering algorithms (with k-means and PCA as special cases) that can handle large scale data across multiple machines in spite of straggling or unreliable machines. We propose a novel data assignment scheme that enables us to obtain global information about data even when some machines fail to respond. The assignment scheme leads to distributed algorithms with good approximation guarantees for a variety of clustering and dimensionality reduction problems.
    Lifting high-dimensional non-linear models with Gaussian regressors
    Christos Thrampoulidis
    22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019
    We study the problem of recovering a structured signal $\mathbf{x}_0$ from high-dimensional data $y_i = f(\mathbf{a}_i^T\mathbf{x}_0)$ for some nonlinear (and potentially unknown) link function $f$, when the regressors $\mathbf{a}_i$ are iid Gaussian. Brillinger (1982) showed that ordinary least-squares estimates $\mathbf{x}_0$ up to a constant of proportionality $\mu_\ell$, which depends on $f$. Recently, Plan & Vershynin (2015) extended this result to the high-dimensional setting, deriving sharp error bounds for the generalized Lasso. Unfortunately, both least-squares and the Lasso fail to recover $\mathbf{x}_0$ when $\mu_\ell=0$. For example, this includes all even link functions. We resolve this issue by proposing and analyzing an alternative convex recovery method. In a nutshell, our method treats such link functions as if they were linear in a lifted space of higher dimension. Interestingly, our error analysis captures the effect of both the nonlinearity and the problem's geometry in a few simple summary parameters.
    Learning and Recovery in the ReLU Model
    Arya Mazumdar
    Proceedings of 57th Annual Allerton Conference on Communication, Control, and Computing, 2019
    Rectified linear units, or ReLUs, have become a preferred activation function for artificial neural networks. In this paper we consider two basic learning problems assuming that the underlying data follow a generative model based on a simple network with ReLU activations. The first problem we study corresponds to learning a generative model in the presence of nonlinearity (modeled by the ReLU functions). Given a set of signal vectors $\mathbf{y}^i \in \mathbb{R}^d, i =1, 2, \dots , n$, we aim to learn the network parameters, i.e., the $d\times k$ matrix $A$, under the model $\mathbf{y}^i = \mathrm{ReLU}(A\mathbf{c}^i +\mathbf{b})$, where $\mathbf{b}\in \mathbb{R}^d$ is a random bias vector. We show that it is possible to recover the column space of $A$ within an error of $O(d)$ (in Frobenius norm) under certain conditions on the distribution of $\mathbf{b}$. The second problem we consider is that of robust recovery of the signal in the presence of outliers. In this setting, we are interested in recovering the latent vector $\mathbf{c}$ from its noisy nonlinear images of the form $\mathbf{v} = \mathrm{ReLU}(A\mathbf{c}) + \mathbf{e}+\mathbf{w}$, where $\mathbf{e} \in \mathbb{R}^d$ denotes the outliers with sparsity $s$ and $\mathbf{w} \in \mathbb{R}^d$ denotes the dense but small noise. We show that the LASSO algorithm recovers $\mathbf{c} \in \mathbb{R}^k$ within an $\ell_2$-error of $O\big(\sqrt{{((k+s)\log d})/{d}}\big)$ when $A$ is a random Gaussian matrix.
    Multilabel reductions: what is my loss optimising?
    Aditya Krishna Menon
    Advances in Neural Information Processing Systems (NeurIPS) (2019)
    Multilabel classification is a challenging problem arising in applications ranging from information retrieval to image tagging. A popular approach to this problem is to employ a reduction to a suitable series of binary or multiclass problems (e.g., computing a softmax based cross-entropy over the relevant labels). While such methods have seen empirical success, less is understood about how well they approximate two fundamental performance measures: the precision and recall@k. In this paper, we study three commonly used reductions, and two new reductions based on a normalised loss function, wherein the contribution of each instance is normalised by the number of relevant labels. A surprising outcome of our study is that each reduction is provably consistent with respect to either precision or recall, but not both. Further, we explicate that the probability scores obtained from reductions focussed on precision must be interpreted with caution. We empirically validate our results on real-world datasets, showing in particular that our normalised loss function yields recall gains over existing reductions.
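    A sketch of the normalisation idea using a one-vs-all binary reduction (the paper also treats softmax-based reductions; the OvA form here is just one instantiation):

```python
import torch
import torch.nn.functional as F

def normalised_ova_loss(logits, targets):
    """Normalised multilabel reduction sketch: per-label binary
    cross-entropy where each instance's contribution is divided by its
    number of relevant labels, so heavily-labeled instances do not
    dominate the objective."""
    per_label = F.binary_cross_entropy_with_logits(logits, targets,
                                                   reduction="none")
    n_relevant = targets.sum(dim=-1, keepdim=True).clamp(min=1.0)
    return (per_label / n_relevant).sum(dim=-1).mean()

logits = torch.randn(3, 5)
targets = torch.tensor([[1, 0, 1, 0, 0],
                        [1, 1, 1, 1, 0],
                        [0, 0, 0, 0, 1]], dtype=torch.float)
print(normalised_ova_loss(logits, targets))
```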
    Sampled softmax with random Fourier features
    Jiecao (Jack) Chen
    Advances in Neural Information Processing Systems (NeurIPS) (2019)
    The computational cost of training with softmax cross entropy loss grows linearly with the number of classes. For settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the gradient based on these classes, known as the \emph{sampled softmax} method. However, the sampled softmax provides a biased estimate of the gradient unless the samples are drawn from the exact softmax distribution, which is again expensive to compute. Therefore, a widely employed practical approach (without theoretical justification) involves sampling from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of the role that different sampling distributions play in determining the quality of sampled softmax. Motivated by our analysis and the work on kernel-based sampling, we propose the {\em Random Fourier Softmax} (RF-softmax) method that utilizes the powerful Random Fourier features to enable more efficient and accurate sampling from the (approximate) softmax distribution. We show that RF-softmax leads to low-bias estimation in terms of both the full softmax distribution and the full softmax gradient. Furthermore, the cost of RF-softmax scales only logarithmically with the number of classes.
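    The bias-correction step at the heart of any sampled softmax is small enough to show inline; RF-softmax's contribution is making the proposal q itself a sharp, cheap approximation of the softmax via random Fourier features, while the numpy sketch below shows only the generic correction:

```python
import numpy as np

def sampled_softmax_loss(true_logit, sampled_logits, sampled_log_q):
    """Sampled softmax sketch: negatives are drawn from a proposal q, and
    subtracting log q(y) from each sampled logit corrects the estimate of
    the full-softmax normalizer (exactly unbiased only when q is the
    softmax distribution itself)."""
    corrected = sampled_logits - sampled_log_q   # importance correction
    z = np.concatenate(([true_logit], corrected))
    m = z.max()                                  # stable log-sum-exp
    return -true_logit + m + np.log(np.exp(z - m).sum())

print(sampled_softmax_loss(2.0, np.array([0.5, 1.2]), np.log([0.3, 0.1])))
```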
    Robust Gradient Descent via Moment Encoding and LDPC Codes
    Raj Kumar Maity
    Arya Mazumdar
    IEEE International Symposium on Information Theory (ISIT) 2019
    This paper considers the problem of implementing large-scale gradient descent algorithms in a distributed computing setting in the presence of straggling processors. To mitigate the effect of the stragglers, it has been previously proposed to encode the data with an erasure-correcting code and decode at the master server at the end of the computation. We, instead, propose to encode the second moment of the data with a low-density parity-check (LDPC) code. The iterative decoding algorithms for LDPC codes have very low computational overhead and the number of decoding iterations can be made to automatically adjust with the number of stragglers in the system. For a random model for stragglers, we obtain convergence guarantees for the proposed solution by viewing it as a stochastic gradient descent method. Furthermore, the proposed solution outperforms the existing schemes in a real distributed computing setup.