Rishi Saket

Authored Publications
    As large neural models become increasingly accurate and powerful, they raise privacy and transparency concerns about data usage. Data platforms, regulations and user expectations are therefore rapidly evolving towards enforcing privacy via aggregation. We focus on the use case of online advertising, where the emergence of aggregate data is imminent and can significantly impact the multi-billion dollar industry. In aggregated datasets, labels are assigned to groups of data points rather than to individual data points. This leads to a weakly supervised formulation, Learning from Label Proportions (LLP), in which a model is trained on groups (a.k.a. bags) of instances and their corresponding label proportions to predict labels for individual instances. While learning on aggregate data due to privacy concerns is becoming increasingly popular, there is no large-scale benchmark for measuring performance and guiding improvements on this important task. We propose LLP-Bench, a web-scale benchmark with ∼70 datasets and 45 million datapoints. To the best of our knowledge, LLP-Bench is the first large-scale tabular LLP benchmark with extensive diversity in its constituent datasets, realistic in terms of the sponsored search datasets used and the aggregation mechanisms followed. Through more than 3000 experiments we compare the performance of 9 SOTA methods in detail. To the best of our knowledge, this is the first study that compares diverse approaches in such depth.
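The LLP setup described above trains a model from bag-level label proportions only. A minimal sketch of one such training step, assuming a small PyTorch classifier and a generic proportion-matching loss (illustrative only, not LLP-Bench's actual pipeline):

```python
# Minimal sketch of LLP training: the model sees bags and their label
# proportions, never instance labels. Classifier architecture and the
# proportion-matching loss are illustrative assumptions.
import torch
import torch.nn as nn

class InstanceClassifier(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return torch.sigmoid(self.net(x)).squeeze(-1)  # per-instance P(y = 1)

def bag_proportion_loss(model, bags, proportions):
    """bags: list of (n_i, dim) tensors; proportions: (num_bags,) tensor of label means."""
    preds = torch.stack([model(b).mean() for b in bags])  # predicted proportion per bag
    return nn.functional.binary_cross_entropy(preds, proportions)

# One gradient step on toy bags.
model = InstanceClassifier(dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
bags = [torch.randn(5, 8), torch.randn(7, 8)]
props = torch.tensor([0.4, 0.7])
loss = bag_proportion_loss(model, bags, props)
loss.backward()
opt.step()
```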
    Generalization and Learnability in Multiple Instance Regression
    Lorne Applebaum
    Ashwinkumar Badanidiyuru Varadaraja
    Chandan Giri
    Proc. UAI (2024)
    Multiple instance regression (MIR) was introduced by Ray & Page (2001) as an analogue of multiple instance learning (MIL), in which we are given bags of feature-vectors (instances) and, for each bag, a bag-label which matches the label of one (unknown) primary instance from that bag. The goal is to compute a hypothesis regressor consistent with the underlying instance-labels. A natural approach is to find the best primary-instance assignment and regressor optimizing (say) the mse loss on the bags, though no formal generalization guarantees were known. Our work is the first to prove generalization error bounds for MIR when the bags are drawn iid at random. Essentially, w.h.p. any MIR regressor with low error on sampled bags also has low error on the underlying instance-label distribution. We next study the complexity of linear regression on MIR bags, shown to be NP-hard in general by Ray & Page (2001), who however left open the possibility of arbitrarily good approximations. Significantly strengthening previous work, we prove a strong inapproximability bound: even if there exists a zero-loss MIR linear regressor on a collection of 2-sized bags with labels in [−1, 1], it is NP-hard to find an MIR linear regressor with loss < C for some absolute constant C > 0. Our work also proposes two novel model training methods on MIR bags based on (i) a weighted assignment loss and (ii) EM pseudo-labeling, handling the case of overlapping bags, which has not previously been studied. We conduct extensive empirical evaluations on synthetic and real-world datasets showing that our methods outperform the baseline MIR methods.
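The "best primary instance assignment" idea mentioned above can be illustrated with a small EM-style loop: alternately pick, per bag, the instance whose current prediction is closest to the bag label, then refit the regressor on those picks. This is a hedged sketch of the general approach, not the paper's exact training method; names and the use of scikit-learn are assumptions.

```python
# Illustrative EM-style loop for MIR: alternate primary-instance assignment
# (per bag, the instance best explained by the current regressor) with
# refitting a linear regressor on the assigned instances.
import numpy as np
from sklearn.linear_model import LinearRegression

def mir_em(bags, bag_labels, n_iters=10):
    """bags: list of (n_i, d) arrays; bag_labels: (num_bags,) array of bag labels."""
    reg = LinearRegression()
    # Initialize by treating every instance as a candidate primary instance.
    X0 = np.vstack(bags)
    y0 = np.concatenate([[y] * len(b) for b, y in zip(bags, bag_labels)])
    reg.fit(X0, y0)
    for _ in range(n_iters):
        X, y = [], []
        for b, lbl in zip(bags, bag_labels):
            preds = reg.predict(b)
            primary = np.argmin((preds - lbl) ** 2)  # assignment step
            X.append(b[primary])
            y.append(lbl)
        reg.fit(np.array(X), np.array(y))            # refit step
    return reg
```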
    In recent years the framework of learning from label proportions (LLP) has been gaining importance in machine learning. In this setting, the training examples are aggregated into subsets or bags and only the average label per bag is available for learning an example-level predictor. This generalizes traditional PAC learning, which is the special case of unit-sized bags. The computational learning aspects of LLP were studied in recent works [Saket 21, 22], which gave algorithms and hardness results for learning halfspaces in the LLP setting. In this work we focus on the intractability of LLP learning of boolean functions. Our first result shows that given a collection of bags of size at most 2 which are consistent with an OR function, it is NP-hard to find a CNF of constantly many clauses which "satisfies" any constant fraction of the bags. This is in contrast with the work of [Saket 21], which gave a (2/5)-approximation using a halfspace; thus our result provides a separation between ORs and halfspaces as hypotheses for LLP learning of ORs. Next, we prove the hardness of satisfying more than a 1/2 + o(1) fraction of such bags using a t-DNF (i.e. a DNF where each term has <= t literals) for any constant t. In usual PAC learning such a hardness was known [Khot-Saket 08] only for learning noisy ORs. We also study the learnability of parities and show that it is NP-hard to satisfy more than a (q/2^{q-1} + o(1))-fraction of q-sized bags which are consistent with a parity using a parity, while a random parity achieves a (1/2^{q-1})-approximation.
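For concreteness, "satisfying" a bag here means that the hypothesis's average label over the bag equals the observed label proportion. A small illustrative helper (function names are ours, not from the paper):

```python
# "Satisfying" a bag: the hypothesis's average label over the bag equals the
# observed label proportion. Names here are illustrative.
from typing import Callable, Sequence

def satisfies_bag(h: Callable[[Sequence[int]], int],
                  bag: Sequence[Sequence[int]],
                  proportion: float) -> bool:
    avg = sum(h(x) for x in bag) / len(bag)
    return abs(avg - proportion) < 1e-9

# A 2-sized bag consistent with OR over 3 boolean variables.
or_fn = lambda x: int(any(x))
bag = [(0, 0, 0), (1, 0, 1)]           # labels under OR: 0 and 1
print(satisfies_bag(or_fn, bag, 0.5))  # True
```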
    Learning from label proportions (LLP) is a generalization of supervised learning in which the training data is available as sets or bags of feature-vectors (instances) along with the average instance-label of each bag. The goal is to train a good instance classifier. While most previous works in LLP have focused on training models on such training data, computational learnability in LLP has only recently been explored by [Saket21, Saket22], who showed worst-case intractability of properly learning linear threshold functions (LTFs) from label proportions, while not ruling out efficient algorithms for this problem under distributional assumptions. In this work we show that it is indeed possible to efficiently learn LTFs using LTFs when given access to random bags of some label proportion in which feature-vectors are independently sampled from a fixed Gaussian distribution N(mu, Sigma), conditioned on the label assigned by the target LTF. Our method estimates a matrix by sampling pairs of feature-vectors from the bags with and without replacement, and we prove that the principal component of this matrix necessarily yields the normal vector of the LTF. For some special cases with N(0, I) we provide a simpler expectation-based algorithm. We include an experimental evaluation of our learning algorithms along with a comparison with those of [Saket21, Saket22] and random LTFs, demonstrating the effectiveness of our techniques.
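A loose sketch of the overall recipe described above, under simplifying assumptions: form a matrix from pairs of feature-vectors sampled from bags with and without replacement, and take its principal eigenvector as a candidate normal direction for the LTF. The particular estimator below is illustrative, not the paper's exact construction:

```python
# Illustrative estimator (not the paper's exact one): accumulate outer products
# of differences of pairs drawn from bags with vs. without replacement, and take
# the principal eigenvector of their difference as the candidate normal vector.
# Assumes every bag has at least two feature-vectors.
import numpy as np

def estimate_normal_direction(bags, n_pairs=20000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    d = bags[0].shape[1]
    M_without, M_with = np.zeros((d, d)), np.zeros((d, d))
    for _ in range(n_pairs):
        b = bags[rng.integers(len(bags))]
        i, j = rng.choice(len(b), size=2, replace=False)  # pair without replacement
        u = b[i] - b[j]
        M_without += np.outer(u, u)
        i, j = rng.choice(len(b), size=2, replace=True)   # pair with replacement
        v = b[i] - b[j]
        M_with += np.outer(v, v)
    D = (M_with - M_without) / n_pairs
    eigvals, eigvecs = np.linalg.eigh(D)
    top = eigvecs[:, np.argmax(np.abs(eigvals))]          # principal component
    return top / np.linalg.norm(top)
```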
    We study the problem of adversarial attack and robustness on tabular datasets with discrete features. The discrete features of a tabular dataset represent high-level meaningful concepts with different sets of vocabularies, requiring non-uniform robustness. Further, the notion of distance between tabular input instances is not well defined, making the problem of producing adversarial examples with minor perturbations qualitatively more challenging than in settings covered by existing methods. Towards this, our paper defines the notion of distance through the lens of feature embeddings learnt to represent the discrete features. We then formulate the task of generating adversarial examples as a binary set-selection problem under non-uniform feature importance. Next, we propose an efficient approximate gradient-descent based algorithm, called the Discrete Non-uniform Approximation (DNA) attack, which reformulates the problem into a continuous domain to solve the original optimization problem for generating adversarial examples. We demonstrate the effectiveness of the proposed DNA attack using two large real-world discrete tabular datasets from e-commerce domains for binary classification, where the datasets are heavily biased towards one class. We also analyze the challenges that existing adversarial training frameworks face on such datasets under our DNA attack.
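The general idea of relaxing a discrete feature-substitution attack into a continuous optimization over feature embeddings can be sketched as follows. This is a generic relaxation, not the paper's DNA algorithm; the model interface, embedding tables and initialization are assumptions:

```python
# Generic continuous relaxation of a discrete feature-substitution attack
# (not the paper's DNA algorithm): each discrete feature gets a learnable logit
# vector; the softmax mixes vocabulary embeddings, the model's loss is ascended
# by gradient steps, and the result is discretized back to tokens.
import torch
import torch.nn.functional as F

def relaxed_attack(model, emb_tables, token_ids, label, steps=50, lr=0.1):
    """emb_tables: list of (vocab_f, dim) embedding matrices, one per feature;
    token_ids: clean input tokens; label: float tensor with the true label."""
    logits = [torch.zeros(t.shape[0], requires_grad=True) for t in emb_tables]
    with torch.no_grad():
        for f, tok in enumerate(token_ids):
            logits[f][tok] = 5.0                      # start near the clean input
    opt = torch.optim.Adam(logits, lr=lr)
    for _ in range(steps):
        soft_embs = [F.softmax(l, dim=0) @ t for l, t in zip(logits, emb_tables)]
        x = torch.cat(soft_embs).unsqueeze(0)         # assumed model input layout
        loss = -F.binary_cross_entropy_with_logits(model(x), label.view(1, 1))
        opt.zero_grad()
        loss.backward()
        opt.step()                                    # ascend the model's loss
    return [int(l.argmax()) for l in logits]          # discretize back to tokens
```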
    We study the weak supervision learning problem of Learning from Label Proportions (LLP), where the goal is to learn an instance-level classifier using the proportions of the various class labels in a bag – a collection of input instances that can often be highly correlated. While representation learning for weakly supervised tasks has been found to be effective, such methods often require domain knowledge. To the best of our knowledge, representation learning for tabular data (unstructured data containing both continuous and categorical features) has not been studied in this context. In this paper, we propose to learn diverse representations of instances within the same bags to effectively utilize the weak bag-level supervision. We propose a domain-agnostic LLP method, called "Self Contrastive Representation Learning for LLP" (SelfCLR-LLP), that incorporates a novel self-contrastive function as an auxiliary loss to learn representations on tabular data for LLP. We show that diverse representations for instances within the same bags aid efficient usage of the weak bag-level LLP supervision. We evaluate the proposed method through extensive experiments on real-world LLP datasets from e-commerce applications to demonstrate the effectiveness of SelfCLR-LLP.
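A rough sketch of combining a bag-level proportion loss with an auxiliary term that encourages diverse within-bag representations, in the spirit of the self-contrastive idea above (illustrative only, not SelfCLR-LLP's exact loss; the encoder/head interfaces are assumed):

```python
# Illustrative loss: bag-level proportion matching plus a penalty on the mean
# within-bag pairwise cosine similarity, which pushes representations of
# instances in the same bag apart. Encoder/head interfaces and the exact form
# of the auxiliary term are assumptions, not SelfCLR-LLP's definition.
import torch
import torch.nn.functional as F

def llp_loss_with_diversity(encoder, head, bags, proportions, alpha=0.1):
    """bags: list of (n_i, dim) tensors with n_i >= 2; proportions: tensor of bag label means."""
    prop_terms, div_terms = [], []
    for b, p in zip(bags, proportions):
        z = encoder(b)                              # (n_i, rep_dim) representations
        pred_prop = torch.sigmoid(head(z)).mean()   # predicted bag proportion
        prop_terms.append(F.binary_cross_entropy(pred_prop, p))
        zn = F.normalize(z, dim=1)
        sim = zn @ zn.t()                           # pairwise cosine similarities
        n = z.shape[0]
        div_terms.append((sim.sum() - sim.trace()) / (n * (n - 1)))
    return torch.stack(prop_terms).mean() + alpha * torch.stack(div_terms).mean()
```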
    We formulate a new inference task in the domain of multivariate time series forecasting (MTSF), called Variable Subset Forecast (VSF), in which only a subset of the variables is available during inference. Variables can be absent during inference because of intermittent data collection issues (e.g. sensor failures) or domain shift between train and test. To the best of our knowledge, the robustness of MTSF models in the presence of such failures has not been studied in the literature. Through extensive evaluation, we first show that the performance of state-of-the-art methods degrades significantly in this setting. We propose a non-parametric, wrapper technique that can be applied on top of any existing forecast model. Through thorough experiments across 4 datasets and 5 forecast models, we show that our technique is able to recover close to 95% of the performance of the underlying models even when only 15% of the original variables are present.
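One way to picture a non-parametric wrapper for the VSF setting is retrieval-based imputation: fill the missing variables of a test window from its nearest training window on the observed variables, then run the unchanged forecast model. The sketch below is illustrative, not necessarily the paper's exact technique:

```python
# Illustrative retrieval-based wrapper for VSF: impute the missing variables of
# a test window from the nearest training window on the observed variables,
# then hand the completed window to the unchanged forecast model.
import numpy as np

def vsf_wrapper(forecast_model, train_windows, test_window, observed_idx):
    """train_windows: (N, T, V) history; test_window: (T, V);
    observed_idx: list of observed variable indices."""
    obs = test_window[:, observed_idx]
    dists = np.linalg.norm(train_windows[:, :, observed_idx] - obs, axis=(1, 2))
    nearest = train_windows[np.argmin(dists)]
    missing_idx = [v for v in range(test_window.shape[1]) if v not in observed_idx]
    filled = test_window.copy()
    filled[:, missing_idx] = nearest[:, missing_idx]  # impute the missing variables
    return forecast_model(filled)                     # model sees a full window
```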
    In the framework of learning from label proportions (LLP) the goal is to learn a good instance-level label predictor from the observed label proportions of bags of instances. Most LLP algorithms either explicitly or implicitly assume the nature of the bag distributions with respect to the actual labels and instances, or cleverly adapt supervised learning techniques to suit LLP. In practical applications, however, the scale and nature of the data could render such assumptions invalid and many of the algorithms impractical. In this paper we address the hard problem of solving LLP with provable error bounds while being bag-distribution agnostic and model agnostic. We first propose the concept of generalized bags, an extension of bags, and then devise an algorithm to combine bag distributions, if possible, into good generalized bag distributions. We show that (w.h.p.) any classifier optimizing the squared Euclidean label-proportion loss on such a generalized bag distribution is guaranteed to minimize the instance-level loss as well. The predictive quality of our method is experimentally evaluated and it equals or betters the previous methods on pseudo-synthetic and real-world datasets.
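A minimal sketch of the squared Euclidean label-proportion loss evaluated on a collection of (generalized) bags, where each generalized bag is represented here simply as instances plus weights; the representation and names are illustrative, not the paper's construction:

```python
# Squared Euclidean label-proportion loss over (generalized) bags, with each
# generalized bag represented as instances plus weights; the representation
# and names are illustrative.
import numpy as np

def proportion_sq_loss(predict, gen_bags):
    """predict: maps an (n, d) array to predicted labels in [0, 1];
    gen_bags: list of (X, w, target) triples with instance weights w."""
    loss = 0.0
    for X, w, target in gen_bags:
        pred_prop = np.dot(w, predict(X))  # weighted 'proportion' of the generalized bag
        loss += (pred_prop - target) ** 2  # squared label-proportion error
    return loss / len(gen_bags)
```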
    We study the learnability of linear threshold functions (LTFs) in the learning from label proportions (LLP) framework. In this setting, a feature-vector classifier is learnt from bags of feature-vectors and their corresponding observed label proportions, which are satisfied by (i.e., consistent with) some unknown LTF. This problem was investigated in recent work (Saket21), which gave an algorithm to produce an LTF that satisfies at least a (2/5)-fraction of a satisfiable collection of bags, each of size <= 2, by solving and rounding a natural SDP relaxation. However, this SDP relaxation is specific to at most 2-sized bags and does not apply to bags of larger size. In this work we provide a fairly non-trivial SDP relaxation of a non-quadratic formulation for bags of size 3. We analyze its rounding procedure using novel matrix decomposition techniques to obtain an algorithm which outputs an LTF satisfying at least a (1/12)-fraction of the bags of size <= 3. We also apply our techniques to bags of size q >= 4 to provide an Omega(1/q)-approximation guarantee for a weaker notion of satisfiability. We include comparative experiments on simulated data demonstrating the applicability of our algorithmic techniques. On the complexity side we provide a hardness reduction that produces instances with bags of any constant size q. Our reduction proves the NP-hardness of satisfying more than a 1/q + o(1) fraction of a satisfiable collection of such bags using as hypothesis any function of constantly many LTFs, thereby showing that the problem is harder to approximate as the bag size q increases. Using a strengthened analysis, for q = 2 we obtain a (4/9) + o(1) hardness factor for this problem, improving upon the (1/2) + o(1) factor shown by (Saket21).
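The quantity being approximated above is the fraction of bags whose observed label proportion matches that induced by a candidate LTF. A small evaluation helper (illustrative; the SDP relaxation and rounding themselves are not sketched here):

```python
# Fraction of bags satisfied by a candidate LTF sign(w . x - theta): a bag is
# satisfied when the LTF's label proportion on the bag equals the observed one.
# Purely an evaluation helper with illustrative names.
import numpy as np

def fraction_satisfied(w, theta, bags, proportions, tol=1e-9):
    hits = 0
    for X, p in zip(bags, proportions):
        preds = (X @ w >= theta).astype(float)  # LTF labels on the bag
        hits += abs(preds.mean() - p) < tol
    return hits / len(bags)
```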
    We study the problem of properly learning linear threshold functions (LTFs) in the learning from label proportions (LLP) framework. In this, the learning is on a collection of bags of feature-vectors with only the proportion of labels available for each bag. First, we provide an algorithm that, given a collection of such bags each of size at most two whose label proportions are consistent with (i.e., the bags are satisfied by) an unknown LTF, efficiently produces an LTF that satisfies at least a (2/5)-fraction of the bags. If all the bags are non-monochromatic (i.e., bags of size two with differently labeled feature-vectors) the algorithm satisfies at least a (1/2)-fraction of them. For the special case of OR over d-dimensional boolean vectors, we give an algorithm which computes an LTF achieving an additional Ω(1/d) in accuracy for the two cases. Our main result provides evidence that these algorithmic bounds cannot be significantly improved, even for learning monotone ORs using LTFs. We prove that it is NP-hard, given a collection of non-monochromatic bags which are all satisfied by some monotone OR, to compute any function of constantly many LTFs that satisfies a (1/2 + ε)-fraction of the bags for any constant ε > 0. This bound is tight for the non-monochromatic bags case. The above is in contrast to the usual supervised learning setup (i.e., unit-sized bags) in which LTFs are efficiently learnable to arbitrary accuracy using linear programming, and even a trivial algorithm (any LTF or its complement) achieves an accuracy of 1/2. These techniques, however, fail in the LLP setting. Indeed, we show that LLP learning of LTFs (even for the special case of monotone ORs) using LTFs dramatically increases in complexity as soon as bags of size two are allowed. Our work gives the first inapproximability result for LLP learning of LTFs, and a strong complexity separation between LLP and traditional supervised learning.
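For contrast with the LLP hardness results above: in the usual supervised setting (unit-sized bags), a consistent LTF can indeed be found by linear programming. A minimal feasibility-LP sketch, assuming scipy and linearly separable data:

```python
# Supervised (unit-sized bag) contrast: a consistent LTF via a feasibility LP,
# assuming linearly separable data. Uses scipy's linprog; names are illustrative.
import numpy as np
from scipy.optimize import linprog

def learn_ltf_lp(X, y):
    """X: (n, d) features; y: (n,) labels in {0, 1}. Returns (w, theta) with
    sign(X @ w - theta) consistent with y when the data is separable."""
    n, d = X.shape
    s = 2 * y - 1                                   # labels in {-1, +1}
    # Enforce s_i * (w . x_i - theta) >= 1, written as A_ub @ [w, theta] <= -1.
    A_ub = np.hstack([-s[:, None] * X, s[:, None]])
    b_ub = -np.ones(n)
    c = np.zeros(d + 1)                             # feasibility only
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return (res.x[:d], res.x[d]) if res.success else None
```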