Aravindan Raghuveer

Authored Publications
    Learning from Label Proportions (LLP) is a learning problem where only aggregate-level labels are available for groups of instances, called bags, during training, and the aim is to get the best instance-level performance on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. In the first step (Pseudo Labeling) of every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information, through the constraint that instances with similar covariates should have similar labels, and b) the bag-level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution and obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. We then iterate on the two steps, using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines for the LLP binary classification problem on various dataset types: Small Tabular, Large Tabular and Images. Because of Belief Propagation, we achieve these improvements with minimal computational overhead above standard supervised learning, for large bag sizes and even for a million samples.
    Covariate shift in the test data is a common practical phenomenon that can significantly degrade both the accuracy and the fairness performance of a model. Ensuring fairness across different sensitive groups under covariate shift is of paramount importance due to societal implications in domains like criminal justice. We operate in the unsupervised regime where only a small set of unlabeled test samples along with a labeled training set is available. Towards improving fairness under this highly challenging yet realistic scenario, we make three contributions. The first is a novel composite weighted entropy based objective for prediction accuracy, which is optimized along with a representation matching loss for fairness. We experimentally verify that optimizing with our loss formulation outperforms a number of state-of-the-art baselines in the Pareto sense with respect to the fairness-accuracy tradeoff on several standard datasets. Our second contribution is a new setting we term Asymmetric Covariate Shift that, to the best of our knowledge, has not been studied before. Asymmetric covariate shift occurs when the distribution of covariates of one group shifts significantly compared to the other groups, which happens when a dominant group is over-represented. While this setting is extremely challenging for current baselines, we show that our proposed method significantly outperforms them. Our third contribution is theoretical: we show that our weighted entropy term along with prediction loss on the training set approximates test loss under covariate shift. Empirically and through formal sample complexity bounds, we show that this approximation to the unseen test loss does not depend on importance sampling variance, which affects many other baselines.
    Bi-Phone: Modeling Inter Language Phonetic Influences in Text
    Ananya B. Sai
    Yuri Vasilevski
    James Ren
    Ambarish Jash
    Sukhdeep Sodhi
    ACL, Association for Computational Linguistics, Toronto, Canada (2023), 2580–2592
    A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE, for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly on it. We also introduce a new phoneme prediction pre-training task which helps byte models recover performance close to that on SuperGLUE. Finally, we release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
    Learning from label proportions (LLP) is a generalization of supervised learning in which the training data is available as sets or bags of feature-vectors (instances) along with the average instance-label of each bag. The goal is to train a good instance classifier. While most previous works in LLP have focused on training models on such training data, computational learnability in LLP has only recently been explored by [Saket21, Saket22], who showed worst-case intractability of properly learning linear threshold functions (LTFs) from label proportions while not ruling out efficient algorithms for this problem under distributional assumptions. In this work we show that it is indeed possible to efficiently learn LTFs using LTFs when given access to random bags of some label proportion in which feature-vectors are independently sampled from a fixed Gaussian distribution N(mu, Sigma), conditioned on the label assigned by the target LTF. Our method estimates a matrix by sampling pairs of feature-vectors from the bags with and without replacement, and we prove that the principal component of this matrix necessarily yields the normal vector of the LTF. For some special cases with N(0, I) we provide a simpler expectation-based algorithm. We include an experimental evaluation of our learning algorithms along with a comparison with those of [Saket21, Saket22] and random LTFs, demonstrating the effectiveness of our techniques.
    We study the problem of adversarial attack and robustness on tabular datasets with discrete features. The discrete features of a tabular dataset represent high-level meaningful concepts with different vocabularies, and therefore require non-uniform robustness. Further, the notion of distance between tabular input instances is not well defined, making the problem of producing adversarial examples with minor perturbations qualitatively more challenging than in the settings addressed by existing methods. Towards this, our paper defines the notion of distance through the lens of feature embeddings, learnt to represent the discrete features. We then formulate the task of generating adversarial examples as a binary set selection problem under non-uniform feature importance. Next, we propose an efficient approximate gradient-descent based algorithm, called the Discrete Non-uniform Approximation (DNA) attack, which reformulates the problem in a continuous domain to solve the original optimization problem for generating adversarial examples. We demonstrate the effectiveness of our proposed DNA attack using two large real-world discrete tabular datasets from e-commerce domains for binary classification, where the datasets are heavily biased towards one class. We also analyze challenges for existing adversarial training frameworks for such datasets under our DNA attack.
    Unavailability of parallel corpora for training text style transfer (TST) models is a very challenging yet common scenario. Also, TST models implicitly need to preserve the content while transforming a source sentence into the target style. To tackle these problems, an intermediate representation is often constructed that is devoid of style while still preserving the meaning of the source sentence. In this work, we study the usefulness of an Abstract Meaning Representation (AMR) graph as the intermediate style-agnostic representation. We posit that semantic notations like AMR are a natural choice for an intermediate representation. Hence, we propose the T-STAR model comprising two components, text-to-AMR and AMR-to-text. We ensure that the intermediate representation is style agnostic, and use style-aware pretraining to improve the AMR-to-text performance. We show that the proposed model outperforms state-of-the-art TST models with improved content preservation and style accuracy numbers via automatic and human evaluations.
    We study the weak supervision learning problem of Learning from Label Proportions (LLP), where the goal is to learn an instance-level classifier using proportions of various class labels in a bag, a collection of input instances that can often be highly correlated. While representation learning for weakly-supervised tasks is found to be effective, such methods often require domain knowledge. To the best of our knowledge, representation learning for tabular data (unstructured data containing both continuous and categorical features) has not been studied. In this paper, we propose to learn diverse representations of instances within the same bags to effectively utilize the weak bag-level supervision. We propose a domain-agnostic LLP method, called "Self Contrastive Representation Learning for LLP" (SelfCLR-LLP), that incorporates a novel self-contrastive function as an auxiliary loss to learn representations on tabular data for LLP. We show that diverse representations for instances within the same bags aid efficient usage of the weak bag-level LLP supervision. We evaluate the proposed method through extensive experiments on real-world LLP datasets from e-commerce applications to demonstrate the effectiveness of SelfCLR-LLP.
    CoCoa : An Encoder-Decoder Model for Controllable Code-switched Generation
    Sneha Mondal
    Shreya Pathak
    Ritika Goyal
    Preethi Jyothi
    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, December 7 - December 11, 2022
    Code-switching has seen growing interest in recent years as an important multilingual NLP phenomenon. Generating code-switched text for data augmentation has been sufficiently well-explored. However, there is no prior work on generating code-switched text with fine-grained control on the degree of code-switching and the lexical choices used to convey formality. We present CoCoa, an encoder-decoder translation model that converts monolingual Hindi text to Hindi-English code-switched text with both encoder-side and decoder-side interventions to achieve fine-grained controllable generation. CoCoa can be invoked at test time to synthesize code-switched text that is simultaneously faithful to syntactic and lexical attributes relevant to code-switching. CoCoa outputs were subjected to rigorous subjective and objective evaluations. Human evaluations establish that our outputs are of superior quality while being faithful to desired attributes. We show significantly improved BLEU scores when compared with human-generated code-switched references. Compared to competitive baselines, we show a 10% reduction in perplexity on a language modeling task and also demonstrate clear improvements on a downstream code-switched sentiment analysis task.
    Walking with PACE — Personalized and Automated Coaching Engine
    Deepak Nathani
    Eshan Motwani
    Karina Lorenzana Livingston
    Madhurima Vardhan
    Martin Gamunu Seneviratne
    Nur Muhammad
    Rahul Singh
    Shantanu Prabhat
    Srujana Merugu
    UMAP: 30th ACM Conference on User Modeling, Adaptation and Personalization (2022)
    Fitness coaching is effective in helping individuals to develop and maintain healthy lifestyle habits. However, there is a significant shortage of fitness coaches, particularly in low-resource communities. Automated coaching assistants may help to improve the accessibility of personalized fitness coaching. Although a variety of automated nudge systems have been developed, few make use of formal behavior science principles and they are limited in their level of personalization. In this work, we introduce a computational framework leveraging Fogg's behavioral science model which serves as a personalized and automated coaching engine (PACE). PACE is a rule-based system that infers user state and suggests appropriate text nudges. We compared the effectiveness of PACE to human coaches in a Wizard-of-Oz deployment study with 33 participants over 21 days. Participants were randomized to either a human coach ('human' arm, n=18) or the PACE framework handled by a human coach ('wizard' arm, n=15). Coaches and participants interacted via a chat interface. We tracked coach-participant conversations, step counts and qualitative survey feedback. Our findings indicate that the PACE framework strongly emulated human coaching, with no significant differences in the overall number of active days (PACE: 85.33% vs human: 92%, [p=NS]) and step count (PACE: 6674 vs human: 6605, [p=NS]) of participants from both groups. The qualitative user feedback suggests that PACE cultivated a coach-like experience, offering barrier resolution, motivation and educational support. As a post-hoc analysis, we annotated the conversation logs from the human coaching arm based on the Fogg framework, and then trained machine learning (ML) models on these data sets to predict the next coach action (AUC 0.73±0.02). This suggests that an ML-driven approach may be a viable alternative to a rule-based system in suggesting personalized nudges. In the future, such an ML system could be made increasingly personalized and adaptive based on user behaviors.
    We formulate a new inference task in the domain of multivariate time series forecasting (MTSF), called Variable Subset Forecast (VSF), where only a subset of the variables is available during inference. Variables are absent during inference because of intermittent data collection issues (e.g., sensor failures) or domain shift between train and test. To the best of our knowledge, the robustness of MTSF models in the presence of such failures has not been studied in the literature. Through extensive evaluation, we first show that the performance of state-of-the-art methods degrades significantly in this setting. We propose a non-parametric wrapper technique that can be applied on top of any existing forecast model. Through thorough experiments across 4 datasets and 5 forecast models, we show that our technique is able to recover close to 95% of the performance of the underlying models even when only 15% of the original variables are present.
    In the framework of learning from label proportions (LLP), the goal is to learn a good instance-level label predictor from the observed label proportions of bags of instances. Most LLP algorithms either explicitly or implicitly assume the nature of bag distributions with respect to the actual labels and instances, or cleverly adapt supervised learning techniques to suit LLP. In practical applications, however, the scale and nature of data could render such assumptions invalid and many of the algorithms impractical. In this paper we address the hard problem of solving LLP with provable error bounds while being bag distribution agnostic and model agnostic. We first propose the concept of generalized bags, an extension of bags, and then devise an algorithm to combine bag distributions, if possible, into good generalized bag distributions. We show that, with high probability, any classifier optimizing the squared Euclidean label-proportion loss on such a generalized bag distribution is guaranteed to minimize the instance-level loss as well. The predictive quality of our method is experimentally evaluated, and it equals or betters the previous methods on pseudo-synthetic and real-world datasets.
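    As a rough illustration of a squared label-proportion loss, here is a minimal Python sketch for the binary case. It is not the paper's method: generalized bags are not modeled, and the function name `bag_proportion_loss`, the array layout, and the toy numbers are all assumptions made for this example.

```python
import numpy as np

def bag_proportion_loss(pred_probs, bag_ids, bag_proportions):
    """Average squared gap between predicted and observed label proportions.

    pred_probs: per-instance predicted probability of the positive class.
    bag_ids: bag index of each instance (hypothetical layout).
    bag_proportions: dict mapping bag index -> observed positive proportion.
    """
    total = 0.0
    for bag, target in bag_proportions.items():
        mask = bag_ids == bag
        # Predicted proportion for a bag is the mean instance prediction.
        total += (pred_probs[mask].mean() - target) ** 2
    return total / len(bag_proportions)

# Toy data: two bags of three instances each (made-up values).
pred = np.array([0.9, 0.1, 0.8, 0.2, 0.5, 0.5])
bags = np.array([0, 0, 0, 1, 1, 1])
props = {0: 2 / 3, 1: 1 / 3}
loss = bag_proportion_loss(pred, bags, props)
```

    For multi-class problems the same idea would compare proportion vectors with a squared Euclidean distance; the scalar version above is the binary special case.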
    Back-translation (BT) of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT), especially for low-resource language pairs. To improve the effectiveness of the available BT data, we introduce HintedBT, a family of techniques which provides hints (through tags) to the encoder and decoder. First, we propose a novel method of using both high and low quality BT data by providing hints (as encoder tags) to the model about the quality of each source-target pair. We don't filter out low quality data but instead show that these hints enable the model to learn effectively from noisy data. Second, we address the problem of predicting whether a source token needs to be translated or transliterated to the target language, which is common in cross-script translation tasks (i.e., where source and target do not share the written script). For such cases, we propose training the model with additional hints (as decoder tags) that provide information about the operation required on the source (translation or both translation and transliteration). We conduct experiments and detailed analyses on standard WMT benchmarks for three cross-script low/medium-resource language pairs: {Hindi, Gujarati, Tamil}→English. Our methods compare favorably with five strong and well established baselines. We show that using these hints, both separately and together, significantly improves translation quality and leads to state-of-the-art performance in all three language pairs in corresponding bilingual settings.
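    The idea of passing a quality hint to the encoder as a tag can be sketched in Python as below. This is only an illustration under stated assumptions: the tag format `<bin_k>`, the thresholds, the quality score, and the helper names `quality_tag` and `add_encoder_hint` are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right

def quality_tag(score, thresholds):
    """Map a quality score to a bin tag such as '<bin_0>'.

    thresholds must be sorted ascending; a score equal to a threshold
    falls into the higher bin.
    """
    return f"<bin_{bisect_right(thresholds, score)}>"

def add_encoder_hint(source, score, thresholds):
    """Prepend the quality tag to the source sentence."""
    return f"{quality_tag(score, thresholds)} {source}"

# Hypothetical thresholds splitting scores into three quality bins.
thresholds = [0.4, 0.7]
tagged = add_encoder_hint("yah ek udaharan hai", 0.82, thresholds)
```

    Keeping low quality pairs and labelling them, rather than filtering them out, is what lets the model see all of the back-translated data while still being told how much to trust each pair.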