Jilin Chen

Authored Publications
    Abstract: Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice of verbalizers, and the ICL examples. To address this problem, which results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
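    The calibration step is simple enough to sketch. Below is a minimal illustration of a batch-level contextual-bias correction of the kind described above, assuming per-class log-probabilities are available for a batch of inputs; the function and variable names are illustrative, not taken from the paper's code.

    ```python
    import numpy as np

    def batch_calibrate(log_probs: np.ndarray) -> np.ndarray:
        """Illustrative batch-level calibration of LLM class scores.

        Args:
          log_probs: array of shape (batch_size, num_classes) holding the
            model's log-probabilities for each candidate label.

        Returns:
          Calibrated scores in which the batch-estimated contextual bias
          (the mean score per class across the batch) has been subtracted.
        """
        # Estimate the contextual bias from the batch itself: the average
        # score assigned to each class, regardless of the individual input.
        contextual_bias = log_probs.mean(axis=0, keepdims=True)
        # Remove the bias; argmax over the result gives the calibrated prediction.
        return log_probs - contextual_bias

    # Example: three inputs, two candidate labels.
    scores = np.array([[-0.2, -1.9], [-0.4, -1.6], [-1.1, -1.0]])
    print(batch_calibrate(scores).argmax(axis=1))
    ```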
    Abstract: Language models still struggle with moral reasoning, despite their impressive performance on many other tasks. In particular, the Moral Scenarios task in MMLU (Massive Multitask Language Understanding) is among the worst-performing tasks for many language models, including GPT-3. In this work, we propose a new prompting framework, Thought Experiments, to teach language models to do better moral reasoning using counterfactuals. Experimental results show that our framework elicits counterfactual questions and answers from the model, which in turn help improve accuracy on the Moral Scenarios task by 9-16% compared to other zero-shot baselines. Interestingly, unlike in math reasoning tasks, zero-shot Chain-of-Thought (CoT) reasoning does not work out of the box, and even reduces accuracy by around 4% compared to direct zero-shot prompting. We further observe that with minimal human supervision in the form of 5 few-shot examples, accuracy on the task can be improved to as much as 80%.
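    As a rough illustration of what a counterfactual prompting flow of this kind might look like, here is a hypothetical sketch; the prompt wording and the llm_complete helper are invented for illustration and are not the paper's prompts or API.

    ```python
    def llm_complete(prompt: str) -> str:
        """Placeholder for any LLM completion call (e.g., an API client)."""
        raise NotImplementedError

    def moral_judgment(scenario: str) -> str:
        # Step 1: elicit a counterfactual question about the scenario.
        question = llm_complete(
            f"Scenario: {scenario}\n"
            "Pose a counterfactual question: what if the key action had been "
            "different, and who would be affected?\nQuestion:")
        # Step 2: have the model answer its own counterfactual question.
        answer = llm_complete(
            f"Scenario: {scenario}\nQuestion: {question}\nAnswer:")
        # Step 3: use the counterfactual analysis to make the final judgment.
        return llm_complete(
            f"Scenario: {scenario}\nCounterfactual analysis: {answer}\n"
            "Is the described action morally acceptable? Answer yes or no:")
    ```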
    Towards A Scalable Solution for Improving Multi-Group Fairness in Compositional Classification
    Tina Tian
    Ben Packer
    Meghana Deodhar
    Alex Beutel
    The Second Workshop on Spurious Correlations, Invariance and Stability @ ICML 2023 (2023)
    Abstract: Despite the rich literature on machine learning fairness, relatively little attention has been paid to remediating complex systems, where the final prediction is the combination of multiple classifiers and where multiple groups are present. In this paper, we first show that natural baseline approaches for improving equal opportunity fairness scale linearly with the product of the number of remediated groups and the number of remediated prediction labels, rendering them impractical. We then introduce two simple techniques, called task-overconditioning and group-interleaving, that achieve constant scaling in this multi-group, multi-label setup. Our experimental results in academic and real-world settings demonstrate the effectiveness of the proposed techniques at bias mitigation in this setting.
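    The abstract does not detail task-overconditioning or group-interleaving, so they are not sketched here; the toy snippet below only illustrates the baseline scaling argument, where one remediation term (for example, an equal-opportunity gap penalty, assumed here) is added per remediated (group, label) pair. All names are placeholders.

    ```python
    # Toy illustration of the baseline's cost: one fairness penalty term per
    # remediated (group, label) pair, i.e., len(groups) * len(labels) terms.
    # `gap_penalty` stands in for, e.g., a true-positive-rate gap term.

    def baseline_remediation_terms(groups, labels, gap_penalty):
        terms = []
        for group in groups:
            for label in labels:
                terms.append(gap_penalty(group, label))
        # Grows linearly in len(groups) * len(labels), which is what the
        # proposed techniques are designed to avoid.
        return terms
    ```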
    A Mixed-Methods Approach to Understanding User Trust after Voice Assistant Failures
    Allison Mercurio
    Amanda Elizabeth Baughan
    Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (2023)
    Abstract: Despite huge gains in natural language understanding via large language models in recent years, voice assistants still often fail to meet user expectations. In this study, we conducted a mixed-methods analysis of how voice assistant failures affect users' trust in their voice assistants. To illustrate how users have experienced these failures, we contribute a crowdsourced dataset of 199 voice assistant failures, categorized across 12 failure sources. Drawing on interview and survey data, we find that certain failures, such as those due to overcapturing users' input, derail user trust more than others. We additionally examine how failures affect users' willingness to rely on voice assistants for future tasks. Users often stop using their voice assistants for the specific tasks that failed for a short period of time before resuming similar usage. We demonstrate the importance of low-stakes tasks, such as playing music, in rebuilding trust after failures.
    Abstract: Large pre-trained language models have shown remarkable performance over the past few years. These models, however, sometimes learn superficial features from the dataset and cannot generalize to distributions that are dissimilar to the training distribution. Several approaches have been proposed to reduce a model's reliance on these bias features, which can improve model robustness in the out-of-distribution setting. However, existing methods usually use a fixed low-capacity model to deal with various bias features, which ignores the learnability of those features. In this paper, we analyze a set of existing bias features and demonstrate that there is no single model that works best for all cases. We further show that by choosing an appropriate bias model, we can obtain a better robustness result than baselines with a more sophisticated model design.
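    For context, a common instance of the class of debiasing methods discussed above combines the main model with a separate bias-only model in a product-of-experts fashion at training time. The sketch below is a generic version of that idea, not this paper's specific method; names and shapes are assumptions.

    ```python
    import torch
    import torch.nn.functional as F

    def product_of_experts_loss(main_logits: torch.Tensor,
                                bias_logits: torch.Tensor,
                                labels: torch.Tensor) -> torch.Tensor:
        """Generic product-of-experts debiasing loss: combine the main model's
        and the bias-only model's log-probabilities, then compute cross-entropy
        on the combined scores. The bias model is detached so that only the
        main model is trained to explain what the bias features cannot."""
        combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(
            bias_logits.detach(), dim=-1)
        return F.cross_entropy(combined, labels)
    ```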
    A Human-ML Collaboration Framework for Improving Video Content Reviews
    Alex Beutel
    Alex Koes
    Meghana Deodhar
    Yixin Cai
    ACM CIKM 2022 Workshop Human-in-the-Loop Data Curation (2022)
    Abstract: We deal with the problem of localized in-video taxonomic human annotation in the video content moderation domain, where the goal is to identify video segments that violate granular policies, e.g., community guidelines on an online video platform. High-quality human labeling is critical for enforcement in content moderation. This is challenging due to information overload: raters need to apply a large taxonomy of granular policy violations with ambiguous definitions to relatively long videos within a limited review duration. Our key contribution is a novel human-machine learning (ML) collaboration framework aimed at maximizing the quality and efficiency of human decisions in this setting: human labels are used to train segment-level models, whose predictions are displayed as "hints" to human raters, indicating probable regions of the video with specific policy violations. The human-verified or corrected segment labels help refine the model further, creating a positive human-ML feedback loop. Experiments show improved human moderation decision quality and efficiency, with more granular annotations submitted within a similar review duration, which in turn enable a 5-8% AUC improvement in the hint-generation models.
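    A schematic of the human-ML feedback loop described above, with each component passed in as a placeholder callable; none of these names come from the paper.

    ```python
    def review_loop(videos, segment_model, human_review, train_segment_model,
                    num_rounds=3):
        """Schematic human-ML feedback loop. `segment_model` maps a video to
        predicted policy-violation segments ("hints"), `human_review` returns
        rater-verified/corrected segment labels, and `train_segment_model`
        refits the model on the accumulated labels."""
        labeled_segments = []
        for _ in range(num_rounds):
            for video in videos:
                # 1. The segment-level model proposes likely violating segments,
                #    which are shown to raters as hints.
                hints = segment_model(video)
                # 2. Raters review the video with the hints displayed and verify
                #    or correct the segment labels.
                labeled_segments.extend(human_review(video, hints))
            # 3. The verified/corrected labels refine the model, closing the loop.
            segment_model = train_segment_model(labeled_segments)
        return segment_model
    ```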
    Abstract: Most of the fairness literature has focused on improving fairness with respect to a single model or a single objective. However, real-world machine learning systems are usually composed of many different components. Unfortunately, recent research has shown that even if each component is "fair", the overall system can still be "unfair". In this paper, we focus on how well fairness composes over multiple components in real systems. We consider two recently proposed fairness metrics for rankings: exposure and pairwise ranking accuracy gap. We provide theory that demonstrates a set of conditions under which fairness of individual models does compose. We then present an analytical framework for both understanding whether a system's signals can achieve compositional fairness, and diagnosing which of these signals lowers the overall system's end-to-end fairness the most. Despite previously bleak theoretical results, on multiple datasets, including a large-scale real-world recommender system, we find that the overall system's end-to-end fairness is largely achievable by improving fairness in individual components.
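    The two ranking fairness metrics are only named above; one common formulation of each (not necessarily the exact definitions used in the paper) is:

    ```latex
    % Exposure of a group G under a position-discounted weighting, and the
    % corresponding between-group gap:
    \mathrm{Exp}(G) = \sum_{i \in G} \frac{1}{\log_2\bigl(1 + \mathrm{rank}(i)\bigr)},
    \qquad
    \Delta_{\mathrm{Exp}} = \mathrm{Exp}(G_1) - \mathrm{Exp}(G_2).

    % Pairwise ranking accuracy for group G: the probability that a relevant
    % item from G is ranked above a competing less-relevant item; the fairness
    % metric is the gap between groups:
    \mathrm{PairAcc}(G) = \Pr\bigl[\mathrm{rank}(i) < \mathrm{rank}(j) \mid i \in G,\ y_i > y_j\bigr],
    \qquad
    \Delta_{\mathrm{Pair}} = \mathrm{PairAcc}(G_1) - \mathrm{PairAcc}(G_2).
    ```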
    Abstract: As multi-task models gain popularity in a wider range of machine learning applications, it is becoming increasingly important for practitioners to understand the fairness implications associated with those models. Most existing fairness literature focuses on learning a single task more fairly, while how ML fairness interacts with multiple tasks in the joint learning setting is largely under-explored. In this paper, we are concerned with how group fairness (e.g., equal opportunity, equalized odds) as an ML fairness concept plays out in the multi-task scenario. In multi-task learning, several tasks are learned jointly to exploit task correlations for more efficient inductive transfer. This presents a multi-dimensional Pareto frontier over (1) the trade-off between group fairness and accuracy with respect to each task, and (2) the trade-offs across multiple tasks. We aim to provide a deeper understanding of how group fairness interacts with accuracy in multi-task learning, and we show that traditional approaches that mainly focus on optimizing the Pareto frontier of multi-task accuracy might not perform well on fairness goals. We propose a new set of metrics to better capture the multi-dimensional Pareto frontier of fairness-accuracy trade-offs uniquely presented in a multi-task learning setting. We further propose a Multi-Task-Aware Fairness (MTA-F) approach to improve fairness in multi-task learning. Experiments on several real-world datasets demonstrate the effectiveness of our proposed approach.
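    As a concrete reading of the per-task fairness quantity involved in this trade-off, the equal opportunity gap for task t between groups a and b can be written as follows (standard notation, not necessarily the paper's):

    ```latex
    % Equal-opportunity gap for task t between groups a and b:
    \Delta_t = \bigl| \Pr(\hat{Y}_t = 1 \mid Y_t = 1, A = a)
                    - \Pr(\hat{Y}_t = 1 \mid Y_t = 1, A = b) \bigr|.

    % The multi-dimensional trade-off is then over the vector
    % (\mathrm{Acc}_1, \ldots, \mathrm{Acc}_T, \Delta_1, \ldots, \Delta_T),
    % maximizing the accuracies while minimizing the fairness gaps.
    ```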
    Measuring Model Fairness under Noisy Covariates: A Theoretical Perspective
    Aditee Ajit Kumthekar
    Alex Beutel
    Li Wei
    Nick Blumm
    Pranjal Awasthi
    Trevor Potter
    AIES (2021)
    Abstract: In this work we study the problem of measuring the fairness of a machine learning model under noisy information. In many applications, evaluating a model according to a well-specified metric such as the false positive rate (FPR) requires access to variables that cannot be jointly observed in a given practical setting. A standard workaround is to use proxies for one or more of these variables. These proxies are obtained either using domain expertise or by training another machine learning model. Prior works have demonstrated the dangers of such an approach, and strong independence assumptions are needed to provide guarantees on the accuracy of the noisy estimates obtained via proxies. In contrast, in this work we present a general theoretical framework that aims to characterize weaker conditions under which accurate model auditing is possible via the above approach. Furthermore, our theory identifies potential sources of error and decouples them into two interpretable parts, ε_c and ε_g. The first part depends on natural properties of the proxy such as precision and recall, whereas the second part captures correlations between different variables of interest. We show that in many scenarios the error in the estimates is dominated by ε_c via a linear dependence, whereas the dependence on the correlations only constitutes a lower-order term. As a result, we expand the understanding of scenarios where model auditing via proxies can be an effective approach. Finally, we compare via simulations the theoretical upper bounds to the distribution of simulated estimation errors, and show that both the theoretical guarantees and the empirical results improve significantly as we progressively enforce structure along the conditions highlighted by the theory.
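    Schematically, the decomposition described above can be read as follows; this is a paraphrase of the abstract, not the paper's exact bound.

    ```latex
    % Error of a proxy-based estimate of a metric m (e.g., the FPR):
    \bigl| \hat{m}_{\mathrm{proxy}} - m \bigr| \;\lesssim\; \epsilon_c + \epsilon_g,
    % where \epsilon_c depends on proxy quality (precision and recall) and
    % enters linearly, while \epsilon_g captures correlations between the
    % variables of interest and is typically a lower-order term.
    ```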
    Abstract: NLP models are known to suffer from robustness issues; for example, a model's prediction can be easily changed by small perturbations to the input. In this work, we present a Controlled Adversarial Text Generation (CAT-Gen) model that, given an input text, generates adversarial texts through controllable attributes that are known to be invariant to task labels. For example, for a main task like sentiment classification, such an attribute can be the product category/domain, and a model should have similar performance across categories; for a coreference resolution task, a model's performance should not differ across demographic attributes. Unlike many existing adversarial text generation approaches, our model generates adversarial texts that are more fluent and diverse, with better task-label invariance guarantees. We aim to use this model to generate counterfactual texts that can better improve robustness in NLP models (e.g., through adversarial training), and we argue that our generation creates more natural attacks.
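    A rough sketch of how attribute-controlled adversarial generation of this kind can be used, with the generator and classifier passed in as placeholder callables; the names are illustrative, not the paper's API.

    ```python
    def attribute_controlled_attack(text, true_label, attribute_values,
                                    rewrite, classify):
        """Generate rewrites of `text` across values of a label-invariant
        controllable attribute (e.g., product category) and keep those that
        change the model's prediction, i.e., successful adversarial examples."""
        adversarial = []
        for value in attribute_values:
            # Rewrite the input conditioned on the attribute value; the task
            # label should be unaffected because the attribute is label-invariant.
            candidate = rewrite(text, attribute=value)
            if classify(candidate) != true_label:
                adversarial.append(candidate)
        return adversarial
    ```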