Jasmijn Bastings
Authored Publications
Sort By
Preview abstract
Misgendering is the act of referring to someone in way that does not reflect their gender identity. Translation systems, including foundation models capable of translation, can produce errors that result in misgendering harms. To measure the extent of such potential harms when translating into and out of English, we introduce a dataset, MiTTenS, covering 26 languages. The dataset is constructed with handcrafted passages that target known failure patterns, longer synthetically generated passages, and natural passages sourced from multiple domains. We demonstrate the usefulness of the dataset by evaluating both dedicated neural machine translation systems and foundation models, and show that all systems exhibit errors resulting in misgendering harms, even in high resource languages.
View details
Scaling Vision Transformers to 22 Billion Parameters
Josip Djolonga
Basil Mustafa
Piotr Padlewski
Justin Gilmer
Mathilde Caron
Rodolphe Jenatton
Lucas Beyer
Michael Tschannen
Anurag Arnab
Carlos Riquelme
Gamaleldin Elsayed
Fisher Yu
Avital Oliver
Fantine Huot
Mark Collier
Vighnesh Birodkar
Yi Tay
Alexander Kolesnikov
Filip Pavetić
Thomas Kipf
Xiaohua Zhai
Neil Houlsby
Arxiv (2023)
Preview abstract
The scaling of Transformers has driven breakthrough capabilities for language models.
At present, the largest large language models (LLMs) contain upwards of 100B parameters.
Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features) ViT22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT22B demonstrates the potential for "LLM-like'' scaling in vision, and provides key steps towards getting there.
View details
Preview abstract
We investigate a formalism for the conditions of a successful explanation of AI. We consider “success” to depend not only on what information the explanation contains, but also on what information the human explainee understands from it. Theory of mind literature discusses the folk concepts that humans use to understand and generalize behavior. We posit that folk concepts of behavior provide us with a “language” that humans understand behavior with. We use these folk concepts as a framework of social attribution by the human explainee—the information constructs that humans are likely to comprehend from explanations—by introducing a blueprint for an explanatory narrative (Figure 1) that explains AI behavior with these constructs. We then demonstrate that many XAI methods today can be mapped to folk concepts of behavior in a qualitative evaluation. This allows us to uncover their failure modes that prevent current methods from explaining successfully—ie, the information constructs that are missing for any given XAI method, and whose inclusion can decrease the likelihood of misunderstanding AI behavior.
View details
Will you Find these Shortcuts? A Protocol for Evaluating Faithfulness of Input Salience Methods for Text Classification
Sebastian Ebert
Proceedings of EMNLP 2022 (to appear)
Preview abstract
Feature attribution a.k.a. input salience methods which assign an importance score to a feature are abundant but may produce surprisingly different results for the same model on the same input. While differences are expected if disparate definitions of importance are assumed, most methods claim to provide faithful attributions and point at features most relevant for a model's prediction. Existing work on faithfulness evaluation is not conclusive and does not provide a clear answer as to how different methods are to be compared.
Focusing on text classification and the model debugging scenario, we propose a protocol for faithfulness evaluation which makes use of partially synthetic data to obtain ground truth for feature importance ranking.
Following the protocol, we do an in-depth analysis of four standard salience method classes on a range of datasets and shortcuts for BERT and LSTM models. We demonstrate that some of the most common method configurations provide poor results even for simplest shortcuts while a method judged to be too simplistic works remarkably well for BERT.
View details
The MultiBERTs: BERT Reproductions for Robustness Analysis
Steve Yadlowsky
Jason Wei
Naomi Saphra
Iulia Raluca Turc
Preview abstract
Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this question, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with similar hyper-parameters as the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics.
View details
Preview abstract
Transformer-based language models are being pre-trained on ever growing datasets (hundreds of gigabytes) using ever growing amounts of parameters (millions to billions).
These large amounts of training data are typically scraped from the public web, and may contain (public) personally identifiable information such as names and phone numbers.
Moreover, recent findings also show that the capacity of these models allow them to memorize parts of the training data. One defense against such memorization that has not yet been fully explored in this context is differential privacy (DP).
We focus on T5, a popular encoder-decoder, and show that by using recent advances in JAX and XLA we can train models with DP that do not suffer a big drop in utility, nor in training speed, and can still be fine-tuned to high accuracies on downstream tasks such as GLUE. Moreover, we show that T5's span corruption pre-training task, unlike next token prediction, is a good defense against data memorization.
View details
We need to talk about random splits
Anders Østerskov Søgaard
Sebastian Ebert
Proceeding of the 2021 Conference of the European Chapter of the Association for Computational Linguistics (EACL) (to appear)
Preview abstract
Gorman and Bedrick (2019) argued for using random splits rather than standard splits in NLP experiments. We argue that random splits, like standard splits, lead to overly optimistic performance estimates. We can also split data in biased or adversarial ways, e.g., training on short sentences and evaluating on long ones. Biased sampling has been used in domain adaptation to simulate real-world drift; this is known as the covariate shift assumption. In NLP, however, even worst-case splits, maximizing bias, often under-estimate the error observed on new samples of in-domain data, i.e., the data that models should minimally generalize to at test time. This invalidates the covariate shift assumption. Instead of using multiple random splits, future benchmarks should ideally include multiple, independent test sets instead; if infeasible, we argue that multiple biased splits leads to more realistic performance estimates than multiple random splits.
View details
The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?
Proceedings of the 2020 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
Preview abstract
There is a recent surge of papers that focus on attention as explanation of model predictions, giving mixed evidence on whether attention can be used as such. This has led some to try and `improve' attention so as to make it more interpretable. We argue that we should pay attention no heed.
While attention conveniently gives us one weight per input token and is easily extracted, it is often unclear towards what goal it is used as explanation. We argue that often that goal, whether explicitly stated or not, is to find out what input tokens are the most relevant to a prediction. When that is the case, input saliency methods better suit our needs, and there are no compelling reasons to use attention, despite the coincidence that it provides a weight for each input. With this position paper, we hope to shift some of the recent focus on attention to saliency methods, and for authors to clearly state the goal for their explanations.
View details
The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models
Andy Coenen
Sebastian Gehrmann
Ellen Jiang
Carey Radebaugh
Ann Yuan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics (to appear)
Preview abstract
We present the Language Interpretability Tool (LIT), an open-source platform for visualization and understanding of NLP models. We focus on core questions about model behavior: Why did my model make this prediction? When does it perform poorly? What happens under a controlled change in the input? LIT integrates local explanations, aggregate analysis, and counterfactual generation into a streamlined, browser-based interface to enable rapid exploration and error analysis. We include case studies for a diverse set of workflows, including exploring counterfactuals for sentiment analysis, measuring gender bias in coreference systems, and exploring local behavior in text generation. LIT supports a wide range of models--including classification, seq2seq, and structured prediction--and is highly extensible through a declarative, framework-agnostic API. LIT is under active development, with code and full documentation available at https://github.com/pair-code/lit.
View details