Ben Hutchinson
Ben Hutchinson is a Research Scientist in Google Research, on Google's Responsible AI and Human-Centered Technology team. His research draws on various disciplines to inform the ethical development of AI. Prior to joining Google Research, he spent ten years working on a variety of products, including Google Wave, Google Maps, Knowledge Graph, Google Search, and Social Impact. He now uses this experience to work closely with product teams as a consultant on responsible practices and the development of responsible datasets and machine learning models. He has a PhD in Natural Language Processing from the University of Edinburgh, and undergraduate degrees in linguistics and mathematics.
Authored Publications
Google Publications
    Testing, within the machine learning (ML) community, has been predominantly about assessing a learned model's predictive performance measured against a test dataset. This test dataset is often a held-out subset of the dataset used to train the model, and hence expected to follow the same data distribution as the training dataset. While recent work on robustness testing within ML has pointed to the importance of testing against distributional shifts, these efforts also focus on estimating the likelihood of the model making an error against a reference dataset/distribution. In this paper, we argue that this view of testing actively discourages researchers and developers from looking into many other sources of robustness failures, for instance corner cases. We draw parallels with decades of work within software engineering testing focused on assessing a software system against various stress conditions, including corner cases, as opposed to solely focusing on average-case behaviour. Finally, we put forth a set of recommendations to broaden the view of machine learning testing into a rigorous practice.
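As an illustration of the contrast this abstract draws between average-case evaluation and stress-style testing, here is a minimal Python sketch (not taken from the paper); the model, labels, and corner cases are hypothetical placeholders.

```python
# Illustrative sketch (not from the paper): contrasting aggregate test-set
# accuracy with explicit corner-case checks written in the style of software tests.
# `model` and the example inputs below are hypothetical placeholders.

def aggregate_accuracy(model, test_set):
    """Average-case evaluation: accuracy over a held-out dataset."""
    correct = sum(model.predict(x) == y for x, y in test_set)
    return correct / len(test_set)

def test_corner_cases(model):
    """Stress-style tests: each case must pass individually, however rare
    it may be under the training distribution."""
    corner_cases = [
        ("", "neutral"),                   # empty input
        ("GREAT!!! " * 500, "positive"),   # extreme length / repetition
        ("not bad at all", "positive"),    # negation
    ]
    failures = [(x, y) for x, y in corner_cases if model.predict(x) != y]
    assert not failures, f"Corner-case failures: {failures}"
```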
    PaLM: Scaling Language Modeling with Pathways
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Michele Catasta
    Jason Wei
    arXiv:2204.02311 (2022)
    Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
    LaMDA: Language Models for Dialog Applications
    Aaron Daniel Cohen
    Alena Butryna
    Alicia Jin
    Apoorv Kulshreshtha
    Ben Zevenbergen
    Chung-ching Chang
    Cosmo Du
    Daniel De Freitas Adiwardana
    Dehao Chen
    Dmitry (Dima) Lepikhin
    Erin Hoffman-John
    Igor Krivokon
    James Qin
    Jamie Hall
    Joe Fenton
    Johnny Soraker
    Maarten Paul Bosma
    Marc Joseph Pickett
    Marcelo Amorim Menegali
    Marian Croak
    Maxim Krikun
    Noam Shazeer
    Rachel Bernstein
    Ravi Rajakumar
    Ray Kurzweil
    Romal Thoppilan
    Steven Zheng
    Taylor Bos
    Toju Duke
    Tulsee Doshi
    Vincent Y. Zhao
    Will Rusch
    Yuanzhong Xu
    arXiv (2022)
    We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvement on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
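A minimal sketch of the candidate-filtering idea described above, assuming a hypothetical safety classifier that returns a score in [0, 1]; this is illustrative only, not the actual LaMDA implementation.

```python
# Sketch of response filtering with a safety classifier. `safety_score`,
# `generate_candidates`, `rank`, and the threshold are hypothetical placeholders.

SAFETY_THRESHOLD = 0.8  # hypothetical operating point

def filter_candidates(candidates, safety_score, threshold=SAFETY_THRESHOLD):
    """Keep only candidate responses the classifier rates as safe."""
    return [c for c in candidates if safety_score(c) >= threshold]

def respond(prompt, generate_candidates, safety_score, rank):
    """Generate candidates, filter for safety, then rank the survivors."""
    candidates = generate_candidates(prompt)        # e.g. several sampled responses
    safe = filter_candidates(candidates, safety_score)
    if not safe:
        return "I'm not able to help with that."    # hypothetical fallback
    return max(safe, key=rank)                      # e.g. a quality score
```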
    Questions regarding implicitness, ambiguity and underspecification are crucial for multimodal image+text systems, but have received little attention to date. This paper maps out a conceptual framework to address this gap for systems which generate images from text inputs, specifically for systems which generate images depicting scenes from descriptions of those scenes. In doing so, we account for how texts and images convey different forms of meaning. We then outline a set of core challenges concerning textual and visual ambiguity and specificity tasks, as well as risks that may arise from improper handling of ambiguous and underspecified elements. We propose and discuss two strategies for addressing these challenges: a) generating a visually ambiguous output image, and b) generating a set of diverse output images.
    In order to build trust that a machine learned model is appropriate and responsible within a systems context involving technical and human components, a broad range of factors typically need to be considered. However, in practice model evaluations frequently focus on only a narrow range of expected predictive behaviours. This paper examines the critical evaluation gap between the idealized breadth of concerns and the observed narrow focus of actual evaluations. In doing so, we demonstrate which values are centered—and which are marginalized—within the machine learning community. Through an empirical study of machine learning papers from recent high profile conferences, we demonstrate the discipline’s general focus on a small set of evaluation methods. By considering the mathematical formulations of evaluation metrics and the test datasets over which they are calculated, we draw attention to which properties of models are centered in the field. This analysis also reveals an important gap: the properties of models which are frequently neglected or sidelined during evaluation. By studying the structure of this gap, we demonstrate the machine learning discipline’s implicit assumption of a range of commitments which have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the irrelevance of non-predictive features, and the equivalence of different failure modes. Shedding light on these assumptions and commitments enables us to question their appropriateness for different ML system contexts, and points the way towards more diverse and contextualized evaluation methodologies which can be used to more robustly examine the trustworthiness of ML models.
    Conventional algorithmic fairness is West-centric, as seen in its sub-groups, values, and optimisations. In this paper, we de-center algorithmic fairness and analyse AI power in India. Based on 36 qualitative interviews and a discourse analysis of algorithmic deployments in India, we find that several assumptions of algorithmic fairness are challenged in India. We find that data is not always reliable due to socio-economic factors, users are given third world treatment by ML makers, and AI signifies unquestioning aspiration. We contend that localising model fairness alone can be window dressing in India, where the distance between models and oppressed communities is large. Instead, we re-imagine algorithmic fairness in India and provide a roadmap to re-contextualise data and models, empower oppressed communities, and enable Fair-ML ecosystems.
    Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However, the datasets which empower machine learning are often used, shared and re-used with little visibility into the processes of deliberation which led to their creation. Which stakeholder groups had their perspectives included when the dataset was conceived? Which domain experts were consulted regarding how to model subgroups and other phenomena? How were questions of representational biases measured and addressed? Who labeled the data? In this paper, we introduce a rigorous framework for dataset development transparency which supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields a set of documents that facilitate improved communication and decision-making, as well as drawing attention to the value and necessity of careful data work. The proposed framework is intended to contribute to closing the accountability gap in artificial intelligence systems, by making visible the often overlooked work that goes into dataset creation.
    Diversity and Inclusion Metrics for Subset Selection
    Margaret Mitchell
    Dylan Baker
    Nyalleng Moorosi
    Alex Hanna
    Timnit Gebru
    Jamie Morgenstern
    Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), ACM (2020)
    The concept of fairness has recently been applied in machine learning settings to describe a wide range of constraints and objectives. When applied to ranking, recommendation, or subset selection problems for an individual, it becomes less clear that fairness goals are more applicable than goals that prioritize diverse outputs and instances that represent the individual's goals well. In this work, we discuss the relevance of the concept of fairness to the concepts of diversity and inclusion, and introduce metrics that quantify the diversity and inclusion of an instance or set. Diversity and inclusion metrics can be used in tandem, including additional fairness constraints, or may be used separately, and we detail how the different metrics interact. Results from human subject experiments demonstrate that the proposed criteria for diversity and inclusion are consistent with social notions of these two concepts, and human judgments on the diversity and inclusion of example instances are correlated with the defined metrics.
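The abstract does not spell out the metric definitions, so the following is only a toy illustration of what quantifying the diversity of a selected subset could look like (attribute coverage against a hypothetical target distribution); it is not the paper's metric.

```python
# Toy illustration (not the paper's metrics): score how well a selected subset
# covers attribute values relative to a target distribution.
from collections import Counter

def coverage_diversity(subset, attribute, target_proportions):
    """1 minus the total variation distance between the subset's attribute
    distribution and a target distribution (1.0 = matches the target exactly)."""
    counts = Counter(item[attribute] for item in subset)
    total = sum(counts.values())
    tvd = 0.5 * sum(
        abs(counts.get(value, 0) / total - p)
        for value, p in target_proportions.items()
    )
    return 1.0 - tvd

# Hypothetical example: a selected panel of people and a uniform regional target.
people = [{"region": "NA"}, {"region": "NA"}, {"region": "EU"}, {"region": "AF"}]
print(coverage_diversity(people, "region",
                         {"NA": 0.25, "EU": 0.25, "AF": 0.25, "AS": 0.25}))  # 0.75
```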
    Fairness Preferences, Actual and Hypothetical: A Study of Crowdworker Incentives
    Angie Peng
    Jeff Naecker
    Nyalleng Moorosi
    Proceedings of ICML 2020 Workshop on Participatory Approaches to Machine Learning (to appear)
    How should we decide which fairness criteria or definitions to adopt in machine learning systems? To answer this question, we must study the fairness preferences of actual users of machine learning systems. Stringent parity constraints on treatment or impact can come with trade-offs, and may not even be preferred by the social groups in question (Zafar et al., 2017). Thus it might be beneficial to elicit what the group’s preferences are, rather than rely on a priori defined mathematical fairness constraints. Simply asking for self-reported rankings of users is challenging because research has shown that there are often gaps between people’s stated and actual preferences (Bernheim et al., 2013). This paper outlines a research program and experimental designs for investigating these questions. Participants in the experiments are invited to perform a set of tasks in exchange for a base payment—they are told upfront that they may receive a bonus later on, and the bonus could depend on some combination of output quantity and quality. The same group of workers then votes on a bonus payment structure, to elicit preferences. The voting is hypothetical (not tied to an outcome) for half the group and actual (tied to the actual payment outcome) for the other half, so that we can understand the relation between a group’s actual preferences and hypothetical (stated) preferences. Connections and lessons from fairness in machine learning are explored.
    Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing
    Becky White
    Inioluwa Deborah Raji
    Margaret Mitchell
    Timnit Gebru
    ACM Conference on Fairness, Accountability, and Transparency (ACM FAT*), Barcelona (2020)
    Rising concern for the societal implications of artificial intelligence systems has inspired a wave of academic and journalistic literature in which deployed systems are audited for harm by investigators from outside the organizations deploying the algorithms. However, it remains challenging for practitioners to identify the harmful repercussions of their own systems prior to deployment, and, once deployed, emergent issues can become difficult or impossible to trace back to their source. In this paper, we introduce a framework for algorithmic auditing that supports artificial intelligence system development end-to-end, to be applied throughout the internal organization development lifecycle. Each stage of the audit yields a set of documents that together form an overall audit report, drawing on an organization’s values or principles to assess the fit of decisions made throughout the process. The proposed auditing framework is intended to contribute to closing the accountability gap in the development and deployment of large-scale artificial intelligence systems by embedding a robust process to ensure audit integrity.
    Building equitable and inclusive technologies demands paying attention to how social attitudes towards persons with disabilities are represented within technology. Representations perpetuated by NLP models often inadvertently encode undesirable social biases from the data on which they are trained. In this paper, first we present evidence of such undesirable biases towards mentions of disability in two different NLP models: toxicity prediction and sentiment analysis. Next, we demonstrate that neural embeddings that are critical first steps in most NLP pipelines also contain undesirable biases towards mentions of disabilities. We then expose the topical biases in the social discourse about some disabilities which may explain such biases in the models; for instance, terms related to gun violence, homelessness, and drug addiction are over-represented in discussions about mental illness.
    In this paper we argue that standard calls for explainability that focus on the epistemic inscrutability of black-box machine learning models may be misplaced. If we presume, for the sake of this paper, that machine learning can be a source of knowledge, then it makes sense to wonder what kind of justification it involves. How do we rationalize, on the one hand, the seeming justificatory black box with the observed widespread adoption of machine learning? We argue that, in general, people implicitly adopt reliabilism regarding machine learning. Reliabilism is an epistemological theory of epistemic justification according to which a belief is warranted if it has been produced by a reliable process or method. We argue that, in cases where model deployments require moral justification, reliabilism is not sufficient, and instead justifying deployment requires establishing robust human processes as a moral “wrapper” around machine outputs. We then suggest that, in certain high-stakes domains with moral consequences, reliabilism does not provide another kind of necessary justification—moral justification. Finally, we offer cautions relevant to the (implicit or explicit) adoption of the reliabilist interpretation of machine learning.
    Model Cards for Model Reporting
    Elena Spitzer
    Inioluwa Deborah Raji
    M. Mitchell
    Simone Sanoian McCloskey Wu
    Timnit Gebru
    (2019)
    Trained machine learning models are increasingly used to perform high impact tasks such as determining crime recidivism rates and predicting health risks. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts they are not well-suited for, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards (or M-cards) to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic subgroups (e.g., race, geographic location, sex, Fitzpatrick skin tone) and intersectional subgroups (e.g., age and race, or sex and Fitzpatrick skin tone) that are relevant to the intended application domains. Model cards also disclose the context under which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for models trained to detect smiling faces on the CelebA dataset (Liu et al., 2015) and models trained to detect toxicity in the Conversation AI dataset (Dixon et al., 2018). We propose this work as a step towards the responsible democratization of machine learning and related AI technology, providing context around machine learning models and increasing the transparency into how well such models work. We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed documentation.
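A rough sketch of a model card as a structured record; the fields below are a simplified, illustrative subset inspired by the description above rather than the paper's full schema, and the example values are hypothetical.

```python
# Illustrative sketch of a model card as a structured record (simplified
# subset of fields; not the framework's full schema).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ModelCard:
    model_name: str
    intended_use: str
    out_of_scope_uses: List[str]
    training_data: str
    evaluation_data: str
    # Metric name -> {subgroup -> score}, e.g. disaggregated by skin tone.
    disaggregated_metrics: Dict[str, Dict[str, float]] = field(default_factory=dict)
    ethical_considerations: List[str] = field(default_factory=list)
    caveats: List[str] = field(default_factory=list)

# Hypothetical example values, for illustration only.
card = ModelCard(
    model_name="smile-detector-v1",
    intended_use="Research on attribute classification.",
    out_of_scope_uses=["Surveillance", "Emotion inference"],
    training_data="CelebA training split",
    evaluation_data="CelebA test split",
    disaggregated_metrics={"accuracy": {"subgroup A": 0.91, "subgroup B": 0.88}},
)
```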
    Data-driven statistical Natural Language Processing (NLP) techniques leverage large amounts of language data to build models that can understand language. However, most language data reflect the public discourse at the time the data was produced, and hence NLP models are susceptible to learning incidental associations around named referents at a particular point in time, in addition to general linguistic meaning. An NLP system designed to model notions such as sentiment and toxicity should ideally produce scores that are independent of the identity of such entities mentioned in text and their social associations. For example, in a general purpose sentiment analysis system, a phrase such as I hate Katy Perry should be interpreted as having the same sentiment as I hate Taylor Swift. Based on this idea, we propose a generic evaluation framework, Perturbation Sensitivity Analysis, which detects unintended model biases related to named entities, and requires no new annotations or corpora. We demonstrate the utility of this analysis by employing it on two different NLP models (a sentiment model and a toxicity model) applied to online comments in the English language from four different genres.
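A small sketch of the core perturbation idea: substitute different named entities into otherwise identical sentences and measure how much the model's score moves. The scoring function and name list are hypothetical placeholders, not the paper's exact setup.

```python
# Sketch of perturbation-based sensitivity measurement: score each sentence
# template once per substituted name and report how much the scores vary.
import statistics

def perturbation_sensitivity(templates, names, score, placeholder="PERSON"):
    """Mean per-template standard deviation of scores across name substitutions;
    0 would mean the model's scores ignore the named entity entirely."""
    deviations = []
    for template in templates:
        scores = [score(template.replace(placeholder, name)) for name in names]
        deviations.append(statistics.pstdev(scores))
    return statistics.mean(deviations)

templates = ["I hate PERSON", "PERSON is a great artist"]
names = ["Katy Perry", "Taylor Swift", "a stranger"]
# sensitivity = perturbation_sensitivity(templates, names, toxicity_score)
# `toxicity_score` is a hypothetical model scoring function.
```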
    Advances and Open Problems in Federated Learning
    Brendan Avent
    Aurélien Bellet
    Mehdi Bennis
    Arjun Nitin Bhagoji
    Graham Cormode
    Rachel Cummings
    Rafael G.L. D'Oliveira
    Salim El Rouayheb
    David Evans
    Josh Gardner
    Adrià Gascón
    Phillip B. Gibbons
    Marco Gruteser
    Zaid Harchaoui
    Chaoyang He
    Lie He
    Zhouyuan Huo
    Justin Hsu
    Martin Jaggi
    Tara Javidi
    Gauri Joshi
    Mikhail Khodak
    Jakub Konečný
    Aleksandra Korolova
    Farinaz Koushanfar
    Sanmi Koyejo
    Tancrède Lepoint
    Yang Liu
    Prateek Mittal
    Richard Nock
    Ayfer Özgür
    Rasmus Pagh
    Ramesh Raskar
    Dawn Song
    Weikang Song
    Sebastian U. Stich
    Ziteng Sun
    Florian Tramèr
    Praneeth Vepakomma
    Jianyu Wang
    Li Xiong
    Qiang Yang
    Felix X. Yu
    Han Yu
    arXiv (2019)
    Federated learning (FL) is a machine learning setting where many clients (e.g., mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g., service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and mitigates many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents a comprehensive list of open problems and challenges.
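As a concrete, heavily simplified illustration of the federated setting described above, here is a minimal federated-averaging sketch using plain-NumPy linear regression as a stand-in for a real model; it is not a production FL system and omits privacy, communication, and systems concerns.

```python
# Minimal sketch of federated averaging: each client updates the model on its
# own (decentralized) data, and the server averages the resulting weights.
import numpy as np

def client_update(weights, X, y, lr=0.1, epochs=5):
    """Local gradient steps on one client's data (mean squared error loss)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(weights, client_datasets):
    """One round: broadcast weights, collect local updates, average them."""
    updates = [client_update(weights, X, y) for X, y in client_datasets]
    return np.mean(updates, axis=0)

# Synthetic clients sharing an underlying linear relationship.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
print(w)  # should approach [1.0, -2.0]
```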
    Machine learning is often viewed as an inherently value-neutral process: statistical tendencies in the training inputs are "simply" used to generalize to new examples. However, when models impact social systems such as interactions between humans, these patterns learned by models have normative implications. It is important that we ask not only "what patterns exist in the data?", but also "how do we want our system to impact people?" In particular, because minority and marginalized members of society are often statistically underrepresented in data sets, models may have undesirable disparate impact on such groups. As such, objectives of social equity and distributive justice require that we develop tools for both identifying and interpreting harms introduced by models. This paper directly addresses the challenge of interpreting how human values are implicitly encoded by deep neural networks, a machine learning paradigm often seen as inscrutable. Doing so requires understanding how the node activations of neural networks relate to value-laden human concepts such as "respectful" and "abusive", as well as to concepts about human social identities such as "gay", "straight", "male", "female", etc. To do this, we present the first application of Testing with Concept Activation Vectors (TCAV; Kim et al., 2018) to models for analyzing human language.
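A brief sketch of the concept-activation-vector step that TCAV relies on, using random arrays as stand-ins for a language model's layer activations; the concept data, classifier head, and resulting scores are all placeholders, not results from the paper.

```python
# Sketch of the concept-activation-vector step of TCAV: fit a linear classifier
# that separates layer activations of "concept" examples from random examples,
# and take its weight vector as the concept direction. Random arrays stand in
# for a real model's hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
concept_acts = rng.normal(loc=0.5, size=(100, 64))   # e.g. activations of "respectful" texts
random_acts = rng.normal(loc=0.0, size=(100, 64))    # random counterexamples

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 100 + [0] * 100)
cav = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]   # concept direction

def tcav_score(activations, logit_fn, cav, eps=1e-2):
    """Fraction of inputs whose class logit increases when activations are
    nudged along the concept direction (finite-difference directional test)."""
    base = logit_fn(activations)
    shifted = logit_fn(activations + eps * cav / np.linalg.norm(cav))
    return np.mean(shifted - base > 0)

w_head = rng.normal(size=64)                   # placeholder classifier head
toy_logit = lambda acts: acts @ w_head
print(tcav_score(rng.normal(size=(200, 64)), toy_logit, cav))
```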
    Quantitative definitions of what is unfair and what is fair have been introduced in multiple disciplines for well over 50 years, including in education, hiring, and machine learning. We trace how the notion of fairness has been defined within the testing communities of education and hiring over the past half century, exploring the cultural and social context in which different fairness definitions have emerged. In some cases, earlier definitions of fairness are similar or identical to definitions of fairness in current machine learning research, and foreshadow current formal work. In other cases, insights into what fairness means and how to measure it have largely gone overlooked. We compare past and current notions of fairness along several dimensions, including the fairness criteria, the focus of the criteria (e.g., a test, a model, or its use), the relationship of fairness to individuals, groups, and subgroups, and the mathematical method for measuring fairness (e.g., classification, regression). This work points the way towards future research and measurement of (un)fairness that builds from our modern understanding of fairness while incorporating insights from the past.
    Persons with disabilities face many barriers to participation in society, and the rapid advancement of technology creates ever more. Achieving fair opportunity and justice for people with disabilities demands paying attention not just to accessibility, but also to the attitudes towards, and representations of, disability that are implicit in machine learning (ML) models that are pervasive in how one engages with society. However, such models often inadvertently learn to perpetuate undesirable social biases from the data on which they are trained. This can result, for example, in models for classifying text producing very different predictions for "I stand by a person with mental illness" and "I stand by a tall person". We present evidence of such social biases in existing ML models, along with an analysis of biases in a dataset used for model development.
    Detecting Bias with Generative Counterfactual Face Attribute Augmentation
    Margaret Mitchell
    Timnit Gebru
    Fairness, Accountability, Transparency and Ethics in Computer Vision Workshop (in conjunction with CVPR) (2019)
    We introduce a simple framework for identifying biases of a smiling attribute classifier. Our method poses counterfactual questions of the form: how would the prediction change if this face characteristic had been different? We leverage recent advances in generative adversarial networks to build a realistic generative model of faces that affords controlled manipulation of specific facial characteristics. Empirically, we identify several different factors of variation (that we believe should be independent of smiling) that affect the predictions of a smiling classifier trained on CelebA.
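A hedged sketch of the counterfactual test described above, assuming hypothetical `generator`, `smile_classifier`, and `attribute_direction` objects standing in for a trained GAN and classifier; it is illustrative only, not the paper's implementation.

```python
# Sketch of counterfactual attribute testing: move a generated face's latent
# code along an attribute direction that should be irrelevant to smiling, and
# check whether the smile classifier's prediction changes.
import numpy as np

def counterfactual_flip_rate(generator, smile_classifier, attribute_direction,
                             n_samples=100, strength=2.0, latent_dim=512, seed=0):
    """Fraction of samples whose smile prediction flips under the manipulation.
    A high rate suggests the classifier relies on the manipulated attribute."""
    rng = np.random.default_rng(seed)
    flips = 0
    for _ in range(n_samples):
        z = rng.normal(size=latent_dim)
        original = smile_classifier(generator(z))
        counterfactual = smile_classifier(generator(z + strength * attribute_direction))
        flips += int((original > 0.5) != (counterfactual > 0.5))
    return flips / n_samples
```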
    Using the web for language independent spellchecking and autocorrection
    Casey Whitelaw
    Grace Y. Chung
    Gerard Ellis
    EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Morristown, NJ, USA, pp. 890-899