Ben Hutchinson

Ben Hutchinson is a Research Scientist at Google Research, in the Responsible AI and Human-Centered Technology team. His research draws on a range of disciplines to inform the ethical development of AI. Prior to joining Google Research, he spent ten years working on a variety of products, including Google Wave, Google Maps, Knowledge Graph, Google Search, and Social Impact. He now uses this experience to work closely with product teams as a consultant on responsible practices and the development of responsible datasets and machine learning models. He has a PhD in Natural Language Processing from the University of Edinburgh, and undergraduate degrees in linguistics and mathematics.
Authored Publications
    In January 2025, over forty Aboriginal and Torres Strait Islander researchers, practitioners, community members, and allies gathered at the Centre for Global Indigenous Futures at the Wallumattagal Campus of Macquarie University in Sydney to envisage Aboriginal and Torres Strait Islander AI futures. This publication reports on attendees' vision for the future of AI for Aboriginal and Torres Strait Islander people.
    Settler colonialism has led to ancestral language endangerment and extinction on a mass scale. It has also forced 'global' languages such as English on Indigenous communities worldwide. In Australia, post-contact languages, including creoles and local varieties of international languages, emerged as a result of forced contact with English speakers. These contact varieties are widely used, but to date they have been poorly supported by language technologies. This oversight presents barriers to participation in civil and economic society for Indigenous communities using these languages. It also reproduces the minoritisation of contemporary Indigenous sociolinguistic identities. This paper concerns the question of whether (and, if so, how) Indigenous people may be supported by technologies for their non-ancestral languages. We argue that multiple real-world opportunities exist, and explore this position through a case study of a project which aims to improve Automated Speech Recognition for Australian Aboriginal English. We discuss how we integrated culturally appropriate processes into the project. We call for increased support for languages used by Indigenous communities, including contact varieties, which provides practical economic and socio-cultural benefits.
    Indigenous languages are historically under-served by Natural Language Processing (NLP) technologies, but this is changing for some languages with the recent scaling of large multilingual models and an increased focus by the NLP community on endangered languages. This position paper explores ethical considerations in building NLP technologies for Indigenous languages, based on the premise that such projects should primarily serve Indigenous communities. We report on interviews with 17 researchers working in or with Aboriginal and/or Torres Strait Islander communities on language technology projects in Australia. Drawing on insights from the interviews, we recommend practices for NLP researchers to increase attention to the process of engagement with Indigenous communities, rather than focusing only on decontextualised artefacts.
    Socially Responsible Data for Large Multilingual Language Models
    Zara Wudiri
    Mbangula Lameck Amugongo
    Alex
    Stanley Uwakwe
    João Sedoc
    Edem Wornyo
    Seyi Olojo
    Amber Ebinama
    Suzanne Dikker
    2024
    Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in language inclusivity in LLMs, and various efforts are striving for models to accommodate language communities outside of the Global North, including many languages that have been historically underrepresented digitally. These languages have been termed "low-resource languages" or "long-tail languages", and LLM performance on them is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data from languages spoken by previously colonized people, Indigenous people, and non-Western languages raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship, and our own work, and outlines several relevant social, cultural, and ethical considerations and potential ways to mitigate them through qualitative research, community partnerships, and participatory design approaches. We provide twelve recommendations for consideration when collecting language data on underrepresented language communities outside of the Global North.
    As Generative AI (GenAI) systems increasingly enter our daily lives, reshaping social norms and practices, we must examine the norms and practices we use to evaluate the systems themselves. Recent scholarship has started to make explicit the normative dimensions of Machine Learning (ML) development and evaluation. Birhane et al. (2022) demonstrate that particular normative values are encoded in ML practice. Hutchinson et al. (2022), in a review of ML evaluation practices, identify several commitments implicit in the way ML models are evaluated. These include a commitment to consequentialism, the assumptions that evaluations can be undertaken acontextually and that model inputs need only play a limited role during model evaluation, and the expectations that impacts can be quantified and that ML failure modes are commensurable. In this provocation, we extend this line of inquiry by arguing two points: we need to attend to the implicit assumptions and values reflected in how societal impacts are conceptualised and constructed through ML evaluations; and doing so reveals that many of the problems that societal impact evaluations attempt to address would be better conceptualised as governance issues, rather than evaluation issues.
    Questions regarding implicitness, ambiguity and underspecification are crucial for multimodal image+text systems, but have received little attention to date. This paper maps out a conceptual framework to address this gap for systems which generate images from text inputs, specifically systems which generate images depicting scenes from descriptions of those scenes. In doing so, we account for how texts and images convey different forms of meaning. We then outline a set of core challenges concerning textual and visual ambiguity and specificity, as well as risks that may arise from improper handling of ambiguous and underspecified elements. We propose and discuss two strategies for addressing these challenges: a) generating a visually ambiguous output image, and b) generating a set of diverse output images.
    In order to build trust that a machine learned model is appropriate and responsible within a systems context involving technical and human components, a broad range of factors typically need to be considered. However, in practice model evaluations frequently focus on only a narrow range of expected predictive behaviours. This paper examines the critical evaluation gap between the idealized breadth of concerns and the observed narrow focus of actual evaluations. In doing so, we demonstrate which values are centered, and which are marginalized, within the machine learning community. Through an empirical study of machine learning papers from recent high-profile conferences, we demonstrate the discipline's general focus on a small set of evaluation methods. By considering the mathematical formulations of evaluation metrics and the test datasets over which they are calculated, we draw attention to which properties of models are centered in the field. This analysis also reveals an important gap: the properties of models which are frequently neglected or sidelined during evaluation. By studying the structure of this gap, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the irrelevance of non-predictive features, and the equivalence of different failure modes. Shedding light on these assumptions and commitments enables us to question their appropriateness for different ML system contexts, and points the way towards more diverse and contextualized evaluation methodologies which can be used to more robustly examine the trustworthiness of ML models.
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Jacob Austin
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Katherine Lee
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    Kathy Meier-Hellstern
    arXiv:2204.02311 (2022)
    Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.