Katherine Heller


Katherine is a research scientist in Responsible AI at Google Research and a member of the Context in AI Research (CAIR) team. She works on Machine Learning (ML) research in Healthcare, Vision, Language, and Creativity, focusing on incorporating values for Transparency, Inclusivity, Fairness, and Robustness in our research. Prior to joining Google, she was Statistical Science faculty at Duke University, where she developed a sepsis detection system now in use at Duke University Hospital, and a nationally released iOS app which tries to complete the picture of people's Multiple Sclerosis course between clinic visits. Katherine received a BS in CS and Applied Math from SUNY Stony Brook, an MS in CS from Columbia University, and a PhD in Machine Learning from the Gatsby Computational Neuroscience Unit at UCL. She was then a postdoc on an EPSRC fellowship in Engineering at the University of Cambridge, and an NSF postdoc fellow in Brain and Cognitive Sciences at MIT.

Research Areas

Machine intelligence

Authored Publications


Understanding challenges to the validity of disaggregated evaluations for algorithmic fairness

Stephen Pfohl

Natalie Harris

Chirag Nagpal

David Madras

Vishwali Mhasawade

Olawale Salaudeen

Awa Dieng

Shannon Sequeira

Santiago Arciniegas

Lillian Sung

Nnamdi Ezeanochie

Heather Cole-Lewis

Katherine Heller

Sanmi Koyejo

Alexander D'Amour

Proceedings of the 2025 Conference on Neural Information Processing Systems (NeurIPS) (2025)

Disaggregated evaluation across subgroups is critical for assessing the fairness of machine learning models, but its uncritical use can mislead practitioners. We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of the relevant populations but reflective of real-world disparities. Furthermore, when data are not representative due to selection bias, both disaggregated evaluation and alternative approaches based on conditional independence testing may be invalid without explicit assumptions regarding the bias mechanism. We use causal graphical models to characterize fairness properties and metric stability across subgroups under different data generating processes. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift, including conditional independence testing and weighted performance estimation. These findings have broad implications for how practitioners design and interpret model assessments given the ubiquity of disaggregated evaluation.

AfriMed-QA: A Pan-African Multi-Specialty Medical Question-Answering Benchmark Dataset

Tobi Olatunji

Abraham Toluwase Owodunni

Charles Nimo

Jennifer Orisakwe

Henok Biadglign Ademtew

Chris Fourie

Foutse Yuehgoh

Jonas Kemp

Stephen Moore

Mardhiyah Sanni

Emmanuel Ayodele

Irfan Essa

Timothy Faniran

Bonaventure F. P. Dossou

Fola Omofoye

Wendy Kinara

Tassallah Abdullahi

Michael Best

Katherine Heller

Mercy Asiedu

2025

Recent advancements in large language model (LLM) performance on medical multiple-choice question (MCQ) benchmarks have stimulated significant interest from patients and healthcare providers globally. Particularly in low- and middle-income countries (LMICs) facing acute physician shortages and lack of specialists, LLMs offer a potentially scalable pathway to enhance healthcare access and reduce costs. However, LLM training data is sourced from predominantly Western text, and existing benchmarks are predominantly Western-centric, limited to MCQs, and focused on a narrow range of clinical specialties, raising concerns about their applicability in the Global South, particularly across Africa where localized medical knowledge and linguistic diversity are often underrepresented. In this work, we introduce AfriMed-QA, the first large-scale multi-specialty Pan-African medical Question-Answer (QA) dataset designed to evaluate and develop equitable and effective LLMs for African healthcare. It contains 3,000 multiple-choice professional medical exam questions with answers and rationale, 1,500 short answer questions (SAQ) with long-form answers, and 5,500 consumer queries, sourced from over 60 medical schools across 15 countries, covering 32 medical specialties. We further rigorously evaluate multiple open, closed, general, and biomedical LLMs across multiple axes including accuracy, consistency, factuality, bias, potential for harm, local geographic relevance, medical reasoning, and recall. We believe this dataset provides a valuable resource for practical application of large language models in African healthcare and enhances the geographical diversity of health-LLM benchmark datasets.

What Secrets Do Your Manifolds Hold? Understanding the Local Geometry of Generative Models

Imtiaz Humayun

Ibtihel Amara

Cristina Vasconcelos

Deepak Ramachandran

Candice Schumann

Junfeng He

Katherine Heller

Golnoosh Farnadi

Negar Rostamzadeh

Mohammad Havaei

ICLR 2025

Deep Generative Models are frequently used to learn continuous representations of complex data distributions by training on a finite number of samples. For any generative model, including pre-trained foundation models with Diffusion or Transformer architectures, generation performance can significantly vary across the learned data manifold. In this paper, we study the local geometry of the learned manifold and its relationship to generation outcomes for a wide range of generative models, including DDPM, Diffusion Transformer (DiT), and Stable Diffusion 1.4. Building on the theory of continuous piecewise-linear (CPWL) generators, we characterize the local geometry in terms of three geometric descriptors: scaling (ψ), rank (ν), and complexity/un-smoothness (δ). We provide quantitative and qualitative evidence showing that for a given latent vector, the local descriptors are indicative of post-generation aesthetics, generation diversity, and memorization by the generative model. Finally, we demonstrate that by training a reward model on the local scaling for Stable Diffusion, we can self-improve both generation aesthetics and diversity using geometry-sensitive guidance during denoising.

Development and Evaluation of ML Models for Cardiotocography Interpretation

Nicole Chiou

Nichole Young-Lin

Abdoulaye Diack

Christopher Kelly

Julie Cattiau

Tiya Tiyasirichokchai

Sanmi Koyejo

Katherine Heller

Mercy Asiedu

NPJ Women's Health (2025)

The inherent variability in the visual interpretation of cardiotocograms (CTGs) by obstetric clinical experts, both intra- and inter-observer, presents a substantial challenge in obstetric care. In response, we investigate automated CTG interpretation as a potential solution to enhance the early detection of fetal hypoxia during labor, thereby reducing unnecessary operative interventions and improving overall maternal and neonatal care. This study employs deep learning techniques to reduce the subjectivity associated with visual CTG interpretation. Our results demonstrate that employing objective cord blood pH measurements, rather than clinician-defined Apgar scores, yields more consistent and robust model performance. Additionally, through a series of ablation studies, we investigate the impact of temporal distribution shifts on the performance of these deep learning models. We examine tradeoffs between performance and fairness, specifically evaluating performance across demographic and clinical subgroups. Finally, we discuss the practical implications of our findings for the real-world deployment of such systems, emphasizing their potential utility in medical settings with limited resources.

Nteasee: A qualitative study of expert and general population perspectives on deploying AI for health in African countries

Mercy Asiedu

Iskandar Haykel

Awa Dieng

Kerrie Kauer

Florence Ofori

Tousif Ahmad

Charisma Chan

Stephen Pfohl

Katherine Heller

2024

Background: Artificial Intelligence for health has the potential to significantly change and improve healthcare. However, in most African countries, culturally and contextually attuned approaches for deploying these solutions are not well understood. To bridge this gap, we conduct a qualitative study to investigate the best practices, fairness indicators and potential biases to mitigate when deploying AI for health in African countries, as well as explore opportunities where artificial intelligence could make a positive impact in health. Methods: We use a mixed methods approach combining in-depth interviews (IDIs) and surveys. We conduct 1.5-2 hour IDIs with 50 experts in health, policy and AI across 17 countries, and through an inductive approach we conduct a qualitative thematic analysis on expert IDI responses. We administer a blinded 30-minute survey with thought-cases to 672 general population participants across 5 countries in Africa (Ghana, South Africa, Rwanda, Kenya and Nigeria), and analyze responses on quantitative scales, statistically comparing responses by country, age, gender, and level of familiarity with AI. We thematically summarize open-ended responses from surveys. Results and Conclusion: Our results show generally positive attitudes and high levels of trust, accompanied by moderate levels of concern, among general population participants regarding AI usage for health in Africa. This contrasts with expert responses, where major themes revolved around trust/mistrust, AI ethics concerns, and systemic barriers to overcome, among others. This work presents the first-of-its-kind qualitative research study of the potential of AI for health in Africa with perspectives from both experts and the general population. We hope that this work guides policy makers and drives home the need for education and the inclusion of general population perspectives in decision-making around AI usage.

TRINDs: Assessing the Diagnostic Capabilities of Large Language Models for Tropical and Infectious Diseases

Mercy Asiedu

Nenad Tomašev

Chintan Ghate

Tiya Tiyasirichokchai

Awa Dieng

Steve Adudans

Oluwatosin Akande

Sylvanus Aitkins

Geoffrey Siwo

Lynda Osadebe

Eric Ndombi

Katherine Heller

2024

Neglected tropical diseases (NTDs) and infectious diseases disproportionately affect the poorest regions of the world. While large language models (LLMs) have shown promise for medical question answering, there is limited work focused on tropical and infectious disease-specific explorations. We introduce TRINDs, a dataset of 52 tropical and infectious diseases with demographic and semantic clinical and consumer augmentations. We evaluate various context and counterfactual locations to understand their influence on LLM performance. Results show that LLMs perform best when provided with contextual information such as demographics, location, and symptoms. We also develop TRINDs-LM, a tool that enables users to enter symptoms and contextual information to receive a most likely diagnosis. In addition to the LLM evaluations, we also conducted a human expert baseline study to assess the accuracy of human experts in diagnosing tropical and infectious diseases with 7 medical and public health experts. This work demonstrates methods for creating and evaluating datasets for testing and optimizing LLMs, and the use of a tool that could improve digital diagnosis and tracking of NTDs.

The Case for Globalizing Fairness: A Mixed Methods Study on the Perceptions of Colonialism, AI and Health in Africa

Mercy Asiedu

Awa Dieng

Iskandar Haykel

Negar Rostamzadeh

Stephen Pfohl

Chirag Nagpal

Aisha Walcott-Bryant

Sanmi Koyejo

Katherine Heller

2024

With growing machine learning (ML) and large language model applications in healthcare, there have been calls for fairness in ML to understand and mitigate ethical concerns these systems may pose. Fairness has implications for health in Africa, which already has inequitable power imbalances between the Global North and South. This paper seeks to explore fairness for global health, with Africa as a case study. We conduct a scoping review to propose fairness attributes for consideration in the African context and delineate where they may come into play in different ML-enabled medical modalities. We then conduct qualitative research studies with 625 general population study participants in 5 countries in Africa and 28 experts in ML, Health, and/or policy focussed on Africa to obtain feedback on the proposed attributes. We delve specifically into understanding the interplay between AI, health and colonialism. Our findings demonstrate that among experts there is a general mistrust that technologies that are solely developed by former colonizers can benefit Africans, and that associated resource constraints due to pre-existing economic and infrastructure inequities can be linked to colonialism. General population survey responses found that, on average, about 40% of people associate an undercurrent of colonialism with AI, and this was most dominant amongst participants from South Africa. However, the majority of the general population participants surveyed did not think there was a direct link between AI and colonialism. Colonial history, country of origin, and national income level were specific axes of disparities that participants felt would cause an AI tool to be biased. This work serves as a basis for policy development around Artificial Intelligence for health in Africa and can be expanded to other regions.

TRINDs: Assessing the Diagnostic Capabilities of Large Language Models for Tropical and Infectious Diseases

Mercy Asiedu

Nenad Tomašev

Chintan Ghate

Tiya Tiyasirichokchai

Awa Dieng

Oluwatosin Akande

Geoffrey Siwo

Steve Adudans

Sylvanus Aitkins

Lynda Osadebe

Eric Ndombi

Odianosen Ehiakhamen

Katherine Heller

2024

Neglected tropical diseases (NTDs) and infectious diseases disproportionately affect the poorest regions of the world. While large language models (LLMs) have shown promise for medical question answering, there is limited work focused on tropical and infectious disease-specific explorations. We introduce TRINDs, a dataset of 52 tropical and infectious diseases with demographic and semantic clinical and consumer augmentations. We evaluate various context and counterfactual locations to understand their influence on LLM performance. Results show that LLMs perform best when provided with contextual information such as demographics, location, and symptoms. We also develop TRINDs-LM, a tool that enables users to enter symptoms and contextual information to receive a most likely diagnosis. In addition to the LLM evaluations, we also conducted a human expert baseline study to assess the accuracy of human experts in diagnosing tropical and infectious diseases with 7 medical and public health experts. This work demonstrates methods for creating and evaluating datasets for testing and optimizing LLMs, and the use of a tool that could improve digital diagnosis and tracking of NTDs.

A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models

Stephen Pfohl

Heather Cole-Lewis

Rory Sayres

Darlene Neal

Mercy Asiedu

Awa Dieng

Nenad Tomašev

Qazi Mamunur Rashid

Shekoofeh Azizi

Negar Rostamzadeh

Liam McCoy

Leo Anthony Celi

Yun Liu

Mike Schaekermann

Alanna Walton

Alicia Parrish

Chirag Nagpal

Preeti Singh

Akeiylah DeWitt

Philip Mansfield

Sushant Prakash

Katherine Heller

Alan Karthikesalingam

Christopher Semturs

Joelle Barral

Greg Corrado

Yossi Matias

Jamila Smith-Loud

Ivor Horn

Karan Singhal

Nature Medicine (2024)

Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.
