Large Language Models Encode Clinical Knowledge

Karan Singhal; Shekoofeh Azizi; Tao Tu; Sara Mahdavi; Jason Wei; Hyung Won Chung; Nathan Scales; Ajay Tanwani; Heather Cole-Lewis; Perry Payne; Stephen Pfohl; Martin Seneviratne; Paul Gamble; Christopher Kelly; Abubakr Abdelrazig Hassan Babiker; Nathanael Schaerli; Aakanksha Chowdhery; Philip Mansfield; Dina Demner-Fushman; Blaise Aguera-Arcas; Dale Webster; Greg Corrado; Yossi Matias; Katherine Chou; Juraj Gottweis; Nenad Tomašev; Yun Liu; Alvin Rajkomar; Joelle Barral; Christopher Semturs; Alan Karthikesalingam; Vivek Natarajan

Nature (2023)

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries, and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To address these gaps, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Research Areas

Machine intelligence
