AfriMed-QA: A Pan-African Multi-Specialty Medical Question-Answering Benchmark Dataset

Tobi Olatunji
Abraham Toluwase Owodunni
Charles Nimo
Jennifer Orisakwe
Henok Biadglign Ademtew
Chris Fourie
Foutse Yuehgoh
Stephen Moore
Mardhiyah Sanni
Emmanuel Ayodele
Timothy Faniran
Bonaventure F. P. Dossou
Fola Omofoye
Wendy Kinara
Tassallah Abdullahi
Michael Best
2025

Abstract

Recent advancements in large language model (LLM) performance on medical multiple-choice question (MCQ) benchmarks have stimulated significant interest from patients and healthcare providers globally. Particularly in low- and middle-income countries (LMICs) facing acute physician shortages and a lack of specialists, LLMs offer a potentially scalable pathway to enhance healthcare access and reduce costs. However, LLM training data is sourced predominantly from Western text, and existing benchmarks are Western-centric, limited to MCQs, and focused on a narrow range of clinical specialties. This raises concerns about their applicability in the Global South, particularly across Africa, where localized medical knowledge and linguistic diversity are often underrepresented. In this work, we introduce AfriMed-QA, the first large-scale multi-specialty Pan-African medical Question-Answering (QA) dataset designed to evaluate and develop equitable and effective LLMs for African healthcare. It contains 3,000 multiple-choice professional medical exam questions with answers and rationales, 1,500 short-answer questions (SAQs) with long-form answers, and 5,500 consumer queries, sourced from over 60 medical schools across 15 countries and covering 32 medical specialties. We further rigorously evaluate multiple open, closed, general, and biomedical LLMs across several axes, including accuracy, consistency, factuality, bias, potential for harm, local geographic relevance, medical reasoning, and recall. We believe this dataset provides a valuable resource for the practical application of LLMs in African healthcare and enhances the geographical diversity of health-LLM benchmark datasets.