Measuring clinician-machine agreement in differential diagnoses for dermatology

Clara Eng
Rajiv Bhatnagar
British Journal of Dermatology (2019)

Abstract

Artificial intelligence (AI) algorithms have generated significant interest as a tool to assist in clinical workflows, particularly in image-based diagnostics such as melanoma detection. These algorithms typically answer narrowly scoped questions, such as ‘Is this lesion malignant?’ By contrast, dermatologists frequently tackle less structured diagnostic questions, such as ‘What is this rash?’ In practice, evaluating clinical cases often involves integrating insights from morphology, context and history to determine a rank-ordered list of possible diagnoses, i.e. a differential diagnosis rather than a binary ‘yes’ or ‘no’ answer. An AI algorithm could aid a less experienced clinician by providing its own differential diagnosis, which may highlight potential diagnoses that have not been considered, and thereby help the clinician decide between additional evaluation and empiric treatment. AI-generated differential diagnoses could also be used to help rapidly triage cases, allowing cases with higher suspicion for dangerous entities such as melanoma to be seen first. However, in addition to the inherent laboriousness of ‘labelling’ cases with a differential instead of a single diagnosis, developing an AI algorithm to generate a differential raises a more fundamental conundrum: given a reference standard differential diagnosis from an experienced dermatologist, how do we evaluate the ‘correctness’ of the AI’s differential diagnosis?
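One simple way to frame the evaluation question above is as agreement between two ranked lists. The sketch below is a minimal, hypothetical illustration (not a metric proposed in this article): it measures the fraction of the reference dermatologist's top-k diagnoses that appear anywhere in the AI's top-k list, ignoring order within the top k. The function name, diagnosis strings and choice of k are all assumptions for illustration.

```python
def top_k_inclusion(ai_ddx, reference_ddx, k=3):
    """Fraction of the reference differential's top-k diagnoses that
    appear anywhere in the AI's top-k list (order within k ignored).

    ai_ddx, reference_ddx: lists of diagnosis labels, most likely first.
    Returns a float in [0, 1]; higher means closer agreement.
    """
    ai_top = set(ai_ddx[:k])        # AI's top-k candidates, as a set
    ref_top = reference_ddx[:k]     # reference standard's top-k
    if not ref_top:
        return 0.0
    return sum(d in ai_top for d in ref_top) / len(ref_top)


# Hypothetical example: two of the reference's top three diagnoses
# also appear in the AI's top three, giving agreement of 2/3.
ai = ["eczema", "psoriasis", "tinea corporis"]
ref = ["psoriasis", "eczema", "lichen planus"]
score = top_k_inclusion(ai, ref, k=3)
```

A set-overlap metric like this deliberately discards rank order within the top k; rank-sensitive alternatives (e.g. weighting a match at position 1 more than a match at position 3) trade simplicity for a stricter notion of agreement, which is part of the conundrum the abstract raises.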