Evaluation of artificial intelligence on a reference standard based on subjective interpretation

The Lancet Digital Health (2021)


Rapid progress has been made in artificial intelligence (AI) models for medical applications, especially over the past 5 years, with substantial efforts focusing on diagnosis from medical images. An essential aspect of evaluating the performance of AI models and their potential clinical utility is the rigour of the reference standard. A reference standard is “the best available method for establishing the presence or absence of the target condition”, and is thus equivalent to what is commonly referred to as the ground truth in AI literature. What constitutes a reference standard is established by “opinion and practice within the medical, laboratory, and regulatory community”. The reference standard can either be a widely agreed-upon gold standard or, in its absence, a proxy that is highly correlated with the clinical outcome. A non-reference standard can also be used, but in that case correctness claims such as accuracy, sensitivity, and specificity should be dropped in favour of reporting agreement with a comparative method.
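The distinction between correctness and agreement can be made concrete with a minimal sketch. The arithmetic on the 2×2 table is identical in both cases; only the labels attached to the results change with the status of the comparator. The function names, the toy labels, and the terms positive, negative, and overall percent agreement are assumptions introduced here for illustration and are not taken from the text above.

```python
from typing import Sequence


def two_by_two(test: Sequence[int], comparator: Sequence[int]) -> tuple[int, int, int, int]:
    """Count the cells of a 2x2 table for binary labels (1 = positive, 0 = negative).

    Returns (both_positive, test_only_positive, comparator_only_positive, both_negative).
    """
    if len(test) != len(comparator):
        raise ValueError("test and comparator must have the same length")
    both_pos = sum(1 for t, c in zip(test, comparator) if t == 1 and c == 1)
    test_only = sum(1 for t, c in zip(test, comparator) if t == 1 and c == 0)
    comp_only = sum(1 for t, c in zip(test, comparator) if t == 0 and c == 1)
    both_neg = sum(1 for t, c in zip(test, comparator) if t == 0 and c == 0)
    return both_pos, test_only, comp_only, both_neg


def summarise(test: Sequence[int], comparator: Sequence[int],
              comparator_is_reference: bool) -> dict[str, float]:
    """Report correctness metrics only when the comparator is a reference standard.

    Otherwise the same ratios are reported as agreement with the comparative method.
    """
    both_pos, test_only, comp_only, both_neg = two_by_two(test, comparator)
    comp_pos = both_pos + comp_only   # cases the comparator calls positive
    comp_neg = test_only + both_neg   # cases the comparator calls negative
    if comp_pos == 0 or comp_neg == 0:
        raise ValueError("comparator must contain both positive and negative cases")
    pos_rate = both_pos / comp_pos
    neg_rate = both_neg / comp_neg
    overall = (both_pos + both_neg) / (comp_pos + comp_neg)
    if comparator_is_reference:
        return {"sensitivity": pos_rate, "specificity": neg_rate, "accuracy": overall}
    return {
        "positive_percent_agreement": pos_rate,
        "negative_percent_agreement": neg_rate,
        "overall_percent_agreement": overall,
    }


# Hypothetical toy labels: AI model output versus a comparator's reads.
ai_output = [1, 1, 0, 0, 1, 0, 1, 0]
comparator_reads = [1, 0, 0, 0, 1, 1, 1, 0]
print(summarise(ai_output, comparator_reads, comparator_is_reference=False))
```

Calling `summarise` with `comparator_is_reference=True` returns the same numbers under the names sensitivity, specificity, and accuracy, which is appropriate only when the comparator qualifies as a reference standard.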

