Google Research

Evaluation of artificial intelligence on a reference standard based on subjective interpretation

  • Cameron Chen
  • Craig Mermel
  • Yun Liu
The Lancet Digital Health (2021)


Rapid progress has been made in artificial intelligence (AI) models for medical applications, especially over the past 5 years, with substantial efforts focusing on diagnosis from medical images. An essential aspect of evaluating the performance of AI models and their potential clinical utility is the rigor of the reference standard. A reference standard is “the best available method for establishing the presence or absence of the target condition”, and is thus equivalent to what is commonly referred to as the ground truth in AI literature. Determination of what constitutes a reference standard is established by “opinion and practice within the medical, laboratory, and regulatory community”. The reference standard can either be a widely agreed-upon gold standard or, in its absence, a proxy that is highly correlated with the clinical outcome. Although a non-reference standard can also be used, correctness claims such as accuracy, sensitivity, and specificity should be dropped in favour of agreement with a comparative method.
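The distinction in the final sentence is about how results are named, not how they are computed. The sketch below is a minimal illustration and is not taken from the paper: the 2x2 counts are invented, the names `Counts` and `report` are hypothetical, and the labels "positive/negative/overall percent agreement" are assumed here as the conventional wording for comparison against a non-reference standard. The same ratios are reported under correctness labels only when the comparator is a reference standard.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """2x2 table of AI model output versus a comparator (illustrative)."""
    both_positive: int             # model positive, comparator positive
    model_only_positive: int       # model positive, comparator negative
    comparator_only_positive: int  # model negative, comparator positive
    both_negative: int             # model negative, comparator negative


def report(c: Counts, comparator_is_reference_standard: bool) -> dict:
    """Return the same three ratios under labels that match the comparator."""
    comparator_pos = c.both_positive + c.comparator_only_positive
    comparator_neg = c.both_negative + c.model_only_positive
    total = comparator_pos + comparator_neg
    if comparator_pos == 0 or comparator_neg == 0:
        raise ValueError("comparator must label at least one positive and one negative case")

    positive_rate = c.both_positive / comparator_pos
    negative_rate = c.both_negative / comparator_neg
    overall_rate = (c.both_positive + c.both_negative) / total

    if comparator_is_reference_standard:
        # Correctness claims are appropriate against a reference standard.
        labels = ("sensitivity", "specificity", "accuracy")
    else:
        # Against a non-reference standard, report agreement only.
        labels = (
            "positive_percent_agreement",
            "negative_percent_agreement",
            "overall_percent_agreement",
        )
    return dict(zip(labels, (positive_rate, negative_rate, overall_rate)))


# Invented counts, for illustration only.
example = Counts(both_positive=40, model_only_positive=5,
                 comparator_only_positive=10, both_negative=45)
against_reference = report(example, comparator_is_reference_standard=True)
against_comparator = report(example, comparator_is_reference_standard=False)
```

With these counts both calls return the values 0.8, 0.9, and 0.85; only the keys differ, which is the point of the recommendation.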
