Dr Christopher Kelly

Research Scientist in the Google Health Research team in London, working on medical imaging research.
Authored Publications
    Large Language Models Encode Clinical Knowledge
    Karan Singhal
    Sara Mahdavi
    Jason Wei
    Hyung Won Chung
    Nathan Scales
    Ajay Tanwani
    Heather Cole-Lewis
    Perry Payne
    Martin Seneviratne
    Paul Gamble
    Abubakr Abdelrazig Hassan Babiker
    Nathanael Schaerli
    Aakanksha Chowdhery
    Philip Mansfield
    Dina Demner-Fushman
    Katherine Chou
    Juraj Gottweis
    Nenad Tomašev
    Alvin Rajkomar
    Joelle Barral
    Nature (2023)
    Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
    Stakeholder Perspectives of Clinical Artificial Intelligence Implementation: Systematic Review of Qualitative Evidence
    Jeffry Hogg
    Mohaimen Al-Zubaidy
    S. James Talks
    Alastair Denniston
    Johann Malawana
    Chrysanthi Papoutsi
    Dawn Teare
    Pearse Keane
    Fiona R. Beyer
    Gregory Maniatopoulos
    JMIR, vol. 23 (2023)
    BACKGROUND: The rhetoric surrounding clinical artificial intelligence (AI) often exaggerates its effect on real-world care. Limited understanding of the factors that influence its implementation can perpetuate this. OBJECTIVE: In this qualitative systematic review, we aimed to identify key stakeholders, consolidate their perspectives on clinical AI implementation, and characterize the evidence gaps that future qualitative research should target. METHODS: Ovid-MEDLINE, EBSCO-CINAHL, ACM Digital Library, Science Citation Index-Web of Science, and Scopus were searched for primary qualitative studies on individuals’ perspectives on any application of clinical AI worldwide (January 2014-April 2021). The definition of clinical AI includes both rule-based and machine learning–enabled or non–rule-based decision support tools. The language of the reports was not an exclusion criterion. Two independent reviewers performed title, abstract, and full-text screening with a third arbiter of disagreement. Two reviewers assigned each study a score on the Joanna Briggs Institute 10-point checklist for qualitative research. A single reviewer extracted free-text data relevant to clinical AI implementation, noting the stakeholders contributing to each excerpt. The best-fit framework synthesis used the Nonadoption, Abandonment, Scale-up, Spread, and Sustainability (NASSS) framework. To validate the data and improve accessibility, coauthors representing each emergent stakeholder group codeveloped summaries of the factors most relevant to their respective groups. RESULTS: The initial search yielded 4437 deduplicated articles, with 111 (2.5%) eligible for inclusion (median Joanna Briggs Institute 10-point checklist for qualitative research score, 8/10). Five distinct stakeholder groups emerged from the data: health care professionals (HCPs); patients, carers, and other members of the public; developers; health care managers and leaders; and regulators or policy makers, contributing 1204 (70%), 196 (11.4%), 133 (7.7%), 129 (7.5%), and 59 (3.4%) of 1721 eligible excerpts, respectively. All stakeholder groups independently identified a breadth of implementation factors, with each producing data that were mapped between 17 and 24 of the 27 adapted Nonadoption, Abandonment, Scale-up, Spread, and Sustainability subdomains. Most of the factors that stakeholders found influential in the implementation of rule-based clinical AI also applied to non–rule-based clinical AI, with the exception of intellectual property, regulation, and sociocultural attitudes. CONCLUSIONS: Clinical AI implementation is influenced by many interdependent factors, which are in turn influenced by at least 5 distinct stakeholder groups. This implies that effective research and practice of clinical AI implementation should consider multiple stakeholder perspectives. The current underrepresentation of perspectives from stakeholders other than HCPs in the literature may limit the anticipation and management of the factors that influence successful clinical AI implementation. Future research should not only widen the representation of tools and contexts in qualitative research but also specifically investigate the perspectives of all stakeholder groups and the emerging aspects of non–rule-based clinical AI implementation.
    Enhancing diagnostic accuracy of medical AI systems via selective deferral to clinicians
    Dj Dvijotham
    Melih Barsbey
    Sumedh Ghaisas
    Robert Stanforth
    Nick Pawlowski
    Patricia Strachan
    Zahra Ahmed
    Yoram Bachrach
    Laura Culp
    Mayank Daswani
    Jan Freyberg
    Atilla Kiraly
    Timo Kohlberger
    Scott Mayer McKinney
    Basil Mustafa
    Krzysztof Geras
    Jan Witowski
    Zhi Zhen Qin
    Jacob Creswell
    Shravya Shetty
    Terry Spitz
    Taylan Cemgil
    Nature Medicine (2023)
    AI systems trained using deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings1,2. While these results are impressive, they do not accurately reflect the impact of deployment of such systems in a clinical context. Due to the safety-critical nature of this domain and the fact that AI systems are not perfect and can make inaccurate assessments, they are predominantly deployed as assistive tools for clinical experts3. Although clinicians routinely discuss the diagnostic nuances of medical images with each other, weighing human diagnostic confidence against that of an AI system remains a major unsolved barrier to collaborative decision-making4. Furthermore, it has been observed that diagnostic AI models have complementary strengths and weaknesses compared to clinical experts. Yet, complementarity and the assessment of relative confidence between the members of a diagnostic team have remained largely unexploited in how AI systems are currently used in medical settings5. In this paper, we study the behavior of a team composed of diagnostic AI model(s) and clinician(s) in diagnosing disease. To go beyond the performance level of a standalone AI system, we develop a novel selective deferral algorithm that can learn to decide when to rely on a diagnostic AI model and when to defer to a clinical expert. Using this algorithm, we demonstrate that the composite AI+human system has enhanced accuracy (both sensitivity and specificity) relative to a human-only or an AI-only baseline. We decouple the development of the deferral AI model from the training of the underlying diagnostic AI model(s). Development of the deferral AI model requires only i) the predictions of the model(s) on a tuning set of medical images (separate from the diagnostic AI models’ training data), ii) the diagnoses made by clinicians on these images, and iii) the ground truth disease labels corresponding to those images. Our extensive analysis shows that the selective deferral (SD) system exceeds the performance of either clinicians or AI alone in multiple clinical settings: breast and lung cancer screening. For breast cancer screening, double-reading with arbitration (two readers interpreting each mammogram, invoking an arbitrator if needed) is a “gold standard” for performance, never previously exceeded using AI6. The SD system exceeds the accuracy of double-reading with arbitration in a large representative UK screening program (25% reduction in false positives despite equivalent true-positive detection, and a 66% reduction in the requirement for clinicians to read an image), as well as exceeding the performance of a standalone state-of-the-art AI system (40% reduction in false positives with equivalent detection of true positives). In a large US dataset, the SD system exceeds the accuracy of single-reading by board-certified radiologists and of a standalone state-of-the-art AI system (32% reduction in false positives despite equivalent detection of true positives, and a 55% reduction in the clinician workload required). The SD system further outperforms both clinical experts alone and AI alone for the detection of lung cancer in low-dose computed tomography images from a large national screening study, with an 11% reduction in false positives while maintaining sensitivity, given a 93% reduction in the clinician workload required. Furthermore, the SD system allows controllable trade-offs between sensitivity and specificity and can be tuned to target either specificity or sensitivity as desired for a particular clinical application, or a combination of both. The system generalizes to multiple distribution shifts, retaining superiority to both the AI system alone and human experts alone. We demonstrate that the SD system retains performance gains even with clinicians not present in the training data for the deferral AI. Furthermore, we test the SD system on a new population where the standalone AI system’s performance significantly degrades. We showcase the few-shot adaptation capability of the SD system by demonstrating that it can obtain superiority to both the standalone AI system and the clinician on the new population after being trained on only 40 cases from the new population. Our comprehensive assessment demonstrates that a selective deferral system could significantly improve clinical outcomes in multiple medical imaging applications, paving the way for higher-performance clinical AI systems that can leverage the complementarity between clinical experts and medical AI tools.
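The core of the selective deferral idea can be illustrated with a toy sketch: a rule decides per case whether to accept the diagnostic AI's prediction or hand the case to a clinician. The function name, thresholds, and confidence-gap rule below are illustrative assumptions, not the paper's learned deferral model.

```python
# Illustrative sketch of a selective deferral rule (names and the simple
# threshold logic are assumptions; the paper learns the deferral decision).
# Confident AI scores keep the AI's prediction; uncertain cases are
# deferred to the clinician, reducing clinician workload.

def selective_deferral(ai_prob, clinician_label, threshold_low=0.1, threshold_high=0.9):
    """Return (prediction, decider) for one case.

    ai_prob: diagnostic AI's predicted probability of disease.
    clinician_label: the clinician's binary read (used only on deferral).
    """
    if ai_prob >= threshold_high:
        return 1, "ai"
    if ai_prob <= threshold_low:
        return 0, "ai"
    return clinician_label, "clinician"

# Toy evaluation: (ai_prob, clinician_read, ground_truth) per case.
cases = [(0.95, 1, 1), (0.02, 0, 0), (0.5, 1, 1), (0.6, 0, 0)]
preds = [selective_deferral(p, c) for p, c, _ in cases]
deferred = sum(1 for _, who in preds if who == "clinician")
correct = sum(1 for (pred, _), (_, _, truth) in zip(preds, cases) if pred == truth)
print(deferred, correct)  # 2 cases deferred, all 4 correct
```

Tuning the two thresholds trades clinician workload against how much the composite system leans on the AI, mirroring the controllable sensitivity/specificity trade-off described above.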
    ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders
    Shawn Xu
    Lin Yang
    Timo Kohlberger
    Martin Ma
    Atilla Kiraly
    Sahar Kazemzadeh
    Zakkai Melamed
    Jungyeon Park
    Patricia MacWilliams
    Chuck Lau
    Christina Chen
    Mozziyar Etemadi
    Sreenivasa Raju Kalidindi
    Kat Chou
    Shravya Shetty
    Daniel Golden
    Rory Pilgrim
    arxiv (2023)
    Our approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, leverages a language-aligned image encoder combined or grafted onto a fixed LLM, PaLM 2, to perform a broad range of tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.
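Zero-shot classification with language-aligned embeddings, as described above, reduces to scoring an image embedding against a text-prompt embedding per finding. The vectors and finding names below are toy assumptions; ELIXR's real embeddings come from its trained image encoder and language model.

```python
import math

# Toy sketch of zero-shot classification with language-aligned image
# embeddings (vectors here are made up for illustration).

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, prompt_embs):
    """Score an image against one text-prompt embedding per finding
    and return (best_finding, all_scores)."""
    scores = {finding: cosine(image_emb, emb) for finding, emb in prompt_embs.items()}
    return max(scores, key=scores.get), scores

prompt_embs = {
    "cardiomegaly": [0.9, 0.1, 0.0],
    "pleural effusion": [0.1, 0.9, 0.2],
    "no finding": [0.0, 0.1, 0.9],
}
image_emb = [0.2, 0.85, 0.3]  # toy embedding of a chest X-ray
best, scores = zero_shot_classify(image_emb, prompt_embs)
print(best)  # pleural effusion
```

The same similarity scores, ranked per query, are what an NDCG-style semantic-search evaluation would operate on.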
    Validation and Clinical Applicability of Whole-Volume Automated Segmentation of Optical Coherence Tomography in Retinal Disease Using Deep Learning
    Marc Wilson
    Reena Chopra
    Megan Zoë Wilson
    Charlotte Cooper
    Patricia MacWilliams
    Daniela Florea
    Cían Hughes
    Hagar Khalid
    Sandra Vermeirsch
    Luke Nicholson
    Pearse Keane
    Konstantinos Balaskas
    JAMA Ophthalmology (2021)
    Importance Quantitative volumetric measures of retinal disease in optical coherence tomography (OCT) scans are infeasible to perform owing to the time required for manual grading. Expert-level deep learning systems for automatic OCT segmentation have recently been developed. However, the potential clinical applicability of these systems is largely unknown. Objective To evaluate a deep learning model for whole-volume segmentation of 4 clinically important pathological features and assess clinical applicability. Design, Setting, and Participants This diagnostic study used OCT data from 173 patients with a total of 15 558 B-scans, treated at Moorfields Eye Hospital. The data set included 2 common OCT devices and 2 macular conditions: wet age-related macular degeneration (107 scans) and diabetic macular edema (66 scans), covering the full range of severity, and from 3 points during treatment. Two expert graders performed pixel-level segmentations of intraretinal fluid, subretinal fluid, subretinal hyperreflective material, and pigment epithelial detachment, including all B-scans in each OCT volume, taking as long as 50 hours per scan. Quantitative evaluation of whole-volume model segmentations was performed. Qualitative evaluation of clinical applicability by 3 retinal experts was also conducted. Data were collected from June 1, 2012, to January 31, 2017, for set 1 and from January 1 to December 31, 2017, for set 2; graded between November 2018 and January 2020; and analyzed from February 2020 to November 2020. Main Outcomes and Measures Rating and stack ranking for clinical applicability by retinal specialists, model-grader agreement for voxelwise segmentations, and total volume evaluated using Dice similarity coefficients, Bland-Altman plots, and intraclass correlation coefficients. Results Among the 173 patients included in the analysis (92 [53%] women), qualitative assessment found that automated whole-volume segmentation ranked better than or comparable to at least 1 expert grader in 127 scans (73%; 95% CI, 66%-79%). A neutral or positive rating was given to 135 model segmentations (78%; 95% CI, 71%-84%) and 309 expert gradings (2 per scan) (89%; 95% CI, 86%-92%). The model was rated neutrally or positively in 86% to 92% of diabetic macular edema scans and 53% to 87% of age-related macular degeneration scans. Intraclass correlations ranged from 0.33 (95% CI, 0.08-0.96) to 0.96 (95% CI, 0.90-0.99). Dice similarity coefficients ranged from 0.43 (95% CI, 0.29-0.66) to 0.78 (95% CI, 0.57-0.85). Conclusions and Relevance This deep learning–based segmentation tool provided clinically useful measures of retinal disease that would otherwise be infeasible to obtain. Qualitative evaluation was additionally important to reveal clinical applicability for both care management and research.
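The Dice similarity coefficient used above to compare model and grader segmentations has a one-line definition: twice the overlap divided by the total size of the two masks. The masks below are toy 1-D voxel arrays; real OCT volumes are 3-D.

```python
# Minimal sketch of the Dice similarity coefficient on binary masks
# (toy 1-D arrays standing in for voxelwise OCT segmentations).

def dice(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks of equal length."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(mask_a) + sum(mask_b)
    return 2 * intersection / total if total else 1.0  # both empty: perfect agreement

model  = [1, 1, 0, 0, 1, 0]
grader = [1, 0, 0, 1, 1, 0]
print(dice(model, grader))  # 2*2 / (3+3) ≈ 0.667
```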
    Quantitative Analysis of OCT for Neovascular Age-Related Macular Degeneration Using Deep Learning
    Gabriella Moraes
    Dun Jack Fu
    Marc Wilson
    Hagar Khalid
    Siegfried Wagner
    Edward Korot
    Robbert Struyven
    Daniel Ferraz
    Livia Faes
    Terry Spitz
    Praveen Patel
    Konstantinos Balaskas
    Tiarnan Keenan
    Pearse Keane
    Reena Chopra
    Ophthalmology (2021)
    Purpose: To apply a deep learning algorithm for automated, objective, and comprehensive quantification of OCT scans to a large real-world dataset of eyes with neovascular age-related macular degeneration (AMD) and make the raw segmentation output data openly available for further research. Design: Retrospective analysis of OCT images from the Moorfields Eye Hospital AMD Database. Participants: A total of 2473 first-treated eyes and 493 second-treated eyes that commenced therapy for neovascular AMD between June 2012 and June 2017. Methods: A deep learning algorithm was used to segment all baseline OCT scans. Volumes were calculated for segmented features such as neurosensory retina (NSR), drusen, intraretinal fluid (IRF), subretinal fluid (SRF), subretinal hyperreflective material (SHRM), retinal pigment epithelium (RPE), hyperreflective foci (HRF), fibrovascular pigment epithelium detachment (fvPED), and serous PED (sPED). Analyses included comparisons between first- and second-treated eyes by visual acuity (VA) and race/ethnicity and correlations between volumes. Main Outcome Measures: Volumes of segmented features (mm3) and central subfield thickness (CST) (μm). Results: In first-treated eyes, the majority had both IRF and SRF (54.7%). First-treated eyes had greater volumes for all segmented tissues, with the exception of drusen, which was greater in second-treated eyes. In first-treated eyes, older age was associated with lower volumes for RPE, SRF, NSR, and sPED; in second-treated eyes, older age was associated with lower volumes of NSR, RPE, sPED, fvPED, and SRF. Eyes from Black individuals had higher SRF, RPE, and serous PED volumes compared with other ethnic groups. Greater volumes of the majority of features were associated with worse VA. Conclusions: We report the results of large-scale automated quantification of a novel range of baseline features in neovascular AMD. Major differences between first- and second-treated eyes, with increasing age, and between ethnicities are highlighted. In the coming years, enhanced, automated OCT segmentation may assist personalization of real-world care and the detection of novel structure–function correlations. These data will be made publicly available for replication and future investigation by the AMD research community.
    Artificial Intelligence in Pediatrics
    Alex Brown
    Jim Taylor
    Artificial Intelligence in Medicine, Springer (2021)
    Pediatrics is a specialty with significant promise for the application of artificial intelligence (AI) technologies, in part due to the richness of its datasets, with relatively more complete longitudinal records and often less heterogeneous patterns of disease compared to adult medicine. Despite considerable overlap with adult medicine, pediatrics presents a distinct set of clinical problems to solve. It is tempting to assume that AI tools developed for adults will easily translate to the pediatric population, when in reality this is unlikely to be the case. The challenges involved in the development of AI tools for healthcare are unfortunately exacerbated in pediatrics, and the implementation gap between how these systems are developed and the setting in which they will be deployed is a real challenge for the next decade. Robust evaluation through high-quality clinical study design and clear reporting standards will be essential. This chapter reviews recent work to develop artificial intelligence solutions in pediatrics, including developments across the cardiology, respiratory, gastroenterology, neonatology, genetics, endocrinology, ophthalmology, radiology, and pediatric intensive care specialties. We conclude that AI presents an exciting opportunity to transform aspects of pediatrics at a global scale: democratizing access to subspecialist diagnostic skills, improving the quality and efficiency of care, enabling global access to healthcare through sensor-rich Internet-connected mobile devices, and enhancing imaging acquisition to reduce radiation while improving speed and quality. The ultimate challenge will be for pediatricians to find ways to deploy these novel technologies into clinical practice in a way that is safe, effective, and equitable and that ultimately improves outcomes for children.
    Clinically Applicable Segmentation of Head and Neck Anatomy for Radiotherapy: Deep Learning Algorithm Development and Validation Study
    Stanislav Nikolov
    Sam Blackwell
    Alexei Zverovitch
    Ruheena Mendes
    Michelle Livne
    Jeffrey De Fauw
    Yojan Patel
    Clemens Meyer
    Harry Askham
    Bernardino Romera Paredes
    Carlton Chu
    Dawn Carnell
    Cheng Boon
    Derek D'Souza
    Syed Moinuddin
    Yasmin Mcquinlan
    Sarah Ireland
    Kiarna Hampton
    Krystle Fuller
    Hugh Montgomery
    Geraint Rees
    Mustafa Suleyman
    Trevor John Back
    Cían Hughes
    Olaf Ronneberger
    JMIR (2021)
    Background: Over half a million individuals are diagnosed with head and neck cancer each year globally. Radiotherapy is an important curative treatment for this disease, but it requires time-consuming manual delineation of radiosensitive organs at risk. This planning process can delay treatment while also introducing interoperator variability, resulting in downstream radiation dose differences. Although auto-segmentation algorithms offer a potentially time-saving solution, the challenges in defining, quantifying, and achieving expert performance remain. Objective: Adopting a deep learning approach, we aim to demonstrate a 3D U-Net architecture that achieves expert-level performance in delineating 21 distinct head and neck organs at risk commonly segmented in clinical practice. Methods: The model was trained on a data set of 663 deidentified computed tomography scans acquired in routine clinical practice, with both segmentations taken from clinical practice and segmentations created by experienced radiographers as part of this research, all in accordance with consensus organ at risk definitions. Results: We demonstrated the model’s clinical applicability by assessing its performance on a test set of 21 computed tomography scans from clinical practice, each with 21 organs at risk segmented by 2 independent experts. We also introduced the surface Dice similarity coefficient, a new metric for the comparison of organ delineation, to quantify the deviation between organ at risk surface contours rather than volumes, better reflecting the clinical task of correcting errors in automated organ segmentations. The model’s generalizability was then demonstrated on 2 distinct open-source data sets, reflecting centers and countries distinct from those of model training. Conclusions: Deep learning is an effective and clinically applicable technique for the segmentation of the head and neck anatomy for radiotherapy. With appropriate validation studies and regulatory approvals, this system could improve the efficiency, consistency, and safety of radiotherapy pathways.
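The surface Dice similarity coefficient introduced above measures the fraction of each contour lying within a tolerance of the other contour, rather than the overlap of the enclosed volumes. The toy 2-D point sets and brute-force distance check below are illustrative assumptions; the paper computes this on 3-D organ surfaces.

```python
# Illustrative sketch of the surface Dice similarity coefficient at a
# tolerance tol: the fraction of both surfaces' points that lie within
# tol of the other surface (toy 2-D contours; the real metric is 3-D).

def surface_dice(surf_a, surf_b, tol):
    def within(p, surf):
        # True if point p is within tol of any point on the other surface.
        return any((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= tol * tol for q in surf)
    overlap_a = sum(1 for p in surf_a if within(p, surf_b))
    overlap_b = sum(1 for q in surf_b if within(q, surf_a))
    return (overlap_a + overlap_b) / (len(surf_a) + len(surf_b))

# Two contours that agree everywhere except one point displaced by 4 voxels.
auto   = [(0, 0), (2, 0), (4, 0), (6, 0)]
expert = [(0, 0), (2, 0), (4, 0), (6, 4)]
print(surface_dice(auto, expert, tol=1.0))  # 6/8 = 0.75
```

Unlike volumetric Dice, a small boundary shift within the tolerance does not penalize the score, which matches the clinical task of checking whether a contour needs correction.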
    Predicting conversion to wet age-related macular degeneration using deep learning
    Jason Yim
    Reena Chopra
    Terry Spitz
    Annette Obika
    Harry Askham
    Marko Lukic
    Josef Huemer
    Katrin Fasler
    Gabriella Moraes
    Clemens Meyer
    Marc Wilson
    Jonathan Mark Dixon
    Cían Hughes
    Geraint Rees
    Peng Khaw
    Dominic King
    Demis Hassabis
    Mustafa Suleyman
    Trevor John Back
    Pearse Keane
    Jeffrey De Fauw
    Nature Medicine (2020)
    Progression to exudative ‘wet’ age-related macular degeneration (exAMD) is a major cause of visual deterioration. In patients diagnosed with exAMD in one eye, we introduce an artificial intelligence (AI) system to predict progression to exAMD in the second eye. By combining models based on three-dimensional (3D) optical coherence tomography images and corresponding automatic tissue maps, our system predicts conversion to exAMD within a clinically actionable 6-month time window, achieving a per-volumetric-scan sensitivity of 80% at 55% specificity, and 34% sensitivity at 90% specificity. This level of performance corresponds to true positives in 78% and 41% of individual eyes, and false positives in 56% and 17% of individual eyes at the high sensitivity and high specificity points, respectively. Moreover, we show that automatic tissue segmentation can identify anatomical changes before conversion and high-risk subgroups. This AI system overcomes substantial interobserver variability in expert predictions, performing better than five out of six experts, and demonstrates the potential of using AI to predict disease progression.
    International evaluation of an AI system for breast cancer screening
    Scott Mayer McKinney
    Varun Yatindra Godbole
    Jonathan Godwin
    Natasha Antropova
    Hutan Ashrafian
    Trevor John Back
    Mary Chesus
    Ara Darzi
    Mozziyar Etemadi
    Florencia Garcia-Vicente
    Fiona J Gilbert
    Mark D Halling-Brown
    Demis Hassabis
    Sunny Jansen
    Dominic King
    David Melnick
    Hormuz Mostofi
    Lily Hao Yi Peng
    Joshua Reicher
    Bernardino Romera Paredes
    Richard Sidebottom
    Mustafa Suleyman
    Kenneth C. Young
    Jeffrey De Fauw
    Shravya Ramesh Shetty
    Nature (2020)
    Screening mammography aims to identify breast cancer at earlier stages of the disease, when treatment can be more successful. Despite the existence of screening programmes worldwide, the interpretation of mammograms is affected by high rates of false positives and false negatives. Here we present an artificial intelligence (AI) system that is capable of surpassing human experts in breast cancer prediction. To assess its performance in the clinical setting, we curated a large representative dataset from the UK and a large enriched dataset from the USA. We show an absolute reduction of 5.7% and 1.2% (USA and UK) in false positives and 9.4% and 2.7% in false negatives. We provide evidence of the ability of the system to generalize from the UK to the USA. In an independent study of six radiologists, the AI system outperformed all of the human readers: the area under the receiver operating characteristic curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the average radiologist by an absolute margin of 11.5%. We ran a simulation in which the AI system participated in the double-reading process that is used in the UK, and found that the AI system maintained non-inferior performance and reduced the workload of the second reader by 88%. This robust assessment of the AI system paves the way for clinical trials to improve the accuracy and efficiency of breast cancer screening.
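The double-reading simulation described above can be sketched with a toy model: the AI stands in as the second reader, and a human second read is needed only when the AI disagrees with the first reader. The agreement rule and the "arbiter sides with reader 1" placeholder below are assumptions for illustration, not the study's actual simulation protocol (the 88% workload figure comes from the paper's full-scale simulation).

```python
# Toy sketch of AI-assisted double reading: the AI acts as the second
# reader; a human second read (arbitration) is invoked only on
# disagreement with the first reader.

def double_read_with_ai(first_reader, ai_reader):
    """Return (final decisions, number of human second reads needed)."""
    decisions, second_reads = [], 0
    for human, ai in zip(first_reader, ai_reader):
        if human == ai:
            decisions.append(human)   # agreement: no human second read
        else:
            second_reads += 1         # disagreement: human arbitration
            decisions.append(human)   # placeholder rule: side with reader 1
    return decisions, second_reads

first = [0, 1, 0, 0, 1, 0, 0, 1]  # first reader's recall decisions
ai    = [0, 1, 0, 1, 1, 0, 0, 1]  # AI second reader's decisions
decisions, second_reads = double_read_with_ai(first, ai)
print(second_reads)  # only 1 of 8 cases needs a human second read
```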
    Key challenges for delivering clinical impact with artificial intelligence
    BMC Medicine (2019)
    Background: Artificial intelligence (AI) research in healthcare is accelerating rapidly, with potential applications being demonstrated across many different domains of medicine. However, there are currently limited examples of such techniques being successfully deployed into clinical practice. This article explores the main challenges and limitations of AI in healthcare, and considers the steps required to translate these potentially transformative technologies from research to clinical practice. Main body: Key challenges for the translation of AI systems in healthcare include those intrinsic to the science of machine learning, logistical difficulties in implementation, and consideration of barriers to adoption as well as of necessary sociocultural or pathway changes. Robust peer-reviewed clinical evaluation as part of randomised controlled trials should be viewed as the gold standard for evidence generation, but conducting these in practice may not always be appropriate or feasible. Performance metrics should aim to capture real clinical applicability and be understandable to intended users. Regulation that balances the pace of innovation with the potential for harm, alongside thoughtful postmarket surveillance, is required to ensure that patients are not exposed to dangerous interventions nor deprived of access to beneficial innovations. Mechanisms to enable direct comparisons of AI systems must be developed, including the use of independent, local and representative test sets. Developers of AI algorithms must be vigilant to potential dangers, including dataset shift, accidental fitting of confounders, unintended discriminatory bias, the challenges of generalisation to new populations, and unintended negative consequences of new algorithms on health outcomes. Conclusions: The safe and timely translation of AI research into clinically validated tools that can benefit everyone is challenging. Further work is required to continue developing robust clinical evaluation and regulatory frameworks using metrics that are intuitive to clinicians, identifying themes of algorithmic bias and unfairness while developing mitigations to address them, reducing brittleness and improving generalisability, and developing methods for improved interpretability of machine learning models. If these goals can be achieved, the benefits for patients are likely to be transformational.