Dr Christopher Kelly

Dr Christopher Kelly

Research Scientist in the Google Health Research team in London, working on medical imaging research.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Concordance of randomised controlled trials for artificial intelligence interventions with the CONSORT-AI reporting guidelines
    Aditya U Kale
    Alastair Dennison
    Alexander Martindale
    An Wen Chan
    Andrew Beam
    Benjamin Ng
    Cecilia S. Lee
    Christopher Yau
    David Moher
    Gary Collins
    Lauren Oakden-Rayner
    Lavinia Ferrante di Ruffano
    Melanie Calvert
    Melissa D McCradden
    Pearse Keane
    Robert Golub
    Samantha Cruz Rivera
    Victoria Ngai
    Xiaoxuan Liu
    Nature Communications(2024)
    Preview abstract The Consolidated Standards of Reporting Trials extension for Artificial Intelligence interventions (CONSORT-AI) was published in September 2020. Since its publication, several randomised controlled trials (RCTs) of AI interventions have been published but their completeness and transparency of reporting is unknown. This systematic review assesses the completeness of reporting of AI RCTs following publication of CONSORT-AI and provides a comprehensive summary of RCTs published in recent years. 65 RCTs were identified, mostly conducted in China (37%) and USA (18%). Median concordance with CONSORT-AI reporting was 90% (IQR 77–94%), although only 10 RCTs explicitly reported its use. Several items were consistently under-reported, including algorithm version, accessibility of the AI intervention or code, and references to a study protocol. Only 3 of 52 included journals explicitly endorsed or mandated CONSORT-AI. Despite a generally high concordance amongst recent AI RCTs, some AI-specific considerations remain systematically poorly reported. Further encouragement of CONSORT-AI adoption by journals and funders may enable more complete adoption of the full CONSORT-AI guidelines. View details
    Artificial intelligence as a second reader for screening mammography
    Etsuji Nakai
    Alessandro Scoccia Pappagallo
    Hiroki Kayama
    Lin Yang
    Shawn Xu
    Timo Kohlberger
    Daniel Golden
    Akib Uddin
    Radiology Advances, 1(2)(2024)
    Preview abstract Background Artificial intelligence (AI) has shown promise in mammography interpretation, and its use as a second reader in breast cancer screening may reduce the burden on health care systems. Purpose To evaluate the performance differences between routine double read and an AI as a second reader workflow (AISR), where the second reader is replaced with AI. Materials and Methods A cohort of patients undergoing routine breast cancer screening at a single center with mammography was retrospectively collected between 2005 and 2021. A model developed on US and UK data was fine-tuned on Japanese data. We subsequently performed a reader study with 10 qualified readers with varied experience (5 reader pairs), comparing routine double read to an AISR workflow. Results A “test set” of 4,059 women (mean age, 56 ± 14 years; 157 positive, 3,902 negative) was collected, with 278 (mean age 55 ± 13 years; 90 positive, 188 negative) evaluated for the reader study. We demonstrate an area under the curve =.84 (95% confidence interval [CI], 0.805-0.881) on the test set, with no significant difference to decisions made in clinical practice (P = .32). Compared with routine double reading, in the AISR arm, sensitivity improved by 7.6% (95% CI, 3.80-11.4; P = .00004) and specificity decreased 3.4% (1.42-5.43; P = .0016), with 71% (212/298) of scans no longer requiring input from a second reader. Variation in recall decision between reader pairs improved from a Cohen kappa of κ = .65 (96% CI, 0.61-0.68) to κ = .74 (96% CI, 0.71-0.77) in the AISR arm. View details
    Enhancing diagnostic accuracy of medical AI systems via selective deferral to clinicians
    Dj Dvijotham
    Melih Barsbey
    Sumedh Ghaisas
    Robert Stanforth
    Nick Pawlowski
    Patricia Strachan
    Zahra Ahmed
    Yoram Bachrach
    Laura Culp
    Jan Freyberg
    Atilla Kiraly
    Timo Kohlberger
    Scott Mayer McKinney
    Basil Mustafa
    Krzysztof Geras
    Jan Witowski
    Zhi Zhen Qin
    Jacob Creswell
    Shravya Shetty
    Terry Spitz
    Taylan Cemgil
    Nature Medicine(2023)
    Preview abstract AI systems trained using deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings1,2. While these results are impressive, they don’t accurately reflect the impact of deployment of such systems in a clinical context. Due to the safety-critical nature of this domain and the fact that AI systems are not perfect and can make inaccurate assessments, they are predominantly deployed as assistive tools for clinical experts3. Although clinicians routinely discuss the diagnostic nuances of medical images with each other, weighing human diagnostic confidence against that of an AI system remains a major unsolved barrier to collaborative decision-making4. Furthermore, it has been observed that diagnostic AI models have complementary strengths and weaknesses compared to clinical experts. Yet, complementarity and the assessment of relative confidence between the members of a diagnostic team has remained largely unexploited in how AI systems are currently used in medical settings5. In this paper, we study the behavior of a team composed of diagnostic AI model(s) and clinician(s) in diagnosing disease. To go beyond the performance level of a standalone AI system, we develop a novel selective deferral algorithm that can learn to decide when to rely on a diagnostic AI model and when to defer to a clinical expert. Using this algorithm, we demonstrate that the composite AI+human system has enhanced accuracy (both sensitivity and specificity) relative to a human-only or an AI-only baseline. We decouple the development of the deferral AI model from training of the underlying diagnostic AI model(s). Development of the deferral AI model only requires i) the predictions of a model(s) on a tuning set of medical images (separate from the diagnostic AI models’ training data), ii) the diagnoses made by clinicians on these images and iii) the ground truth disease labels corresponding to those images. Our extensive analysis shows that the selective deferral (SD) system exceeds the performance of either clinicians or AI alone in multiple clinical settings: breast and lung cancer screening. For breast cancer screening, double-reading with arbitration (two readers interpreting each mammogram invoking an arbitrator if needed) is a “gold standard” for performance, never previously exceeded using AI6. The SD system exceeds the accuracy of double-reading with arbitration in a large representative UK screening program (25% reduction in false positives despite equivalent true-positive detection and 66% reduction in the requirement for clinicians to read an image), as well as exceeding the performance of a standalone state-of-art AI system (40% reduction in false positives with equivalent detection of true positives). In a large US dataset the SD system exceeds the accuracy of single-reading by board-certified radiologists and a standalone state-of-art AI system (32% reduction in false positives despite equivalent detection of true positives and 55% reduction in the clinician workload required). The SD system further outperforms both clinical experts alone, and AI alone for the detection of lung cancer in low-dose Computed Tomography images from a large national screening study, with 11% reduction in false positives while maintaining sensitivity given 93% reduction in clinician workload required. Furthermore, the SD system allows controllable trade-offs between sensitivity and specificity and can be tuned to target either specificity or sensitivity as desired for a particular clinical application, or a combination of both. The system generalizes to multiple distribution shifts, retaining superiority to both the AI system alone and human experts alone. We demonstrate that the SD system retains performance gains even on clinicians not present in the training data for the deferral AI. Furthermore, we test the SD system on a new population where the standalone AI system’s performance significantly degrades. We showcase the few-shot adaptation capability of the SD system by demonstrating that the SD system can obtain superiority to both the standalone AI system and the clinician on the new population after being trained on only 40 cases from the new population. Our comprehensive assessment demonstrates that a selective deferral system could significantly improve clinical outcomes in multiple medical imaging applications, paving the way for higher performance clinical AI systems that can leverage the complementarity between clinical experts and medical AI tools. View details
    Large Language Models Encode Clinical Knowledge
    Karan Singhal
    Sara Mahdavi
    Jason Wei
    Hyung Won Chung
    Nathan Scales
    Ajay Tanwani
    Heather Cole-Lewis
    Perry Payne
    Martin Seneviratne
    Paul Gamble
    Abubakr Abdelrazig Hassan Babiker
    Nathanael Schaerli
    Aakanksha Chowdhery
    Philip Mansfield
    Dina Demner-Fushman
    Katherine Chou
    Juraj Gottweis
    Nenad Tomašev
    Alvin Rajkomar
    Joelle Barral
    Nature(2023)
    Preview abstract Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications. View details
    ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders
    Shawn Xu
    Lin Yang
    Timo Kohlberger
    Martin Ma
    Atilla Kiraly
    Sahar Kazemzadeh
    Zakkai Melamed
    Jungyeon Park
    Patricia MacWilliams
    Chuck Lau
    Christina Chen
    Mozziyar Etemadi
    Sreenivasa Raju Kalidindi
    Kat Chou
    Shravya Shetty
    Daniel Golden
    Rory Pilgrim
    Krish Eswaran
    arxiv(2023)
    Preview abstract Our approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, leverages a language-aligned image encoder combined or grafted onto a fixed LLM, PaLM 2, to perform a broad range of tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI. View details
    Stakeholder Perspectives of Clinical Artificial Intelligence Implementation: Systematic Review of Qualitative Evidence
    Jeffry Hogg
    Mohaimen Al-Zubaidy
    S. James Talks
    Alastair Denniston
    Johann Malawana
    Chrysanthi Papoutsi
    Dawn Teare
    Pearse Keane
    Fiona R. Beyer
    Gregory Maniatopoulos
    JMIR, 23(2023)
    Preview abstract BACKGROUND: The rhetoric surrounding clinical artificial intelligence (AI) often exaggerates its effect on real-world care. Limited understanding of the factors that influence its implementation can perpetuate this. OBJECTIVE: In this qualitative systematic review, we aimed to identify key stakeholders, consolidate their perspectives on clinical AI implementation, and characterize the evidence gaps that future qualitative research should target. METHODS: Ovid-MEDLINE, EBSCO-CINAHL, ACM Digital Library, Science Citation Index-Web of Science, and Scopus were searched for primary qualitative studies on individuals’ perspectives on any application of clinical AI worldwide (January 2014-April 2021). The definition of clinical AI includes both rule-based and machine learning–enabled or non–rule-based decision support tools. The language of the reports was not an exclusion criterion. Two independent reviewers performed title, abstract, and full-text screening with a third arbiter of disagreement. Two reviewers assigned the Joanna Briggs Institute 10-point checklist for qualitative research scores for each study. A single reviewer extracted free-text data relevant to clinical AI implementation, noting the stakeholders contributing to each excerpt. The best-fit framework synthesis used the Nonadoption, Abandonment, Scale-up, Spread, and Sustainability (NASSS) framework. To validate the data and improve accessibility, coauthors representing each emergent stakeholder group codeveloped summaries of the factors most relevant to their respective groups. RESULTS: The initial search yielded 4437 deduplicated articles, with 111 (2.5%) eligible for inclusion (median Joanna Briggs Institute 10-point checklist for qualitative research score, 8/10). Five distinct stakeholder groups emerged from the data: health care professionals (HCPs), patients, carers and other members of the public, developers, health care managers and leaders, and regulators or policy makers, contributing 1204 (70%), 196 (11.4%), 133 (7.7%), 129 (7.5%), and 59 (3.4%) of 1721 eligible excerpts, respectively. All stakeholder groups independently identified a breadth of implementation factors, with each producing data that were mapped between 17 and 24 of the 27 adapted Nonadoption, Abandonment, Scale-up, Spread, and Sustainability subdomains. Most of the factors that stakeholders found influential in the implementation of rule-based clinical AI also applied to non–rule-based clinical AI, with the exception of intellectual property, regulation, and sociocultural attitudes. CONCLUSIONS: Clinical AI implementation is influenced by many interdependent factors, which are in turn influenced by at least 5 distinct stakeholder groups. This implies that effective research and practice of clinical AI implementation should consider multiple stakeholder perspectives. The current underrepresentation of perspectives from stakeholders other than HCPs in the literature may limit the anticipation and management of the factors that influence successful clinical AI implementation. Future research should not only widen the representation of tools and contexts in qualitative research but also specifically investigate the perspectives of all stakeholder HCPs and emerging aspects of non–rule-based clinical AI implementation. View details
    Clinically Applicable Segmentation of Head and Neck Anatomy for Radiotherapy: Deep Learning Algorithm Development and Validation Study
    Stanislav Nikolov
    Sam Blackwell
    Alexei Zverovitch
    Ruheena Mendes
    Michelle Livne
    Jeffrey De Fauw
    Yojan Patel
    Clemens Meyer
    Harry Askham
    Bernardino Romera Paredes
    Carlton Chu
    Dawn Carnell
    Cheng Boon
    Derek D'Souza
    Syed Moinuddin
    Yasmin Mcquinlan
    Sarah Ireland
    Kiarna Hampton
    Krystle Fuller
    Hugh Montgomery
    Geraint Rees
    Mustafa Suleyman
    Trevor John Back
    Cían Hughes
    Olaf Ronneberger
    JMIR(2021)
    Preview abstract Background: Over half a million individuals are diagnosed with head and neck cancer each year globally. Radiotherapy is an important curative treatment for this disease, but it requires manual time to delineate radiosensitive organs at risk. This planning process can delay treatment while also introducing interoperator variability, resulting in downstream radiation dose differences. Although auto-segmentation algorithms offer a potentially time-saving solution, the challenges in defining, quantifying, and achieving expert performance remain. Objective: Adopting a deep learning approach, we aim to demonstrate a 3D U-Net architecture that achieves expert-level performance in delineating 21 distinct head and neck organs at risk commonly segmented in clinical practice. Methods: The model was trained on a data set of 663 deidentified computed tomography scans acquired in routine clinical practice and with both segmentations taken from clinical practice and segmentations created by experienced radiographers as part of this research, all in accordance with consensus organ at risk definitions. Results: We demonstrated the model’s clinical applicability by assessing its performance on a test set of 21 computed tomography scans from clinical practice, each with 21 organs at risk segmented by 2 independent experts. We also introduced surface Dice similarity coefficient, a new metric for the comparison of organ delineation, to quantify the deviation between organ at risk surface contours rather than volumes, better reflecting the clinical task of correcting errors in automated organ segmentations. The model’s generalizability was then demonstrated on 2 distinct open-source data sets, reflecting different centers and countries to model training. Conclusions: Deep learning is an effective and clinically applicable technique for the segmentation of the head and neck anatomy for radiotherapy. With appropriate validation studies and regulatory approvals, this system could improve the efficiency, consistency, and safety of radiotherapy pathways. View details
    Artificial Intelligence in Pediatrics
    Alex Brown
    Jim Taylor
    Artificial Intelligence in Medicine, Springer(2021)
    Preview abstract Pediatrics is a specialty with significant promise for the application of artificial intelligence (AI) technologies, in part due to the richness of its datasets, with relatively more complete longitudinal records and often less heterogeneous patterns of disease compared to adult medicine. Despite considerable overlap with adult medicine, pediatrics presents a distinct set of clinical problems to solve. It is tempting to assume that AI tools developed for adults will easily translate to the pediatric population, where in reality this is unlikely to be the case. The challenges involved in the development of AI tools for healthcare are unfortunately exacerbated in pediatrics, and the implementation gap between how these systems are developed and the setting in which they will be deployed is a real challenge for the next decade. Robust evaluation through high quality clinical study design and clear reporting standards will be essential. This chapter reviews recent work to develop artificial intelligence solutions in pediatrics, including developments across cardiology, respiratory, gastroenterology, neonatology, genetics, endocrinology, ophthalmology, radiology, pediatric intensive care, and radiology specialties. We conclude that AI presents an exciting opportunity to transform aspects of pediatrics at a global scale, democratizing access to subspecialist diagnostic skills, improving quality and efficiency of care, enabling global access to healthcare through sensor-rich Internet-connected mobile devices, and enhancing imaging acquisition to reduce radiation while improving speed and quality. The ultimate challenge will be for pediatricians to find ways to deploy these novel technologies into clinical practice in a way that is safe, effective, and equitable and that ultimately improves outcomes for children. View details
    Quantitative Analysis of OCT for Neovascular Age-Related Macular Degeneration Using Deep Learning
    Gabriella Moraes
    Dun Jack Fu
    Marc Wilson
    Hagar Khalid
    Siegfried Wagner
    Edward Korot
    Robbert Struyven
    Daniel Ferraz
    Livia Faes
    Terry Spitz
    Praveen Patel
    Konstantinos Balaskas
    Tiarnan Keenan
    Pearse Keane
    Reena Chopra
    Ophthalmology(2021)
    Preview abstract Purpose: To apply a deep learning algorithm for automated, objective, and comprehensive quantification of OCT scans to a large real-world dataset of eyes with neovascular age-related macular degeneration (AMD) and make the raw segmentation output data openly available for further research. Design: Retrospective analysis of OCT images from the Moorfields Eye Hospital AMD Database. Participants A total of 2473 first-treated eyes and 493 second-treated eyes that commenced therapy for neovascular AMD between June 2012 and June 2017. Methods: A deep learning algorithm was used to segment all baseline OCT scans. Volumes were calculated for segmented features such as neurosensory retina (NSR), drusen, intraretinal fluid (IRF), subretinal fluid (SRF), subretinal hyperreflective material (SHRM), retinal pigment epithelium (RPE), hyperreflective foci (HRF), fibrovascular pigment epithelium detachment (fvPED), and serous PED (sPED). Analyses included comparisons between first- and second-treated eyes by visual acuity (VA) and race/ethnicity and correlations between volumes. Main Outcome Measures: Volumes of segmented features (mm3) and central subfield thickness (CST) (μm). Results: In first-treated eyes, the majority had both IRF and SRF (54.7%). First-treated eyes had greater volumes for all segmented tissues, with the exception of drusen, which was greater in second-treated eyes. In first-treated eyes, older age was associated with lower volumes for RPE, SRF, NSR, and sPED; in second-treated eyes, older age was associated with lower volumes of NSR, RPE, sPED, fvPED, and SRF. Eyes from Black individuals had higher SRF, RPE, and serous PED volumes compared with other ethnic groups. Greater volumes of the majority of features were associated with worse VA. Conclusions: We report the results of large-scale automated quantification of a novel range of baseline features in neovascular AMD. Major differences between first- and second-treated eyes, with increasing age, and between ethnicities are highlighted. In the coming years, enhanced, automated OCT segmentation may assist personalization of real-world care and the detection of novel structure–function correlations. These data will be made publicly available for replication and future investigation by the AMD research community. View details
    Validation and Clinical Applicability of Whole-Volume Automated Segmentation of Optical Coherence Tomography in Retinal Disease Using Deep Learning
    Marc Wilson
    Reena Chopra
    Megan Zoë Wilson
    Charlotte Cooper
    Patricia MacWilliams
    Daniela Florea
    Cían Hughes
    Hagar Khalid
    Sandra Vermeirsch
    Luke Nicholson
    Pearse Keane
    Konstantinos Balaskas
    JAMA Ophthalmology(2021)
    Preview abstract Importance Quantitative volumetric measures of retinal disease in optical coherence tomography (OCT) scans are infeasible to perform owing to the time required for manual grading. Expert-level deep learning systems for automatic OCT segmentation have recently been developed. However, the potential clinical applicability of these systems is largely unknown. Objective To evaluate a deep learning model for whole-volume segmentation of 4 clinically important pathological features and assess clinical applicability. Design, Setting, Participants This diagnostic study used OCT data from 173 patients with a total of 15 558 B-scans, treated at Moorfields Eye Hospital. The data set included 2 common OCT devices and 2 macular conditions: wet age-related macular degeneration (107 scans) and diabetic macular edema (66 scans), covering the full range of severity, and from 3 points during treatment. Two expert graders performed pixel-level segmentations of intraretinal fluid, subretinal fluid, subretinal hyperreflective material, and pigment epithelial detachment, including all B-scans in each OCT volume, taking as long as 50 hours per scan. Quantitative evaluation of whole-volume model segmentations was performed. Qualitative evaluation of clinical applicability by 3 retinal experts was also conducted. Data were collected from June 1, 2012, to January 31, 2017, for set 1 and from January 1 to December 31, 2017, for set 2; graded between November 2018 and January 2020; and analyzed from February 2020 to November 2020. Main Outcomes and Measures Rating and stack ranking for clinical applicability by retinal specialists, model-grader agreement for voxelwise segmentations, and total volume evaluated using Dice similarity coefficients, Bland-Altman plots, and intraclass correlation coefficients. Results Among the 173 patients included in the analysis (92 [53%] women), qualitative assessment found that automated whole-volume segmentation ranked better than or comparable to at least 1 expert grader in 127 scans (73%; 95% CI, 66%-79%). A neutral or positive rating was given to 135 model segmentations (78%; 95% CI, 71%-84%) and 309 expert gradings (2 per scan) (89%; 95% CI, 86%-92%). The model was rated neutrally or positively in 86% to 92% of diabetic macular edema scans and 53% to 87% of age-related macular degeneration scans. Intraclass correlations ranged from 0.33 (95% CI, 0.08-0.96) to 0.96 (95% CI, 0.90-0.99). Dice similarity coefficients ranged from 0.43 (95% CI, 0.29-0.66) to 0.78 (95% CI, 0.57-0.85). Conclusions and Relevance This deep learning–based segmentation tool provided clinically useful measures of retinal disease that would otherwise be infeasible to obtain. Qualitative evaluation was additionally important to reveal clinical applicability for both care management and research. View details