Jump to Content
Jim Winkens

Jim Winkens

Researcher in the Google Health Research team in London, working on safety-first and collaborative AI, uncertainty estimation and data efficiency.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Enhancing diagnostic accuracy of medical AI systems via selective deferral to clinicians
    Dj Dvijotham
    Melih Barsbey
    Sumedh Ghaisas
    Robert Stanforth
    Nick Pawlowski
    Patricia Strachan
    Zahra Ahmed
    Yoram Bachrach
    Laura Culp
    Mayank Daswani
    Jan Freyberg
    Atilla Kiraly
    Timo Kohlberger
    Scott Mayer McKinney
    Basil Mustafa
    Krzysztof Geras
    Jan Witowski
    Zhi Zhen Qin
    Jacob Creswell
    Shravya Shetty
    Terry Spitz
    Taylan Cemgil
    Nature Medicine (2023)
    Preview abstract AI systems trained using deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings1,2. While these results are impressive, they don’t accurately reflect the impact of deployment of such systems in a clinical context. Due to the safety-critical nature of this domain and the fact that AI systems are not perfect and can make inaccurate assessments, they are predominantly deployed as assistive tools for clinical experts3. Although clinicians routinely discuss the diagnostic nuances of medical images with each other, weighing human diagnostic confidence against that of an AI system remains a major unsolved barrier to collaborative decision-making4. Furthermore, it has been observed that diagnostic AI models have complementary strengths and weaknesses compared to clinical experts. Yet, complementarity and the assessment of relative confidence between the members of a diagnostic team has remained largely unexploited in how AI systems are currently used in medical settings5. In this paper, we study the behavior of a team composed of diagnostic AI model(s) and clinician(s) in diagnosing disease. To go beyond the performance level of a standalone AI system, we develop a novel selective deferral algorithm that can learn to decide when to rely on a diagnostic AI model and when to defer to a clinical expert. Using this algorithm, we demonstrate that the composite AI+human system has enhanced accuracy (both sensitivity and specificity) relative to a human-only or an AI-only baseline. We decouple the development of the deferral AI model from training of the underlying diagnostic AI model(s). Development of the deferral AI model only requires i) the predictions of a model(s) on a tuning set of medical images (separate from the diagnostic AI models’ training data), ii) the diagnoses made by clinicians on these images and iii) the ground truth disease labels corresponding to those images. Our extensive analysis shows that the selective deferral (SD) system exceeds the performance of either clinicians or AI alone in multiple clinical settings: breast and lung cancer screening. For breast cancer screening, double-reading with arbitration (two readers interpreting each mammogram invoking an arbitrator if needed) is a “gold standard” for performance, never previously exceeded using AI6. The SD system exceeds the accuracy of double-reading with arbitration in a large representative UK screening program (25% reduction in false positives despite equivalent true-positive detection and 66% reduction in the requirement for clinicians to read an image), as well as exceeding the performance of a standalone state-of-art AI system (40% reduction in false positives with equivalent detection of true positives). In a large US dataset the SD system exceeds the accuracy of single-reading by board-certified radiologists and a standalone state-of-art AI system (32% reduction in false positives despite equivalent detection of true positives and 55% reduction in the clinician workload required). The SD system further outperforms both clinical experts alone, and AI alone for the detection of lung cancer in low-dose Computed Tomography images from a large national screening study, with 11% reduction in false positives while maintaining sensitivity given 93% reduction in clinician workload required. Furthermore, the SD system allows controllable trade-offs between sensitivity and specificity and can be tuned to target either specificity or sensitivity as desired for a particular clinical application, or a combination of both. The system generalizes to multiple distribution shifts, retaining superiority to both the AI system alone and human experts alone. We demonstrate that the SD system retains performance gains even on clinicians not present in the training data for the deferral AI. Furthermore, we test the SD system on a new population where the standalone AI system’s performance significantly degrades. We showcase the few-shot adaptation capability of the SD system by demonstrating that the SD system can obtain superiority to both the standalone AI system and the clinician on the new population after being trained on only 40 cases from the new population. Our comprehensive assessment demonstrates that a selective deferral system could significantly improve clinical outcomes in multiple medical imaging applications, paving the way for higher performance clinical AI systems that can leverage the complementarity between clinical experts and medical AI tools. View details
    Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging
    Laura Anne Culp
    Jan Freyberg
    Basil Mustafa
    Sebastien Baur
    Simon Kornblith
    Ting Chen
    Patricia MacWilliams
    Sara Mahdavi
    Megan Zoë Walker
    Aaron Loh
    Cameron Chen
    Scott Mayer McKinney
    Zach William Beaver
    Fiona Keleher Ryan
    Mozziyar Etemadi
    Umesh Telang
    Lily Hao Yi Peng
    Geoffrey Everest Hinton
    Mohammad Norouzi
    Nature Biomedical Engineering (2023)
    Preview abstract Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such ‘out of distribution’ performance problem and that improves model robustness and training efficiency. The strategy, which we named REMEDIS (for ‘Robust and Efficient Medical Imaging with Self-supervision’), combines large-scale supervised transfer learning on natural images and intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1–33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging. View details
    Preview abstract Supervised deep learning models have proven to be highly effective in classification of dermatological conditions. These models rely on the availability of abundant labeled training examples. However, in the real world, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and therefore are clinically significant in aggregate. To avoid models generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect such infrequent conditions. These infrequent `outlier' conditions are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between model train, validation, and test sets. Unlike most traditional OOD benchmarks which detect dataset distribution shift, we aim at detecting semantic differences, often referred to as near-OOD detection which is a more difficult task. We propose a novel hierarchical outlier detection (HOD) approach, which assigns multiple abstention classes for each training outlier class and jointly performs a coarse classification of inliers \vs{} outliers, along with fine-grained classification of the individual classes. We demonstrate that the proposed HOD outperforms existing techniques for outlier exposure based OOD detection. We also use different state-of-the-art representation learning approaches (BiT-JFT, SimCLR, MICLe) to improve OOD performance and demonstrate the effectiveness of HOD loss for them. Further, we explore different ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best result. We also performed a subgroup analysis over conditions of varying risk levels and different skin types to investigate how OOD performance changes over each subgroup and demonstrated the gains of our framework in comparison to baselines. Furthermore, we go beyond traditional performance metrics and introduce a cost metric to approximate downstream clinical impact. We used this cost metric to compare the proposed method against the baseline, thereby making a stronger case for its effectiveness in real-world deployment scenarios. View details
    Supervised Transfer Learning at Scale for Medical Imaging
    Aaron Loh
    Basil Mustafa
    Jan Freyberg
    Patricia MacWilliams
    Megan Wilson
    Scott Mayer McKinney
    Peggy Bui
    Umesh Telang
    ArXiV (2021)
    Preview abstract Transfer learning is a standard building block of successful medical imaging models, yet previous efforts suggest that at limited scale of pre-training data and model capacity, benefits of transfer learning to medical imaging are insubstantial. In this work, we explore whether scaling up pre-training can help improve transfer to medical tasks. In particular, we show that when using the Big Transfer recipe to further scale up pre-training, we can indeed considerably improve transfer performance across three popular yet diverse medical imaging tasks - interpretation of chest radiographs, breast cancer detection from mammograms and skin condition detection from smartphone images. Despite pre-training on unrelated source domains, we show that scaling up the model capacity and pre-training data yields performance improvements regardless of how much downstream medical data is available. In particular, we show suprisingly large improvements to zero-shot generalisation under distribution shift. Probing and quantifying other aspects of model performance relevant to medical imaging and healthcare, we demonstrate that these gains do not come at the expense of model calibration or fairness. View details
    Predicting conversion to wet age-related macular degeneration using deep learning
    Jason Yim
    Reena Chopra
    Terry Spitz
    Annette Obika
    Harry Askham
    Marko Lukic
    Josef Huemer
    Katrin Fasler
    Gabriella Moraes
    Clemens Meyer
    Marc Wilson
    Jonathan Mark Dixon
    Cían Hughes
    Geraint Rees
    Peng Khaw
    Dominic King
    Demis Hassabis
    Mustafa Suleyman
    Trevor John Back
    Pearse Keane
    Jeffrey De Fauw
    Nature Medicine (2020)
    Preview abstract Progression to exudative ‘wet’ age-related macular degeneration (exAMD) is a major cause of visual deterioration. In patients diagnosed with exAMD in one eye, we introduce an artificial intelligence (AI) system to predict progression to exAMD in the second eye. By combining models based on three-dimensional (3D) optical coherence tomography images and corresponding automatic tissue maps, our system predicts conversion to exAMD within a clinically actionable 6-month time window, achieving a per-volumetric-scan sensitivity of 80% at 55% specificity, and 34% sensitivity at 90% specificity. This level of performance corresponds to true positives in 78% and 41% of individual eyes, and false positives in 56% and 17% of individual eyes at the high sensitivity and high specificity points, respectively. Moreover, we show that automatic tissue segmentation can identify anatomical changes before conversion and high-risk subgroups. This AI system overcomes substantial interobserver variability in expert predictions, performing better than five out of six experts, and demonstrates the potential of using AI to predict disease progression. View details
    No Results Found