Shekoofeh Azizi

My research focuses on developing simple and efficient machine learning algorithms that are broadly applicable to the analysis of a range of medical imaging modalities. These algorithms can accelerate the translation of AI solutions into clinical impact and help scale world-class healthcare to everyone.
Authored Publications
    Towards Conversational Diagnostic AI
    Anil Palepu
    Khaled Saab
    Jan Freyberg
    Ryutaro Tanno
    Amy Wang
    Brenna Li
    Nenad Tomašev
    Karan Singhal
    Le Hou
    Albert Webson
    Kavita Kulkarni
    Sara Mahdavi
    Juro Gottweis
    Joelle Barral
    Kat Chou
    arXiv (2024) (to appear)
    At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
    Towards Generalist Biomedical AI
    Danny Driess
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Aakanksha Chowdhery
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    Karan Singhal
    Pete Florence
    NEJM AI (2024)
    BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery. METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports. RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems.
    Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how to best design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components (GPPEs) from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended.
    Large language models (LLMs) hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all.
    Generative models improve fairness of medical classifiers under distribution shifts
    Ira Ktena
    Olivia Wiles
    Isabela Albuquerque
    Sylvestre-Alvise Rebuffi
    Ryutaro Tanno
    Danielle Belgrave
    Taylan Cemgil
    Nature Medicine (2024)
    Domain generalization is a ubiquitous challenge for machine learning in healthcare. Model performance in real-world conditions might be lower than expected because of discrepancies between the data encountered during deployment and development. Underrepresentation of some groups or conditions during model development is a common cause of this phenomenon. This challenge is often not readily addressed by targeted data acquisition and ‘labeling’ by expert clinicians, which can be prohibitively expensive or practically impossible because of the rarity of conditions or the available clinical expertise. We hypothesize that advances in generative artificial intelligence can help mitigate this unmet need in a steerable fashion, enriching our training dataset with synthetic examples that address shortfalls of underrepresented conditions or subgroups. We show that diffusion models can automatically learn realistic augmentations from data in a label-efficient manner. We demonstrate that learned augmentations make models more robust and statistically fair in-distribution and out of distribution. To evaluate the generality of our approach, we studied three distinct medical imaging contexts of varying difficulty: (1) histopathology, (2) chest X-ray and (3) dermatology images. Complementing real samples with synthetic ones improved the robustness of models in all three medical tasks and increased fairness by improving the accuracy of clinical diagnosis within underrepresented groups, especially out of distribution.
    Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation
    Ryutaro Tanno
    David Barrett
    Sumedh Ghaisas
    Sumanth Dathathri
    Abi See
    Johannes Welbl
    Karan Singhal
    Rhys May
    Roy Lee
    SiWai Man
    Zahra Ahmed
    Sara Mahdavi
    Joelle Barral
    Ali Eslami
    Danielle Belgrave
    Shravya Shetty
    Po-Sen Huang
    Ira Ktena
    arXiv (2023)
    Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offers clear potential in ameliorating the situation, the path to real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, Flamingo-CXR, by fine-tuning a well-known vision-language foundation model on radiology data. To evaluate the quality of the AI-generated reports, a group of 16 certified radiologists provide detailed evaluations of AI-generated and human written reports for chest X-rays from an intensive care setting in the United States and an inpatient setting in India. At least one radiologist (out of two per case) preferred the AI report to the ground truth report in over 60% of cases for both datasets. Amongst the subset of AI-generated reports that contain errors, the most frequently cited reasons were related to the location and finding, whereas for human written reports, most mistakes were related to severity and finding. This disparity suggested potential complementarity between our AI system and human experts, prompting us to develop an assistive scenario in which Flamingo-CXR generates a first-draft report, which is subsequently revised by a clinician. This is the first demonstration of clinician-AI collaboration for report writing, and the resultant reports are assessed to be equivalent or preferred by at least one radiologist to reports written by experts alone in 80% of in-patient cases and 60% of intensive care cases.
    Predicting lymph node metastasis from primary tumor histology and clinicopathologic factors in colorectal cancer using deep learning
    Fraser Tan
    Isabelle Flament-Auvigne
    Trissia Brown
    Markus Plass
    Robert Reihs
    Heimo Mueller
    Kurt Zatloukal
    Pema Richeson
    Lily Peng
    Craig Mermel
    Cameron Chen
    Saurabh Gombar
    Thomas Montine
    Jeanne Shen
    Nature Communications Medicine, vol. 3 (2023), pp. 59
    Background: Presence of lymph node metastasis (LNM) influences prognosis and clinical decision-making in colorectal cancer. However, detection of LNM is variable and depends on a number of external factors. Deep learning has shown success in computational pathology, but has struggled to boost performance when combined with known predictors. Methods: Machine-learned features are created by clustering deep learning embeddings of small patches of tumor in colorectal cancer via k-means, and then selecting the top clusters that add predictive value to a logistic regression model when combined with known baseline clinicopathological variables. We then analyze performance of logistic regression models trained with and without these machine-learned features in combination with the baseline variables. Results: The machine-learned extracted features provide independent signal for the presence of LNM (AUROC: 0.638, 95% CI: [0.590, 0.683]). Furthermore, the machine-learned features add predictive value to the set of 6 clinicopathologic variables in an external validation set (likelihood ratio test, p < 0.00032; AUROC: 0.740, 95% CI: [0.701, 0.780]). A model incorporating these features can also further risk-stratify patients with and without identified metastasis (p < 0.001 for both stage II and stage III). Conclusion: This work demonstrates an effective approach to combine deep learning with established clinicopathologic factors in order to identify independently informative features associated with LNM. Further work building on these specific results may have important impact in prognostication and therapeutic decision making for LNM. Additionally, this general computational approach may prove useful in other contexts.
    Large Language Models Encode Clinical Knowledge
    Karan Singhal
    Sara Mahdavi
    Jason Wei
    Hyung Won Chung
    Nathan Scales
    Ajay Tanwani
    Heather Cole-Lewis
    Perry Payne
    Martin Seneviratne
    Paul Gamble
    Abubakr Abdelrazig Hassan Babiker
    Nathanael Schaerli
    Aakanksha Chowdhery
    Philip Mansfield
    Dina Demner-Fushman
    Katherine Chou
    Juraj Gottweis
    Nenad Tomašev
    Alvin Rajkomar
    Joelle Barral
    Nature (2023)
    Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
    Enhancing diagnostic accuracy of medical AI systems via selective deferral to clinicians
    Dj Dvijotham
    Melih Barsbey
    Sumedh Ghaisas
    Robert Stanforth
    Nick Pawlowski
    Patricia Strachan
    Zahra Ahmed
    Yoram Bachrach
    Laura Culp
    Mayank Daswani
    Jan Freyberg
    Atilla Kiraly
    Timo Kohlberger
    Scott Mayer McKinney
    Basil Mustafa
    Krzysztof Geras
    Jan Witowski
    Zhi Zhen Qin
    Jacob Creswell
    Shravya Shetty
    Terry Spitz
    Taylan Cemgil
    Nature Medicine (2023)
    AI systems trained using deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings. While these results are impressive, they don’t accurately reflect the impact of deployment of such systems in a clinical context. Due to the safety-critical nature of this domain and the fact that AI systems are not perfect and can make inaccurate assessments, they are predominantly deployed as assistive tools for clinical experts. Although clinicians routinely discuss the diagnostic nuances of medical images with each other, weighing human diagnostic confidence against that of an AI system remains a major unsolved barrier to collaborative decision-making. Furthermore, it has been observed that diagnostic AI models have complementary strengths and weaknesses compared to clinical experts. Yet, complementarity and the assessment of relative confidence between the members of a diagnostic team has remained largely unexploited in how AI systems are currently used in medical settings. In this paper, we study the behavior of a team composed of diagnostic AI model(s) and clinician(s) in diagnosing disease. To go beyond the performance level of a standalone AI system, we develop a novel selective deferral algorithm that can learn to decide when to rely on a diagnostic AI model and when to defer to a clinical expert. Using this algorithm, we demonstrate that the composite AI+human system has enhanced accuracy (both sensitivity and specificity) relative to a human-only or an AI-only baseline. We decouple the development of the deferral AI model from training of the underlying diagnostic AI model(s). Development of the deferral AI model only requires i) the predictions of a model(s) on a tuning set of medical images (separate from the diagnostic AI models’ training data), ii) the diagnoses made by clinicians on these images and iii) the ground truth disease labels corresponding to those images.
Our extensive analysis shows that the selective deferral (SD) system exceeds the performance of either clinicians or AI alone in multiple clinical settings: breast and lung cancer screening. For breast cancer screening, double-reading with arbitration (two readers interpreting each mammogram invoking an arbitrator if needed) is a “gold standard” for performance, never previously exceeded using AI. The SD system exceeds the accuracy of double-reading with arbitration in a large representative UK screening program (25% reduction in false positives despite equivalent true-positive detection and 66% reduction in the requirement for clinicians to read an image), as well as exceeding the performance of a standalone state-of-art AI system (40% reduction in false positives with equivalent detection of true positives). In a large US dataset the SD system exceeds the accuracy of single-reading by board-certified radiologists and a standalone state-of-art AI system (32% reduction in false positives despite equivalent detection of true positives and 55% reduction in the clinician workload required). The SD system further outperforms both clinical experts alone, and AI alone for the detection of lung cancer in low-dose Computed Tomography images from a large national screening study, with 11% reduction in false positives while maintaining sensitivity given 93% reduction in clinician workload required. Furthermore, the SD system allows controllable trade-offs between sensitivity and specificity and can be tuned to target either specificity or sensitivity as desired for a particular clinical application, or a combination of both. The system generalizes to multiple distribution shifts, retaining superiority to both the AI system alone and human experts alone. We demonstrate that the SD system retains performance gains even on clinicians not present in the training data for the deferral AI.
Furthermore, we test the SD system on a new population where the standalone AI system’s performance significantly degrades. We showcase the few-shot adaptation capability of the SD system by demonstrating that the SD system can obtain superiority to both the standalone AI system and the clinician on the new population after being trained on only 40 cases from the new population. Our comprehensive assessment demonstrates that a selective deferral system could significantly improve clinical outcomes in multiple medical imaging applications, paving the way for higher performance clinical AI systems that can leverage the complementarity between clinical experts and medical AI tools.
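As a rough illustration of the selective deferral idea in the abstract above, the sketch below fits a deferral rule from the three inputs the abstract lists: AI predictions on a tuning set, clinician diagnoses, and ground-truth labels. The rule itself (a score band inside which the case is handed to the clinician, chosen by grid search on accuracy) is an assumption made for illustration and is not the algorithm from the paper. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def apply_deferral(ai_scores: Sequence[float],
                   clinician_labels: Sequence[int],
                   low: float, high: float) -> List[int]:
    """Trust the AI when its score is outside [low, high); otherwise defer."""
    decisions = []
    for score, clinician in zip(ai_scores, clinician_labels):
        if score < low:
            decisions.append(0)          # AI is confident: negative
        elif score >= high:
            decisions.append(1)          # AI is confident: positive
        else:
            decisions.append(clinician)  # uncertain band: defer to clinician
    return decisions


def accuracy(predictions: Sequence[int], truth: Sequence[int]) -> float:
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def fit_deferral_band(ai_scores: Sequence[float],
                      clinician_labels: Sequence[int],
                      truth: Sequence[int],
                      grid: Sequence[float]) -> Tuple[float, float, float]:
    """Grid-search the band (low, high) that maximises tuning-set accuracy.

    Hypothetical objective: the paper targets sensitivity and specificity,
    plain accuracy is used here only to keep the sketch short.
    """
    best_acc, best_low, best_high = -1.0, 0.0, 1.0
    for low in grid:
        for high in grid:
            if low > high:
                continue
            acc = accuracy(
                apply_deferral(ai_scores, clinician_labels, low, high), truth)
            if acc > best_acc:
                best_acc, best_low, best_high = acc, low, high
    return best_low, best_high, best_acc


# Toy tuning set (made-up values, for illustration only).
ai_scores = [0.05, 0.2, 0.45, 0.55, 0.8, 0.95]
clinician_labels = [1, 0, 1, 0, 1, 0]
truth = [0, 0, 1, 0, 1, 1]
grid = [i / 10 for i in range(11)]

low, high, combined_acc = fit_deferral_band(
    ai_scores, clinician_labels, truth, grid)
ai_only_acc = accuracy(
    apply_deferral(ai_scores, clinician_labels, 0.5, 0.5), truth)
clinician_only_acc = accuracy(clinician_labels, truth)
```

On this toy data the AI and the clinician make different mistakes, so a band that routes only the AI's uncertain cases to the clinician can beat either one alone, which is the complementarity the abstract describes.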
    Towards Accurate Differential Diagnosis with Large Language Models
    Daniel McDuff
    Anil Palepu
    Amy Wang
    Karan Singhal
    Yash Sharma
    Kavita Kulkarni
    Le Hou
    Sara Mahdavi
    Sushant Prakash
    Anupam Pathak
    Shwetak Patel
    Ewa Dominowska
    Juro Gottweis
    Joelle Barral
    Kat Chou
    Jake Sunshine
    arXiv (2023)
    An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
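The top-10 accuracy figures quoted in the abstract above count a case as correct when the final diagnosis appears among the first ten entries of the differential list. A minimal sketch of that metric follows. It assumes exact string matching between diagnoses, which is a simplification (in a clinical study the match is typically judged by raters), and the function name and toy cases are hypothetical.

```python
from typing import List, Sequence


def top_k_accuracy(differentials: Sequence[List[str]],
                   final_diagnoses: Sequence[str],
                   k: int = 10) -> float:
    """Fraction of cases whose final diagnosis is in the top-k of the DDx list."""
    hits = sum(
        diagnosis in ranked[:k]
        for ranked, diagnosis in zip(differentials, final_diagnoses)
    )
    return hits / len(final_diagnoses)


# Toy cases (made-up values, for illustration only).
differentials = [
    ["sarcoidosis", "tuberculosis", "lymphoma"],
    ["migraine", "tension headache"],
    ["gout", "septic arthritis", "pseudogout"],
    ["asthma", "COPD"],
]
final_diagnoses = ["tuberculosis", "cluster headache", "pseudogout", "asthma"]

top1 = top_k_accuracy(differentials, final_diagnoses, k=1)
top3 = top_k_accuracy(differentials, final_diagnoses, k=3)
```

Raising k can only keep the score the same or increase it, which is why top-10 accuracy is more forgiving than top-1.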
    Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
    Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging
    Laura Anne Culp
    Jan Freyberg
    Basil Mustafa
    Sebastien Baur
    Simon Kornblith
    Ting Chen
    Patricia MacWilliams
    Sara Mahdavi
    Megan Zoë Walker
    Aaron Loh
    Cameron Chen
    Scott Mayer McKinney
    Zach William Beaver
    Fiona Keleher Ryan
    Mozziyar Etemadi
    Umesh Telang
    Lily Hao Yi Peng
    Geoffrey Everest Hinton
    Mohammad Norouzi
    Nature Biomedical Engineering (2023)
    Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such ‘out-of-distribution’ performance problems and that improves model robustness and training efficiency. The strategy, which we named REMEDIS (for ‘Robust and Efficient Medical Imaging with Self-supervision’), combines large-scale supervised transfer learning on natural images and intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1–33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.
    Joint Debiased Representation and Image Clustering Learning with Self-Supervision
    Fabian Zheng
    JaeEun Nam
    Emilio Dorigatti
    Bernd Bischl
    Mina Rezaei
    arXiv preprint arXiv:2209.06941 (2022)
    Contrastive learning is among the most successful methods for visual representation learning, and its performance can be further improved by jointly performing clustering on the learned representations. However, existing methods for joint clustering and contrastive learning do not perform well on long-tailed data distributions, as majority classes overwhelm and distort the loss of minority classes, thus preventing meaningful representations from being learned. Motivated by this, we develop a novel joint clustering and contrastive learning framework by adapting the debiased contrastive loss to avoid under-clustering minority classes of imbalanced datasets. We show that our proposed modified debiased contrastive loss and divergence clustering loss improve the performance across multiple datasets and learning tasks. The source code is available at https://anonymous.4open.science/r/SSL-debiased-clustering
    Big Self-Supervised Models Advance Medical Image Classification
    Basil Mustafa
    Fiona Ryan
    Zachary Beaver
    Jan Freyberg
    Jonathan Deaton
    Aaron Loh
    Simon Kornblith
    Ting Chen
    Mohammad Norouzi
    International Conference on Computer Vision (2021)
    Self-supervised pretraining followed by supervised fine-tuning has seen success in image recognition, especially when labeled examples are scarce, but has received limited attention in medical image analysis. This paper studies the effectiveness of self-supervised learning as a pretraining strategy for medical image classification. We conduct experiments on two distinct tasks: dermatology skin condition classification from digital camera images and multi-label chest X-ray classification, and demonstrate that self-supervised learning on ImageNet, followed by additional self-supervised learning on unlabeled domain-specific medical images significantly improves the accuracy of medical image classifiers. We introduce a novel Multi-Instance Contrastive Learning (MICLe) method that uses multiple images of the underlying pathology per patient case, when available, to construct more informative positive pairs for self-supervised learning. Combining our contributions, we achieve an improvement of 6.7% in top-1 accuracy and an improvement of 1.1% in mean AUC on dermatology and chest X-ray classification respectively, outperforming strong supervised baselines pretrained on ImageNet. In addition, we show that big self-supervised models are robust to distribution shift and can learn efficiently with a small number of labeled medical images.
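The MICLe idea in the abstract above, forming positive pairs from different images of the same patient case when more than one is available, can be sketched as follows. This is a minimal illustration and not the published implementation: the random choice of two images per case, the fallback of pairing a single image with itself (on the assumption that two different augmentations are applied later), and all names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def build_positive_pairs(cases: Dict[str, List[str]],
                         rng: random.Random) -> List[Tuple[str, str]]:
    """Return one positive pair of image IDs per patient case."""
    pairs = []
    for case_id, images in cases.items():
        if not images:
            continue  # nothing to pair for this case
        if len(images) >= 2:
            # Multi-instance pair: two distinct images of the same case.
            first, second = rng.sample(images, 2)
        else:
            # Single image: pair it with itself; a training pipeline would
            # then create two different augmented views (assumed here).
            first = second = images[0]
        pairs.append((first, second))
    return pairs


# Toy data (made-up image IDs, for illustration only).
cases = {
    "case_a": ["a_front.jpg", "a_side.jpg", "a_closeup.jpg"],
    "case_b": ["b_only.jpg"],
}
pairs = build_positive_pairs(cases, random.Random(0))
```

Each pair would then be fed to a standard contrastive loss, with the two images treated as views of the same underlying pathology.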
    Supervised deep learning models have proven to be highly effective in classification of dermatological conditions. These models rely on the availability of abundant labeled training examples. However, in the real world, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and therefore are clinically significant in aggregate. To avoid models generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect such infrequent conditions. These infrequent ‘outlier’ conditions are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between model train, validation, and test sets. Unlike most traditional OOD benchmarks which detect dataset distribution shift, we aim at detecting semantic differences, often referred to as near-OOD detection which is a more difficult task. We propose a novel hierarchical outlier detection (HOD) approach, which assigns multiple abstention classes for each training outlier class and jointly performs a coarse classification of inliers vs. outliers, along with fine-grained classification of the individual classes. We demonstrate that the proposed HOD outperforms existing techniques for outlier exposure based OOD detection. We also use different state-of-the-art representation learning approaches (BiT-JFT, SimCLR, MICLe) to improve OOD performance and demonstrate the effectiveness of HOD loss for them. Further, we explore different ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best result.
We also performed a subgroup analysis over conditions of varying risk levels and different skin types to investigate how OOD performance changes over each subgroup and demonstrated the gains of our framework in comparison to baselines. Furthermore, we go beyond traditional performance metrics and introduce a cost metric to approximate downstream clinical impact. We used this cost metric to compare the proposed method against the baseline, thereby making a stronger case for its effectiveness in real-world deployment scenarios.
    Deep Bregman divergences measure the divergence of data points using neural networks, going beyond Euclidean distance and capturing divergence over distributions. In this paper, we propose deep Bregman divergences for contrastive learning of visual representations, where we aim to enhance the contrastive loss used in self-supervised learning by training an additional network based on functional Bregman divergence. In contrast to conventional contrastive learning methods, which are solely based on divergences between single points, our framework can capture the divergence between distributions, which improves the quality of the learned representation. By combining conventional contrastive loss with the proposed contrastive divergence loss, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on multiple classification and object detection tasks and datasets.