Greg Corrado

Greg Corrado is a senior research scientist interested in biological neuroscience, artificial intelligence, and scalable machine learning. He has published in fields spanning behavioral economics, neuromorphic device physics, systems neuroscience, and deep learning. At Google he has worked on brain-inspired computing and most recently served as a founding member and co-technical lead of Google's large-scale deep neural networks project.
Authored Publications
    Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study
    Mike Schaekermann
    Terry Spitz
    Malcolm Chelliah
    Heather Cole-Lewis
    Yuan Liu
    Stephanie Farquhar
    Qinghan Xue
    Jenna Lester
    Cían Hughes
    Patricia Strachan
    Fraser Tan
    Peggy Bui
    Craig Mermel
    Lily Peng
    Sunny Virmani
    Christopher Semturs
    Ivor Horn
    Cameron Chen
    The Lancet eClinicalMedicine (2024)
    Abstract: Background: Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study. Methods: Here, we propose a methodology to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes, one that is complementary to existing fairness metrics. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework, designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process to understand and quantify domain-specific criteria, and the resulting HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients aged 20 years or older, submitted by primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes than for others. The likelihood that AI performance was anticorrelated with pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient (i.e., "R") was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p(R > 0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes (presented as a percentage below). Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of 3 dermatologists per case. Findings: Across all dermatologic conditions, the HEAL metric was 80.5% for prioritising AI performance of racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0%, respectively, for prioritising AI performance of sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared with a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritising AI performance of age subpopulations based on DALYs. Interpretation: Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes.
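    The HEAL metric above reduces to a short computation: bootstrap-resample cases, compute per-subgroup AI performance, correlate it with subgroup health burden (Spearman), negate, and report the fraction of resamples with a positive correlation. A minimal sketch follows; all subgroup labels, burden values, and per-case outcomes are invented for illustration, not taken from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy per-case data for one grouping axis (e.g., four age bands): subgroup id,
# per-subgroup health burden (e.g., DALYs), and per-case top-3 agreement.
subgroups = np.tile(np.arange(4), 200)        # 800 cases across 4 subgroups
burden = np.array([10.0, 14.0, 22.0, 30.0])   # hypothetical burden per subgroup
agreement = rng.random(subgroups.size) < 0.8  # hypothetical per-case AI correctness

def heal_metric(subgroups, agreement, burden, n_boot=1000):
    """Estimate p(R > 0), with R the negated Spearman correlation between
    subgroup health burden and subgroup AI performance."""
    n = subgroups.size
    r = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # bootstrap resample of cases
        s, a = subgroups[idx], agreement[idx]
        perf = [a[s == g].mean() for g in range(len(burden))]
        rho, _ = spearmanr(burden, perf)
        r[b] = -rho                           # negate: higher burden, better AI
    return (r > 0).mean()

print(f"HEAL metric: {heal_metric(subgroups, agreement, burden):.1%}")
```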
    A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models
    Heather Cole-Lewis
    Rory Sayres
    Nenad Tomašev
    Liam McCoy
    Leo Anthony Celi
    Mike Schaekermann
    Alanna Walton
    Preeti Singh
    Akeiylah DeWitt
    Philip Mansfield
    Sushant Prakash
    Christopher Semturs
    Joelle Barral
    Ivor Horn
    Karan Singhal
    arXiv (2024) (to appear)
    Abstract: Large language models (LLMs) hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly released datasets comprising both manually curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all.
    Towards Conversational Diagnostic AI
    Anil Palepu
    Mike Schaekermann
    Khaled Saab
    Jan Freyberg
    Ryutaro Tanno
    Amy Wang
    Brenna Li
    Nenad Tomašev
    Karan Singhal
    Le Hou
    Albert Webson
    Kavita Kulkarni
    Sara Mahdavi
    Christopher Semturs
    Juro Gottweis
    Joelle Barral
    Kat Chou
    arXiv (2024) (to appear)
    Abstract: At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text chat, which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
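    The self-play setup described above can be caricatured as two LLM roles conversing in a simulated environment, with a critic supplying automated feedback for refinement. The loop below is a heavily simplified sketch of that general pattern, not AMIE's actual design; `llm` is a placeholder for any model call.

```python
def llm(prompt: str) -> str:
    # Placeholder for an LLM call; returns canned text so the sketch runs.
    return "..."

def self_play_episode(condition: str, turns: int = 4) -> list[str]:
    """Simulate a consultation between a patient agent and a doctor agent."""
    transcript: list[str] = []
    for _ in range(turns):
        patient = llm(f"Act as a patient with {condition}. So far: {transcript}")
        transcript.append(f"Patient: {patient}")
        doctor = llm(f"Act as a diagnostician taking a history. So far: {transcript}")
        transcript.append(f"Doctor: {doctor}")
    return transcript

def automated_feedback(transcript: list[str]) -> str:
    # A critic model scores history-taking, accuracy, and empathy; this signal
    # would drive refinement (e.g., fine-tuning) of the doctor agent.
    return llm(f"Critique this consultation and suggest improvements: {transcript}")

feedback = automated_feedback(self_play_episode("suspected angina"))
```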
    Differences between Patient and Clinician Submitted Images: Implications for Virtual Care of Skin Conditions
    Rajeev Rikhye
    Grace Eunhae Hong
    Preeti Singh
    Margaret Ann Smith
    Aaron Loh
    Vijaytha Muralidharan
    Doris Wong
    Rory Sayres
    Michelle Phung
    Nicolas Betancourt
    Bradley Fong
    Rachna Sahasrabudhe
    Khoban Nasim
    Alec Eschholz
    Kat Chou
    Peggy Bui
    Yuan Liu
    Justin Ko
    Steven Lin
    Mayo Clinic Proceedings: Digital Health (2024)
    Abstract: Objective: To understand and highlight the differences in clinical, demographic, and image quality characteristics between patient-taken (PAT) and clinic-taken (CLIN) photographs of skin conditions. Patients and Methods: This retrospective study applied logistic regression to data from 2500 deidentified cases in Stanford Health Care’s eConsult system, from November 2015 to January 2021. Cases with undiagnosable or multiple conditions, or with both patient and clinician image sources, were excluded, leaving 628 PAT cases and 1719 CLIN cases. Demographic factors such as age and sex were self-reported, whereas anatomic location, estimated skin type, clinical signs and symptoms, condition duration, and condition frequency were summarized from patient health records. Image quality variables such as blur, lighting issues, and whether the image contained skin, hair, or nails were estimated through a deep learning model. Results: Factors that were positively associated with CLIN photographs post-2020 were as follows: age 60 years or older, darker skin types (eFST V/VI), and presence of skin growths. By contrast, factors that were positively associated with PAT photographs included conditions appearing intermittently, cases with blurry photographs, photographs with substantial nonskin (or nail/hair) regions, and cases with more than 3 photographs. Within the PAT cohort, older age was associated with blurry photographs. Conclusion: There are various demographic, clinical, and image quality differences between PAT and CLIN photographs of skin concerns. The demographic differences present important considerations for improving digital literacy or access, whereas the image quality differences point to the need for improved patient education and better image capture workflows, particularly among elderly patients.
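    The analysis described is a standard logistic regression over case-level factors. A toy sketch of that style of analysis, with invented variable names and simulated data (not the study's), might look like this:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "age_60_plus": rng.integers(0, 2, n),
    "blurry": rng.integers(0, 2, n),
    "intermittent_condition": rng.integers(0, 2, n),
})
# Outcome: 1 = clinic-taken (CLIN) photo, 0 = patient-taken (PAT) photo;
# simulated so the fitted odds ratios have something to recover.
logit = 0.8 * df["age_60_plus"] - 0.6 * df["blurry"] - 0.5 * df["intermittent_condition"]
df["clin_source"] = rng.random(n) < 1 / (1 + np.exp(-logit))

X = sm.add_constant(df[["age_60_plus", "blurry", "intermittent_condition"]])
model = sm.Logit(df["clin_source"].astype(float), X.astype(float)).fit(disp=0)
print(np.exp(model.params))  # odds ratios per factor
```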
    Towards Generalist Biomedical AI
    Danny Driess
    Mike Schaekermann
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Aakanksha Chowdhery
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Christopher Semturs
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    Karan Singhal
    Pete Florence
    NEJM AI (2024)
    Abstract: BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery. METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports. RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems.
    Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the United States and Japan
    Atilla Kiraly
    Corbin Cunningham
    Ryan Najafi
    Jie Yang
    Chuck Lau
    Diego Ardila
    Scott Mayer McKinney
    Rory Pilgrim
    Mozziyar Etemadi
    Sunny Jansen
    Lily Peng
    Shravya Shetty
    Neeral Beladia
    Krish Eswaran
    Radiology: Artificial Intelligence (2024)
    Abstract: Lung cancer is the leading cause of cancer death worldwide, with 1.8 million deaths in 2020. Studies have concluded that low-dose computed tomography lung cancer screening can reduce mortality by up to 61%, and updated 2021 US guidelines expanded eligibility. As screening efforts rise, AI can play an important role, but must be unobtrusively integrated into existing clinical workflows. In this work, we introduce a state-of-the-art, cloud-based AI system providing lung cancer risk assessments without requiring any user input. We demonstrate its efficacy in assisting lung cancer screening under both US and Japanese screening settings, using different patient populations and screening protocols. Technical improvements over a previously described system include a focus on earlier cancer detection for improved accuracy, introduction of an effective assistive user interface, and a system designed to integrate into typical clinical workflows. The stand-alone AI system was evaluated on 3085 individuals, achieving area under the curve (AUC) scores of 91.7% (95% CI [89.6, 95.2]), 93.3% (95% CI [90.2, 95.7]), and 89.1% (95% CI [77.7, 97.3]) on three datasets (two from the US and one from Japan), respectively. To evaluate the system’s assistive ability, we conducted two retrospective multi-reader multi-case studies on 627 cases read by experienced board-certified radiologists (average 20 years of experience; range 7–40) using local PACS systems in the respective US and Japanese screening settings. The studies measured the readers’ level of suspicion (LoS) and categorical responses for scores and management recommendations under country-specific screening protocols. The radiologists’ AUC for LoS increased with AI assistance by 2.3% (95% CI [0.1, 4.5], p=0.022) for the US study and by 2.3% (95% CI [−3.5, 8.1], p=0.179) for the Japan study. Specificity for recalls increased by 5.5% (95% CI [2.7, 8.5], p<0.0001) for the US study and 6.7% (95% CI [4.7, 8.7], p<0.0001) for the Japan study. No significant reduction in other metrics occurred. This work advances the state of the art in lung cancer detection, introduces generalizable interface concepts applicable to similar AI applications, and demonstrates potential impact on diagnostic AI in global lung cancer screening, with results suggesting a substantial drop in unnecessary follow-up procedures without impacting sensitivity.
    Crowdsourcing Dermatology Images with Google Search Ads: Creating a Diverse and Representative Dataset of Real-World Skin Conditions
    Abbi Ward
    Ashley Carrick
    Christopher Semturs
    Dawn Siegel
    Jay Hartford
    Jimmy Li
    Julie Wang
    Justin Ko
    Pradeep Kumar S
    Renee Wong
    Sriram Lakshminarasimhan
    Steven Lin
    Sunny Virmani
    arXiv (2024)
    Abstract: Background: Health datasets from clinical sources do not reflect the breadth and diversity of disease in the real world, impacting research, medical education, and artificial intelligence (AI) tool development. Dermatology is a suitable area in which to develop and test a new and scalable method to create representative health datasets. Methods: We used Google Search advertisements to solicit contributions of images of dermatology conditions, along with demographic and symptom information, from internet users in the United States (US) over 265 days starting March 2023. With informed contributor consent, we described and released this dataset containing 10,106 images from 5058 contributions, with dermatologist labels as well as Fitzpatrick Skin Type and Monk Skin Tone labels for the images. Results: We received 22 ± 14 submissions/day over 265 days. Female contributors (66.04%) and younger individuals (52.3% under age 40) had a higher representation in the dataset compared with the US population, and 36.6% of contributors had a non-White racial or ethnic identity. Over 97.5% of contributions were genuine images of skin conditions. Image quality had no impact on dermatologist confidence in assigning a differential diagnosis. The dataset consists largely of short-duration (54% with onset less than 7 days prior) allergic, infectious, and inflammatory conditions. The Fitzpatrick skin type distribution is well balanced, considering the geographical origin of the dataset and the absence of enrichment for population groups or skin tones. Interpretation: Search ads are effective at crowdsourcing images of health conditions. The SCIN dataset bridges important gaps in the availability of representative images of common skin conditions.
    Abstract: Importance: Interest in artificial intelligence (AI) has reached an all-time high, and health care leaders across the ecosystem are faced with questions about where, when, and how to deploy AI and how to understand its risks, problems, and possibilities. Observations: While AI as a concept has existed since the 1950s, all AI is not the same. Capabilities and risks of various kinds of AI differ markedly, and on examination 3 epochs of AI emerge. AI 1.0 includes symbolic AI, which attempts to encode human knowledge into computational rules, as well as probabilistic models. The era of AI 2.0 began with deep learning, in which models learn from examples labeled with ground truth. This era brought about many advances both in people’s daily lives and in health care. Deep learning models are task-specific, meaning they do one thing at a time, and they primarily focus on classification and prediction. AI 3.0 is the era of foundation models and generative AI. Models in AI 3.0 have fundamentally new (and potentially transformative) capabilities, as well as new kinds of risks, such as hallucinations. These models can do many different kinds of tasks without being retrained on a new dataset. For example, a simple text instruction will change the model’s behavior. Prompts such as “Write this note for a specialist consultant” and “Write this note for the patient’s mother” will produce markedly different content. Conclusions and Relevance: Foundation models and generative AI represent a major revolution in AI’s capabilities, offering tremendous potential to improve care. Health care leaders are making decisions about AI today. While any heuristic omits details and loses nuance, the framework of AI 1.0, 2.0, and 3.0 may be helpful to decision-makers because each epoch has fundamentally different capabilities and risks.
    An intentional approach to managing bias in embedding models
    Atilla P. Kiraly
    Alexander D'Amour
    Jungyeon Park
    Rory Pilgrim
    Charles Lau
    Heather Cole-Lewis
    Shravya Shetty
    Krish Eswaran
    Leo Anthony Celi
    The Lancet Digital Health, vol. 6 (2024), E126-E130
    Abstract: Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how best to design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well-intentioned attempts to prevent the upstream components (GPPEs) from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended.
    ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders
    Shawn Xu
    Lin Yang
    Timo Kohlberger
    Martin Ma
    Atilla Kiraly
    Sahar Kazemzadeh
    Zakkai Melamed
    Jungyeon Park
    Patricia MacWilliams
    Chuck Lau
    Preeti Singh
    Christina Chen
    Mozziyar Etemadi
    Sreenivasa Raju Kalidindi
    Kat Chou
    Shravya Shetty
    Daniel Golden
    Rory Pilgrim
    Krish Eswaran
    arXiv (2023)
    Abstract: Our approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, leverages a language-aligned image encoder combined with, or grafted onto, a fixed LLM, PaLM 2, to perform a broad range of tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.
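    The core architectural idea, pairing an image encoder with a frozen LLM through a small trained adapter, can be sketched in a few lines of PyTorch. The dimensions, token count, and module below are illustrative assumptions, not ELIXR's actual configuration:

```python
import torch
import torch.nn as nn

class ImageToLLMAdapter(nn.Module):
    """Maps pooled image-encoder outputs to a handful of pseudo 'image tokens'
    in the frozen LLM's embedding space. Dimensions are toy-sized here."""
    def __init__(self, img_dim=768, llm_dim=512, n_tokens=8):
        super().__init__()
        self.n_tokens, self.llm_dim = n_tokens, llm_dim
        self.proj = nn.Sequential(
            nn.Linear(img_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, n_tokens * llm_dim),
        )

    def forward(self, img_emb):                          # [batch, img_dim]
        out = self.proj(img_emb)                         # [batch, n_tokens * llm_dim]
        return out.view(-1, self.n_tokens, self.llm_dim)

# Only the adapter trains; its output tokens would be prepended to the
# report-text embeddings consumed by the frozen LLM.
adapter = ImageToLLMAdapter()
image_tokens = adapter(torch.randn(4, 768))
print(image_tokens.shape)                                # torch.Size([4, 8, 512])
```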
    Towards Physician-Level Medical Question Answering with Large Language Models
    Karan Singhal
    Juro Gottweis
    Rory Sayres
    Le Hou
    Kevin Clark
    Heather Cole-Lewis
    Mike Schaekermann
    Amy Wang
    Sami Lachgar
    Philip Mansfield
    Sushant Prakash
    Bradley Green
    Ewa Dominowska
    Nenad Tomašev
    Renee Wong
    Christopher Semturs
    Sara Mahdavi
    Joelle Barral
    arXiv (2023) (to appear)
    Abstract: Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score on US Medical Licensing Examination (USMLE) style questions, with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state of the art. We also observed performance approaching or exceeding the state of the art across the MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions designed to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
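    The ensemble refinement strategy mentioned above can be read as a two-stage loop: sample several reasoning chains at nonzero temperature, then condition a low-temperature pass on all of them. The sketch below is a schematic reading of that idea, not the paper's actual prompts; `generate` is a placeholder, not a real API:

```python
def generate(prompt: str, temperature: float = 0.7) -> str:
    # Placeholder for an LLM call; returns canned text so the sketch runs.
    return "Step-by-step reasoning... Final answer: X"

def ensemble_refine(question: str, n_samples: int = 8) -> str:
    # Stage 1: sample several chain-of-thought drafts at nonzero temperature.
    drafts = [generate(f"Q: {question}\nThink step by step, then answer.",
                       temperature=0.7) for _ in range(n_samples)]
    # Stage 2: condition a low-temperature pass on the full ensemble of drafts.
    joined = "\n---\n".join(drafts)
    prompt = (f"Q: {question}\nCandidate reasoning chains:\n{joined}\n"
              "Weigh their agreements and errors, then give one final answer.")
    return generate(prompt, temperature=0.0)

print(ensemble_refine("What is the most likely diagnosis given ...?"))
```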
    Towards Accurate Differential Diagnosis with Large Language Models
    Daniel McDuff
    Mike Schaekermann
    Anil Palepu
    Amy Wang
    Karan Singhal
    Yash Sharma
    Kavita Kulkarni
    Le Hou
    Sara Mahdavi
    Sushant Prakash
    Anupam Pathak
    Christopher Semturs
    Shwetak Patel
    Ewa Dominowska
    Juro Gottweis
    Joelle Barral
    Kat Chou
    Jake Sunshine
    arXiv (2023)
    Abstract: An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations, and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning and evaluate its ability to generate a DDx alone or as an aid to clinicians. Twenty clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, p = 0.04). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
    Large Language Models Encode Clinical Knowledge
    Katherine Chou
    Juraj Gottweis
    Nenad Tomašev
    Alvin Rajkomar
    Joelle Barral
    Christopher Semturs
    Karan Singhal
    Sara Mahdavi
    Jason Wei
    Hyung Won Chung
    Nathan Scales
    Ajay Tanwani
    Heather Cole-Lewis
    Perry Payne
    Martin Seneviratne
    Paul Gamble
    Abubakr Abdelrazig Hassan Babiker
    Nathanael Schaerli
    Aakanksha Chowdhery
    Philip Mansfield
    Dina Demner-Fushman
    Nature (2023)
    Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research, and consumer queries, and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm, and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall, and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
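    Prompt tuning, as described, trains only a small set of continuous prompt vectors while the LLM's weights stay frozen. Below is a generic PyTorch sketch of the soft-prompt mechanism; the sizes and initialization are illustrative, and this is not the Med-PaLM code:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt vectors prepended to a frozen LLM's input embeddings."""
    def __init__(self, n_prompt_tokens=20, embed_dim=512):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeddings):             # [batch, seq, embed_dim]
        batch = token_embeddings.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)

# During training, only soft_prompt.parameters() receive gradient updates;
# the LLM that consumes the concatenated sequence is kept frozen.
soft_prompt = SoftPrompt()
x = torch.randn(2, 128, 512)                         # stand-in input embeddings
print(soft_prompt(x).shape)                          # torch.Size([2, 148, 512])
```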
    Discovering novel systemic biomarkers in external eye photos
    Boris Babenko
    Ilana Traynis
    Christina Chen
    Preeti Singh
    Akib Uddin
    Jorge Cuadros
    Lauren P. Daskivich
    April Y. Maa
    Ramasamy Kim
    Eugene Yu-Chuan Kang
    Lily Peng
    Christopher Semturs
    Jonathan Krause
    Avinash Varadarajan
    Naama Hammel
    The Lancet Digital Health (2023)
    Abstract: Background: Photographs of the external eye were recently shown to reveal signs of diabetic retinal disease and elevated glycated haemoglobin. This study aimed to test the hypothesis that external eye photographs contain information about additional systemic medical conditions. Methods: We developed a deep learning system (DLS) that takes external eye photographs as input and predicts systemic parameters, such as those related to the liver (albumin, aspartate aminotransferase [AST]); kidney (estimated glomerular filtration rate [eGFR], urine albumin-to-creatinine ratio [ACR]); bone or mineral (calcium); thyroid (thyroid stimulating hormone); and blood (haemoglobin, white blood cells [WBC], platelets). This DLS was trained using 123 130 images from 38 398 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA, USA. Evaluation focused on nine prespecified systemic parameters and leveraged three validation sets (A, B, C) spanning 25 510 patients with and without diabetes undergoing eye screening in three independent sites in Los Angeles county, CA, and the greater Atlanta area, GA, USA. We compared performance against baseline models incorporating available clinicodemographic variables (eg, age, sex, race and ethnicity, years with diabetes). Findings: Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST >36.0 U/L, calcium <8.6 mg/dL, eGFR <60.0 mL/min/1.73 m², haemoglobin <11.0 g/dL, platelets <150.0 × 10³/μL, ACR ≥300 mg/g, and WBC <4.0 × 10³/μL on validation set A (a population resembling the development datasets), with the area under the receiver operating characteristic curve (AUC) of the DLS exceeding that of the baseline by 5.3–19.9% (absolute differences in AUC). On validation sets B and C, with substantial patient population differences compared with the development datasets, the DLS outperformed the baseline for ACR ≥300.0 mg/g and haemoglobin <11.0 g/dL by 7.3–13.2%. Interpretation: We found further evidence that external eye photographs contain biomarkers spanning multiple organ systems. Such biomarkers could enable accessible and non-invasive screening of disease. Further work is needed to understand the translational implications.
    Abstract: Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods, followed by further evaluation on held-out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high-quality foundation models to enable further research across diverse applications.
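    As a concrete instance of the kind of standard SSL method evaluated here, a SimCLR-style contrastive (NT-Xent) loss over two augmented views of the same patches fits in a few lines. This is the generic objective, not the paper's specific recipe:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: [batch, dim] embeddings of two augmented views of the same patches."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2B, dim], unit norm
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))           # exclude self-similarity
    # The positive for index i is i + n (and vice versa); treat as classification.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
print(float(loss))
```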
    Enhancing diagnostic accuracy of medical AI systems via selective deferral to clinicians
    Dj Dvijotham
    Melih Barsbey
    Sumedh Ghaisas
    Robert Stanforth
    Nick Pawlowski
    Patricia Strachan
    Zahra Ahmed
    Yoram Bachrach
    Laura Culp
    Mayank Daswani
    Jan Freyberg
    Atilla Kiraly
    Timo Kohlberger
    Scott Mayer McKinney
    Basil Mustafa
    Krzysztof Geras
    Jan Witowski
    Zhi Zhen Qin
    Jacob Creswell
    Shravya Shetty
    Terry Spitz
    Taylan Cemgil
    Nature Medicine (2023)
    Abstract: AI systems trained using deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings. While these results are impressive, they do not accurately reflect the impact of deployment of such systems in a clinical context. Due to the safety-critical nature of this domain and the fact that AI systems are not perfect and can make inaccurate assessments, they are predominantly deployed as assistive tools for clinical experts. Although clinicians routinely discuss the diagnostic nuances of medical images with each other, weighing human diagnostic confidence against that of an AI system remains a major unsolved barrier to collaborative decision-making. Furthermore, it has been observed that diagnostic AI models have complementary strengths and weaknesses compared with clinical experts. Yet complementarity and the assessment of relative confidence between the members of a diagnostic team have remained largely unexploited in how AI systems are currently used in medical settings. In this paper, we study the behavior of a team composed of diagnostic AI model(s) and clinician(s) in diagnosing disease. To go beyond the performance level of a standalone AI system, we develop a novel selective deferral algorithm that can learn to decide when to rely on a diagnostic AI model and when to defer to a clinical expert. Using this algorithm, we demonstrate that the composite AI+human system has enhanced accuracy (both sensitivity and specificity) relative to a human-only or an AI-only baseline. We decouple the development of the deferral AI model from training of the underlying diagnostic AI model(s). Development of the deferral AI model requires only (i) the predictions of a model(s) on a tuning set of medical images (separate from the diagnostic AI models' training data), (ii) the diagnoses made by clinicians on these images, and (iii) the ground-truth disease labels corresponding to those images. Our extensive analysis shows that the selective deferral (SD) system exceeds the performance of either clinicians or AI alone in multiple clinical settings: breast and lung cancer screening. For breast cancer screening, double-reading with arbitration (two readers interpreting each mammogram, invoking an arbitrator if needed) is a "gold standard" for performance, never previously exceeded using AI. The SD system exceeds the accuracy of double-reading with arbitration in a large representative UK screening program (25% reduction in false positives despite equivalent true-positive detection and 66% reduction in the requirement for clinicians to read an image), as well as exceeding the performance of a standalone state-of-the-art AI system (40% reduction in false positives with equivalent detection of true positives). In a large US dataset the SD system exceeds the accuracy of single-reading by board-certified radiologists and a standalone state-of-the-art AI system (32% reduction in false positives despite equivalent detection of true positives and 55% reduction in the clinician workload required). The SD system further outperforms both clinical experts alone and AI alone for the detection of lung cancer in low-dose computed tomography images from a large national screening study, with an 11% reduction in false positives while maintaining sensitivity, given a 93% reduction in clinician workload required. Furthermore, the SD system allows controllable trade-offs between sensitivity and specificity and can be tuned to target either specificity or sensitivity as desired for a particular clinical application, or a combination of both. The system generalizes to multiple distribution shifts, retaining superiority to both the AI system alone and human experts alone. We demonstrate that the SD system retains performance gains even on clinicians not present in the training data for the deferral AI. Furthermore, we test the SD system on a new population where the standalone AI system's performance significantly degrades. We showcase the few-shot adaptation capability of the SD system by demonstrating that it can achieve superiority to both the standalone AI system and the clinician on the new population after being trained on only 40 cases from the new population. Our comprehensive assessment demonstrates that a selective deferral system could significantly improve clinical outcomes in multiple medical imaging applications, paving the way for higher-performance clinical AI systems that can leverage the complementarity between clinical experts and medical AI tools.
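    The deferral idea above can be sketched with a toy gate: train a small model on a tuning set to predict when the diagnostic AI is correct, then route low-confidence cases to the clinician. Everything below (features, threshold, data) is an invented illustration of the general pattern, not the paper's algorithm:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
ai_score = rng.random(n)                   # diagnostic AI's predicted probability
clin_pred = rng.integers(0, 2, n)          # clinician's read on the same cases
truth = rng.integers(0, 2, n)              # ground-truth labels (tuning set)

# Train the gate to predict "AI is correct on this case" from its score.
ai_pred = (ai_score > 0.5).astype(int)
ai_correct = (ai_pred == truth).astype(int)
features = np.column_stack([ai_score, np.abs(ai_score - 0.5)])
gate = LogisticRegression().fit(features, ai_correct)

# At deployment: rely on the AI only when the gate is confident, else defer.
p_ai_correct = gate.predict_proba(features)[:, 1]
final = np.where(p_ai_correct > 0.7, ai_pred, clin_pred)
print("deferral rate:", float((p_ai_correct <= 0.7).mean()))
```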
    Lessons learned from translating AI from development to deployment in healthcare
    Sunny Virmani
    Jonathan Krause
    Jay Nayar
    Elin Rønby Pedersen
    Divleen Jeji
    Naama Hammel
    Lily Peng
    Nature Medicine (2023)
    Abstract: The application of an artificial intelligence (AI)-based screening tool for retinal disease in India and Thailand highlighted the myths and reality of introducing medical AI, which may form a framework for subsequent tools.
    Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging
    Laura Anne Culp
    Jan Freyberg
    Basil Mustafa
    Sebastien Baur
    Simon Kornblith
    Ting Chen
    Patricia MacWilliams
    Sara Mahdavi
    Boris Babenko
    Megan Zoë Walker
    Aaron Loh
    Cameron Chen
    Yuan Liu
    Scott Mayer McKinney
    Zach William Beaver
    Fiona Keleher Ryan
    Mozziyar Etemadi
    Umesh Telang
    Lily Hao Yi Peng
    Geoffrey Everest Hinton
    Mohammad Norouzi
    Nature Biomedical Engineering (2023)
    Abstract: Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such ‘out-of-distribution’ performance problems and improves model robustness and training efficiency. The strategy, which we named REMEDIS (for ‘Robust and Efficient Medical Imaging with Self-supervision’), combines large-scale supervised transfer learning on natural images with intermediate contrastive self-supervised learning on medical images, and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1–33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.
    Predicting lymph node metastasis from primary tumor histology and clinicopathologic factors in colorectal cancer using deep learning
    Fraser Tan
    Isabelle Flament-Auvigne
    Trissia Brown
    Markus Plass
    Robert Reihs
    Heimo Mueller
    Kurt Zatloukal
    Pema Richeson
    Lily Peng
    Craig Mermel
    Cameron Chen
    Saurabh Gombar
    Thomas Montine
    Jeanne Shen
    Nature Communications Medicine, vol. 3 (2023), pp. 59
    Abstract: Background: Presence of lymph node metastasis (LNM) influences prognosis and clinical decision-making in colorectal cancer. However, detection of LNM is variable and depends on a number of external factors. Deep learning has shown success in computational pathology, but has struggled to boost performance when combined with known predictors. Methods: Machine-learned features are created by clustering deep learning embeddings of small patches of tumor in colorectal cancer via k-means, and then selecting the top clusters that add predictive value to a logistic regression model when combined with known baseline clinicopathologic variables. We then analyze the performance of logistic regression models trained with and without these machine-learned features in combination with the baseline variables. Results: The machine-learned extracted features provide independent signal for the presence of LNM (AUROC: 0.638, 95% CI: [0.590, 0.683]). Furthermore, the machine-learned features add predictive value to the set of 6 clinicopathologic variables in an external validation set (likelihood ratio test, p < 0.00032; AUROC: 0.740, 95% CI: [0.701, 0.780]). A model incorporating these features can also further risk-stratify patients with and without identified metastasis (p < 0.001 for both stage II and stage III). Conclusion: This work demonstrates an effective approach to combining deep learning with established clinicopathologic factors in order to identify independently informative features associated with LNM. Further work building on these specific results may have important impact in prognostication and therapeutic decision-making for LNM. Additionally, this general computational approach may prove useful in other contexts.
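    The feature-construction recipe described, clustering patch embeddings with k-means and feeding per-case cluster summaries into a logistic regression alongside baseline variables, can be sketched as follows (all data and dimensions here are synthetic placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_cases, patches_per_case, dim, k = 200, 50, 64, 10

# Stand-in deep-learning embeddings for small tumor patches, grouped by case.
patch_emb = rng.normal(size=(n_cases, patches_per_case, dim))
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(patch_emb.reshape(-1, dim))

# Per-case feature: fraction of its patches falling in each cluster.
labels = km.labels_.reshape(n_cases, patches_per_case)
cluster_fractions = np.stack([(labels == c).mean(axis=1) for c in range(k)], axis=1)

baseline = rng.normal(size=(n_cases, 6))   # stand-in clinicopathologic variables
X = np.hstack([baseline, cluster_fractions])
y = rng.integers(0, 2, n_cases)            # toy LNM present/absent labels
print(LogisticRegression(max_iter=1000).fit(X, y).score(X, y))
```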
    Pathologist Validation of a Machine Learning–Derived Feature for Colon Cancer Risk Stratification
    Vincenzo L’Imperio
    Markus Plass
    Heimo Müller
    Nicolò Tamini
    Luca Gianotti
    Nicola Zucchini
    Robert Reihs
    Lily Peng
    Cameron Chen
    Marialuisa Lavitrano
    David F. Steiner
    Kurt Zatloukal
    Fabio Pagni
    JAMA Network Open (2023)
    Abstract: Importance: Identifying new prognostic features in colon cancer has the potential to refine histopathologic review and inform patient care. Although prognostic artificial intelligence systems have recently demonstrated significant risk stratification for several cancer types, studies have not yet shown that the machine learning–derived features associated with these prognostic artificial intelligence systems are both interpretable and usable by pathologists. Objective: To evaluate whether pathologist scoring of a histopathologic feature previously identified by machine learning is associated with survival among patients with colon cancer. Design, Setting, and Participants: This prognostic study used deidentified, archived colorectal cancer cases from January 2013 to December 2015 from the University of Milano-Bicocca. All available histologic slides from 258 consecutive colon adenocarcinoma cases were reviewed from December 2021 to February 2022 by 2 pathologists, who conducted semiquantitative scoring for tumor adipose feature (TAF), which was previously identified via a prognostic deep learning model developed with an independent colorectal cancer cohort. Main Outcomes and Measures: Prognostic value of TAF for overall survival and disease-specific survival as measured by univariable and multivariable regression analyses. Interpathologist agreement in TAF scoring was also evaluated. Results: A total of 258 colon adenocarcinoma histopathologic cases from 258 patients (138 men [53%]; median age, 67 years [IQR, 65-81 years]) with stage II (n = 119) or stage III (n = 139) cancer were included. Tumor adipose feature was identified in 120 cases (widespread in 63 cases, multifocal in 31, and unifocal in 26). For overall survival analysis after adjustment for tumor stage, TAF was independently prognostic in 2 ways: TAF as a binary feature (presence vs absence: hazard ratio [HR] for presence of TAF, 1.55 [95% CI, 1.07-2.25]; P = .02) and TAF as a semiquantitative categorical feature (HR for widespread TAF, 1.87 [95% CI, 1.23-2.85]; P = .004). Interpathologist agreement for widespread TAF vs lower categories (absent, unifocal, or multifocal) was 90%, corresponding to a κ metric at this threshold of 0.69 (95% CI, 0.58-0.80). Conclusions and Relevance: In this prognostic study, pathologists were able to learn and reproducibly score for TAF, providing significant risk stratification on this independent data set. Although additional work is warranted to understand the biological significance of this feature and to establish broadly reproducible TAF scoring, this work represents the first validation to date of human expert learning from machine learning in pathology. Specifically, this validation demonstrates that a computationally identified histologic feature can represent a human-identifiable, prognostic feature with the potential for integration into pathology practice.
    Multimodal LLMs for health grounded in individual-specific data
    Anastasiya Belyaeva
    Krish Eswaran
    Shravya Shetty
    Andrew Carroll
    Nick Furlotte
    ICML Workshop on Machine Learning for Multimodal Healthcare Data (2023)
    Abstract: Large language models (LLMs) have shown an impressive ability to solve tasks in a wide range of fields, including health. Within the health domain, there are many data modalities that are relevant to an individual's health status. To effectively solve tasks related to individual health, LLMs will need the ability to use a diverse set of features as context. However, the best way to encode and inject complex high-dimensional features into the input stream of an LLM remains an active area of research. Here, we explore the ability of a foundation LLM to estimate disease risk given health-related input features. First, we evaluate serialization of structured individual-level health data into text, along with in-context learning and prompt tuning approaches. We find that the LLM performs better than random in the zero-shot and few-shot cases, and has comparable and often equivalent performance to baseline after prompt tuning. Next, we propose a way to encode complex non-text data modalities into the token embedding space and then use this encoding to construct multimodal sentences. We show that this multimodal LLM achieves better or equivalent performance compared to baseline models. Overall, our results show the potential of multimodal LLMs grounded in individual health data to solve complex tasks such as risk prediction.
    Risk Stratification for Diabetic Retinopathy Screening Order Using Deep Learning: A Multicenter Prospective Study
    Ashish Bora
    Sunny Virmani
    Rayman Huang
    Ilana Traynis
    Lily Peng
    Avinash Varadarajan
    Warisara Pattanapongpaiboon
    Reena Chopra
    Paisan Raumviboonsuk
    Translational Vision Science & Technology (2023)
    Abstract: Purpose: Real-world evaluation of a deep learning model that prioritizes patients based on risk of progression to moderate or worse (MOD+) diabetic retinopathy (DR). Methods: This nonrandomized, single-arm, prospective, interventional study included patients attending DR screening at four centers across Thailand from September 2019 to January 2020, with mild or no DR. Fundus photographs were input into the model, and patients were scheduled for their subsequent screening from September 2020 to January 2021 in order of predicted risk. Evaluation focused on model sensitivity, defined as correctly ranking patients that developed MOD+ within the first 50% of subsequent screens. Results: We analyzed 1,757 patients, of whom 52 (3.0%) developed MOD+. Using the model-proposed order, the model's sensitivity was 90.4%. Both the model-proposed order and mild/no DR plus HbA1c had significantly higher sensitivity than the random order (P < 0.001). Excluding one major (rural) site that had practical implementation challenges, the remaining sites included 567 patients, of whom 15 (2.6%) developed MOD+. Here, the model-proposed order achieved 86.7% versus 73.3% for the ranking that used DR grade and hemoglobin A1c. Conclusions: The model can help prioritize follow-up visits for the largest subgroups of DR patients (those with no or mild DR). Further research is needed to evaluate the impact on clinical management and outcomes. Translational Relevance: Deep learning demonstrated potential for risk stratification in DR screening. However, real-world practicalities must be resolved to fully realize the benefit.
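    The headline metric here, sensitivity of the risk-based ordering, is simply the fraction of progressing patients ranked into the first half of follow-up screens. A tiny sketch with made-up numbers:

```python
import numpy as np

risk = np.array([0.9, 0.2, 0.7, 0.1, 0.6, 0.3])   # model-predicted progression risk
progressed = np.array([1, 0, 0, 0, 1, 1], dtype=bool)  # who actually developed MOD+

order = np.argsort(-risk)                          # highest risk screened first
first_half = order[: len(order) // 2]
sensitivity = progressed[first_half].sum() / progressed.sum()
print(f"ranking sensitivity: {sensitivity:.1%}")   # 2 of 3 progressors in first half
```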
    Race- and Ethnicity-Stratified Analysis of an Artificial Intelligence–Based Tool for Skin Condition Diagnosis by Primary Care Physicians and Nurse Practitioners
    Ayush Jain
    David Way
    Vishakha Gupta
    Yi Gao
    Guilherme De Oliveira Marinho
    Jay David Hartford
    Rory Abbott Sayres
    Kimberly Kanada
    Clara Eng
    Kunal Nagpal
    Lily Hao Yi Peng
    Carter Dunn
    Susan Jen Huang
    Peggy Bui
    Yuan Liu
    (2022)
    Abstract: Background: Many dermatologic cases are first evaluated by primary care physicians or nurse practitioners. Objective: This study aimed to evaluate an artificial intelligence (AI)-based tool that assists with interpreting dermatologic conditions. Methods: We developed an AI-based tool and conducted a randomized multi-reader, multi-case study (20 primary care physicians, 20 nurse practitioners, and 1047 retrospective teledermatology cases) to evaluate its utility. Cases were enriched and comprised 120 skin conditions. Readers were recruited to optimize for geographical diversity; the primary care physicians practiced across 12 states (2-32 years of experience, mean 11.3 years), and the nurse practitioners practiced across 9 states (2-34 years of experience, mean 13.1 years). To avoid memory effects from incomplete washout, each case was read once by each clinician, either with or without AI assistance, with the assignment randomized. The primary analyses evaluated the top-1 agreement, defined as the agreement rate of the clinicians' primary diagnosis with the reference diagnoses provided by a panel of dermatologists (per case: 3 dermatologists from a pool of 12, practicing across 8 states, with 5-13 years of experience, mean 7.2 years). We additionally conducted subgroup analyses stratified by cases' self-reported race and ethnicity and measured the performance spread: the maximum performance minus the minimum across subgroups. Results: The AI's standalone top-1 agreement was 63%, and AI assistance was significantly associated with higher agreement with reference diagnoses. For primary care physicians, the increase in diagnostic agreement was 10% (P<.001), from 48% to 58%; for nurse practitioners, the increase was 12% (P<.001), from 46% to 58%. When stratified by cases' self-reported race or ethnicity, the AI's performance was 59%-62% for Asian, Native Hawaiian, Pacific Islander, other, and Hispanic or Latinx individuals, and 67% for both Black or African American and White subgroups. For the clinicians, AI assistance–associated improvements across subgroups were in the range of 8%-12% for primary care physicians and 8%-15% for nurse practitioners. The performance spread across subgroups was 5.3% unassisted vs 6.6% assisted for primary care physicians, and 5.2% unassisted vs 6.0% assisted for nurse practitioners. In both unassisted and AI-assisted modalities, and for both primary care physicians and nurse practitioners, the subgroup with the highest performance on average was Black or African American individuals, though the differences with other subgroups were small and had overlapping 95% CIs. Conclusions: AI assistance was associated with significantly improved diagnostic agreement with dermatologists. Across race and ethnicity subgroups, for both primary care physicians and nurse practitioners, the effect of AI assistance remained high at 8%-15%, and the performance spread was similar at 5%-7%.
    Deep learning to detect optical coherence tomography-derived diabetic macular edema from retinal photographs: a multicenter validation study
    Xinle Sheila Liu
    Tayyeba Ali
    Preeti Singh
    Ami Shah
    Scott Mayer McKinney
    Paisan Ruamviboonsuk
    Angus W. Turner
    Pearse A. Keane
    Peranut Chotcomwongse
    Variya Nganthavee
    Mark Chia
    Josef Huemer
    Jorge Cuadros
    Rajiv Raman
    Lily Hao Yi Peng
    Naama Hammel
    Avinash Vaidyanathan Varadarajan
    Reena Chopra
    Ophthalmology Retina (2022)
    Abstract: Purpose: To validate the generalizability of a deep learning system (DLS) that detects diabetic macular edema (DME) from two-dimensional color fundus photography (CFP), where the reference standard for retinal thickness and fluid presence is derived from three-dimensional optical coherence tomography (OCT). Design: Retrospective validation of a DLS across international datasets. Participants: Paired CFP and OCT of patients from diabetic retinopathy (DR) screening programs or retina clinics. The DLS was developed using datasets from Thailand, the United Kingdom (UK), and the United States, and validated using 3,060 unique eyes from 1,582 patients across screening populations in Australia, India, and Thailand. The DLS was separately validated in 698 eyes from 537 screened patients in the UK with mild DR and suspicion of DME based on CFP. Methods: The DLS was trained using DME labels from OCT. Presence of DME was based on retinal thickening or intraretinal fluid. The DLS's performance was compared to expert grades of maculopathy and to a previous proof-of-concept version of the DLS. We further simulated integration of the current DLS into an algorithm trained to detect DR from CFPs. Main Outcome Measures: Superiority of specificity and non-inferiority of sensitivity of the DLS for the detection of center-involving DME, using device-specific thresholds, compared to experts. Results: Primary analysis in a combined dataset spanning Australia, India, and Thailand showed the DLS had 80% specificity and 81% sensitivity, compared to expert graders who had 59% specificity and 70% sensitivity. Relative to human experts, the DLS had significantly higher specificity (p=0.008) and non-inferior sensitivity (p<0.001). In the UK dataset the DLS had a specificity of 80% (p<0.001 for specificity > 50%) and a sensitivity of 100% (p=0.02 for sensitivity > 90%). Conclusions: The DLS can generalize to multiple international populations with an accuracy exceeding that of experts. The clinical value of this DLS to reduce false positive referrals, and thus decrease the burden on specialist eye care, warrants prospective evaluation.
    Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge
    Wouter Bulten
    Kimmo Kartasalo
    Po-Hsuan Cameron Chen
    Peter Ström
    Hans Pinckaers
    Kunal Nagpal
    Yuannan Cai
    Hester van Boven
    Robert Vink
    Christina Hulsbergen-van de Kaa
    Jeroen van der Laak
    Mahul B. Amin
    Andrew J. Evans
    Theodorus van der Kwast
    Robert Allan
    Peter A. Humphrey
    Henrik Grönberg
    Hemamali Samaratunga
    Brett Delahunt
    Toyonori Tsuzuki
    Tomi Häkkinen
    Lars Egevad
    Maggie Demkin
    Sohier Dane
    Fraser Tan
    Masi Valkonen
    Lily Peng
    Craig H. Mermel
    Pekka Ruusuvuori
    Geert Litjens
    Martin Eklund
    the PANDA challenge consortium
    Nature Medicine, vol. 28 (2022), pp. 154-163
    Preview abstract Artificial intelligence (AI) has shown promise for diagnosing prostate cancer in biopsies. However, results have been limited to individual studies, lacking validation in multinational settings. Competitions have been shown to be accelerators for medical imaging innovations, but their impact is hindered by lack of reproducibility and independent validation. With this in mind, we organized the PANDA challenge—the largest histopathology competition to date, joined by 1,290 developers—to catalyze development of reproducible AI algorithms for Gleason grading using 10,616 digitized prostate biopsies. We validated that a diverse set of submitted algorithms reached pathologist-level performance on independent cross-continental cohorts, fully blinded to the algorithm developers. On United States and European external validation sets, the algorithms achieved agreements of 0.862 (quadratically weighted κ, 95% confidence interval (CI), 0.840–0.884) and 0.868 (95% CI, 0.835–0.900) with expert uropathologists. Successful generalization across different patient populations, laboratories and reference standards, achieved by a variety of algorithmic approaches, warrants evaluating AI-based Gleason grading in prospective clinical trials. View details
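The headline agreement statistic here is the quadratically weighted Cohen's kappa. As a rough illustration of how that metric is computed (using scikit-learn, with made-up grade groups rather than PANDA data):

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative Gleason grade groups (0 = benign, 1-5 = GG1-GG5) assigned to the
# same hypothetical biopsies by an algorithm and a uropathologist.
algorithm = [0, 1, 2, 2, 3, 5, 4, 1]
pathologist = [0, 1, 2, 3, 3, 5, 5, 1]

# Quadratic weights penalize large grade disagreements more than small ones,
# matching the agreement metric reported in the PANDA validation.
kappa = cohen_kappa_score(algorithm, pathologist, weights="quadratic")
print(f"quadratically weighted kappa = {kappa:.3f}")
```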
    Artificial intelligence for phase recognition in complex laparoscopic cholecystectomy
    Tomer Golany
    Amit Aides
    Daniel Freedman
    Nadav Avraham Rabani
    Wisam Khoury
    Hanoch Kashtan
    Petachia Reissman
    Surgical Endoscopy (2022)
Preview abstract Background: The potential role and benefits of AI in surgery have yet to be determined. This study is a first step in developing an AI system for minimizing adverse events and improving patient safety. We developed an Artificial Intelligence (AI) algorithm and evaluated its performance in recognizing surgical phases of laparoscopic cholecystectomy (LC) videos spanning a range of complexities. Methods: A set of 371 LC videos with various complexity levels and containing adverse events was collected from five hospitals. Two expert surgeons segmented each video into 10 phases including Calot’s triangle dissection and clipping and cutting. For each video, adverse events were annotated when present (major bleeding; gallbladder perforation; major bile leakage; and incidental finding), and the complexity level (on a scale of 1–5) was recorded. The dataset was then split in an 80:20 ratio (294 and 77 videos), stratified by complexity, hospital, and adverse events to train and test the AI model, respectively. The AI-surgeon agreement was then compared to the agreement between surgeons. Results: The mean accuracy of the AI model for surgical phase recognition was 89% [95% CI 87.1%, 90.6%], comparable to the mean inter-annotator agreement of 90% [95% CI 89.4%, 90.5%]. The model’s accuracy was inversely associated with procedure complexity, decreasing from 92% (complexity level 1) to 88% (complexity level 3) to 81% (complexity level 5). Conclusion: The AI model successfully identified surgical phases in both simple and complex LC procedures. Further validation and system training are warranted to evaluate its potential applications, such as increasing patient safety during surgery. View details
    Retinal fundus photographs capture hemoglobin loss after blood donation
    Akinori Mitani
    Ilana Traynis
    Preeti Singh
    Lily Hao Yi Peng
    Avinash Vaidyanathan Varadarajan
    Naama Hammel
    medRxiv (2022)
Preview abstract Recently it was shown that blood hemoglobin concentration could be predicted from retinal fundus photographs by deep learning models. However, it is unclear whether the models were quantifying current blood hemoglobin level, or estimating based on subjects' pretest probability of having anemia. Here, we conducted an observational study with 14 volunteers who donated blood at an on-site blood drive held by the local blood center (ie, at which time approximately 10% of their blood was removed). When the deep learning model was applied to retinal fundus photographs taken before and after blood donation, it detected a decrease in blood hemoglobin concentration within each subject at 2-3 days after donation, suggesting that the model was quantifying subacute hemoglobin changes instead of predicting subjects' risk. Additional randomized or controlled studies can further validate this finding. View details
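The core analysis is a paired within-subject comparison of the model's output before and after donation. A minimal sketch of that kind of comparison, with entirely hypothetical hemoglobin values and a standard nonparametric paired test (the paper's exact statistics may differ):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-subject model outputs (g/dL) predicted from fundus photos
# before and 2-3 days after blood donation, for 14 volunteers.
rng = np.random.default_rng(0)
before = np.array([14.1, 13.8, 15.0, 12.9, 14.6, 13.5, 14.9,
                   13.2, 15.3, 14.0, 13.7, 14.4, 12.8, 14.2])
after = before - np.abs(rng.normal(0.7, 0.3, 14))  # simulated post-donation drop

# Paired nonparametric test of the within-subject decrease.
stat, p = wilcoxon(before, after, alternative="greater")
print(f"median drop = {np.median(before - after):.2f} g/dL, p = {p:.4f}")
```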
    Deep Learning Detection of Active Pulmonary Tuberculosis at Chest Radiography Matched the Clinical Performance of Radiologists
    Sahar Kazemzadeh
    Jin Yu
    Shahar Jamshy
    Rory Pilgrim
    Christina Chen
    Neeral Beladia
    Chuck Lau
    Scott Mayer McKinney
    Thad Hughes
    Atilla Peter Kiraly
    Sreenivasa Raju Kalidindi
    Monde Muyoyeta
    Jameson Malemela
    Ting Shih
    Lily Hao Yi Peng
    Kat Chou
    Cameron Chen
    Krish Eswaran
    Shravya Ramesh Shetty
    Radiology (2022)
Preview abstract Background: The World Health Organization (WHO) recommends chest radiography to facilitate tuberculosis (TB) screening. However, chest radiograph interpretation expertise remains limited in many regions. Purpose: To develop a deep learning system (DLS) to detect active pulmonary TB on chest radiographs and compare its performance to that of radiologists. Materials and Methods: A DLS was trained and tested using retrospective chest radiographs (acquired between 1996 and 2020) from 10 countries. To improve generalization, large-scale chest radiograph pretraining, attention pooling, and semisupervised learning (“noisy-student”) were incorporated. The DLS was evaluated in a four-country test set (China, India, the United States, and Zambia) and in a mining population in South Africa, with positive TB confirmed with microbiological tests or nucleic acid amplification testing (NAAT). The performance of the DLS was compared with that of 14 radiologists. The authors studied the efficacy of the DLS compared with that of nine radiologists using the Obuchowski-Rockette-Hillis procedure. Given WHO targets of 90% sensitivity and 70% specificity, the operating point of the DLS (0.45) was prespecified to favor sensitivity. Results: A total of 165 754 images in 22 284 subjects (mean age, 45 years; 21% female) were used for model development and testing. In the four-country test set (1236 subjects, 17% with active TB), the area under the receiver operating characteristic (ROC) curve of the DLS was higher than that of all nine India-based radiologists, at 0.89 (95% CI: 0.87, 0.91). Compared with these radiologists, at the prespecified operating point, the DLS sensitivity was higher (88% vs 75%, P < .001) and specificity was noninferior (79% vs 84%, P = .004). Trends were similar within other patient subgroups, in the South Africa data set, and across various TB-specific chest radiograph findings. In simulations, the use of the DLS to identify likely TB-positive chest radiographs for NAAT confirmation reduced the cost by 40%–80% per TB-positive patient detected. Conclusion: A deep learning method was found to be noninferior to radiologists for the determination of active tuberculosis on digital chest radiographs. View details
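Prespecifying an operating point that favors sensitivity, as done here against the WHO 90%/70% targets, amounts to picking a score threshold from the ROC curve. A minimal sketch on synthetic scores (the data and resulting threshold are illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic labels and scores; the point is the threshold-selection logic.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
scores = np.clip(labels * 0.3 + rng.normal(0.4, 0.2, 1000), 0, 1)

fpr, tpr, thresholds = roc_curve(labels, scores)

# Pick the first operating point whose sensitivity meets the 90% target,
# analogous to prespecifying a threshold that favors sensitivity.
op_idx = np.argmax(tpr >= 0.90)  # first index reaching the target
print(f"threshold = {thresholds[op_idx]:.2f}, "
      f"sensitivity = {tpr[op_idx]:.2f}, specificity = {1 - fpr[op_idx]:.2f}")
```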
    Using generative AI to investigate medical imagery models and datasets
    Boris Babenko
    Charles Lau
    Chloe Nichols
    Christopher Semturs
    Doron Yaya-Stupp
    Heather Cole-Lewis
    Ilana Traynis
    Naama Hammel
    https://arxiv.org/ (2022)
Preview abstract AI models have shown promise in performing many medical imaging tasks. However, our ability to explain what signals these models learn from the training data is severely lacking. Explanations are needed in order to increase the trust of doctors in AI-based models, especially in domains where AI prediction capabilities surpass those of humans. Moreover, such explanations could enable novel scientific discovery by uncovering signals in the data that aren’t yet known to experts. In this paper, we present a method for automatic visual explanations that can help achieve these goals by generating hypotheses of what visual signals in the images are correlated with the task. We propose the following 4 steps: (i) Train a classifier to perform a given task to assess whether the imagery indeed contains signals relevant to the task; (ii) Train a StyleGAN-based image generator with an architecture that enables guidance by the classifier (“StylEx”); (iii) Automatically detect and extract the top visual attributes that the classifier is sensitive to. Each of these attributes can then be independently modified for a set of images to generate counterfactual visualizations of those attributes (i.e. what that image would look like with the attribute increased or decreased); (iv) Present the discovered attributes and corresponding counterfactual visualizations to a multidisciplinary panel of experts to formulate hypotheses for the underlying mechanisms with consideration to social and structural determinants of health (e.g. whether the attributes correspond to known patho-physiological or socio-cultural phenomena, or could be novel discoveries) and stimulate future research. To demonstrate the broad applicability of our approach, we present results on eight prediction tasks across three medical imaging modalities – retinal fundus photographs, external eye photographs, and chest radiographs. We showcase examples where many of the automatically-learned attributes clearly capture clinically known features (e.g., types of cataract, enlarged heart), and demonstrate automatically-learned confounders that arise from factors beyond physiological mechanisms (e.g., chest X-ray underexposure is correlated with the classifier predicting abnormality, and eye makeup is correlated with the classifier predicting low hemoglobin levels). We further show that our method reveals a number of physiologically plausible novel attributes for future investigation (e.g., differences in the fundus associated with self-reported sex, which were previously unknown). While our approach is not able to discern causal pathways, the ability to generate hypotheses from the attribute visualizations has the potential to enable researchers to better understand, improve their assessment of, and extract new knowledge from AI-based models. Importantly, we highlight that attributes generated by our framework can capture phenomena beyond physiology or pathophysiology, reflecting the real world nature of healthcare delivery and socio-cultural factors, and hence multidisciplinary perspectives are critical in these investigations. Finally, we release code to enable researchers to train their own StylEx models and analyze their predictive tasks of interest, and use the methodology presented in this paper for responsible interpretation of the revealed attributes. View details
    Deep learning models for histologic grading of breast cancer and association with disease prognosis
    Trissia Brown
    Isabelle Flament
    Fraser Tan
    Yuannan Cai
    Kunal Nagpal
    Emad Rakha
    David J. Dabbs
    Niels Olson
    James H. Wren
    Elaine E. Thompson
    Erik Seetao
    Carrie Robinson
    Melissa Miao
    Fabien Beckers
    Lily Hao Yi Peng
    Craig Mermel
    Cameron Chen
    npj Breast Cancer (2022)
    Preview abstract Histologic grading of breast cancer involves review and scoring of three well-established morphologic features: mitotic count, nuclear pleomorphism, and tubule formation. Taken together, these features form the basis of the Nottingham Grading System which is used to inform breast cancer characterization and prognosis. In this study, we developed deep learning models to perform histologic scoring of all three components using digitized hematoxylin and eosin-stained slides containing invasive breast carcinoma. We then evaluated the prognostic potential of these models using an external test set and progression free interval as the primary outcome. The individual component models performed at or above published benchmarks for algorithm-based grading approaches and achieved high concordance rates in comparison to pathologist grading. Prognostic performance of histologic scoring provided by the deep learning-based grading was on par with that of pathologists performing review of matched slides. Additionally, by providing scores for each component feature, the deep-learning based approach provided the potential to identify the grading components contributing most to prognostic value. This may enable optimized prognostic models as well as opportunities to improve access to consistent grading and better understand the links between histologic features and clinical outcomes in breast cancer. View details
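The final step of the Nottingham Grading System is a deterministic mapping from the three component scores to an overall grade. The sketch below encodes that standard rule; the deep learning models in the paper predict the component scores, and this mapping is textbook convention rather than the paper's code:

```python
def nottingham_grade(mitotic: int, pleomorphism: int, tubule: int) -> int:
    """Combine the three component scores (each 1-3) into a Nottingham grade:
    total 3-5 -> Grade 1, 6-7 -> Grade 2, 8-9 -> Grade 3."""
    if not all(1 <= s <= 3 for s in (mitotic, pleomorphism, tubule)):
        raise ValueError("each component score must be 1, 2, or 3")
    total = mitotic + pleomorphism + tubule
    return 1 if total <= 5 else 2 if total <= 7 else 3

print(nottingham_grade(2, 3, 1))  # total 6 -> Grade 2
```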
    Prospective validation of smartphone-based heart rate and respiratory rate measurement algorithms
    Sean K Bae
    Yunus Emre
    Jonathan Wang
    Jiang Wu
    Mehr Kashyap
    Si-Hyuck Kang
    Liwen Chen
    Melissa Moran
    Julie Cannon
    Eric Steven Teasley
    Allen Chai
    Neal Wadhwa
    Alejandra Maciel
    Mike McConnell
    Shwetak Patel
    Jim Taylor
    Jiening Zhan
    Ming Po
    Nature Communications Medicine (2022)
    Preview abstract Background: Measuring vital signs plays a key role in both patient care and wellness, but can be challenging outside of medical settings due to the lack of specialized equipment. Methods: In this study, we prospectively evaluated smartphone camera-based techniques for measuring heart rate (HR) and respiratory rate (RR) for consumer wellness use. HR was measured by placing the finger over the rear-facing camera, while RR was measured via a video of the participants sitting still in front of the front-facing camera. Results: In the HR study of 95 participants (with a protocol that included both measurements at rest and post exercise), the mean absolute percent error (MAPE) ± standard deviation of the measurement was 1.6% ± 4.3%, which was significantly lower than the pre-specified goal of 5%. No significant differences in the MAPE were present across colorimeter-measured skin-tone subgroups: 1.8% ± 4.5% for very light to intermediate, 1.3% ± 3.3% for tan and brown, and 1.8% ± 4.9% for dark. In the RR study of 50 participants, the mean absolute error (MAE) was 0.78 ± 0.61 breaths/min, which was significantly lower than the pre-specified goal of 3 breaths/min. The MAE was low in both healthy participants (0.70 ± 0.67 breaths/min), and participants with chronic respiratory conditions (0.80 ± 0.60 breaths/min). Conclusions: These results validate the accuracy of our smartphone camera-based techniques to measure HR and RR across a range of pre-defined subgroups. View details
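For reference, the two error metrics used above are straightforward to compute. A minimal sketch with made-up readings (the study's full evaluation pipeline is not reproduced here):

```python
import numpy as np

def mape(pred, truth):
    """Mean absolute percent error, as used for the heart-rate evaluation."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.mean(np.abs(pred - truth) / truth) * 100

def mae(pred, truth):
    """Mean absolute error in breaths/min, as used for respiratory rate."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.mean(np.abs(pred - truth))

# Illustrative values only.
print(f"HR MAPE = {mape([72, 88, 120], [70, 90, 118]):.1f}%")
print(f"RR MAE  = {mae([14, 18, 22], [15, 17, 22]):.2f} breaths/min")
```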
    A mobile-optimized artificial intelligence system for gestational age and fetal malpresentation assessment
    Ryan Gomes
    Bellington Vwalika
    Chace Lee
    Angelica Willis
    Joan T. Price
    Christina Chen
    Margaret P. Kasaro
    James A. Taylor
    Elizabeth M. Stringer
    Scott Mayer McKinney
    Ntazana Sindano
    George Edward Dahl
    William Goodnight, III
    Justin Gilmer
    Benjamin H. Chi
    Charles Lau
    Terry Spitz
    Kris Liu
    Jonny Wong
    Rory Pilgrim
    Akib Uddin
    Lily Hao Yi Peng
    Kat Chou
    Jeffrey S. A. Stringer
    Shravya Ramesh Shetty
    Communications Medicine (2022)
    Preview abstract Background Fetal ultrasound is an important component of antenatal care, but shortage of adequately trained healthcare workers has limited its adoption in low-to-middle-income countries. This study investigated the use of artificial intelligence for fetal ultrasound in under-resourced settings. Methods Blind sweep ultrasounds, consisting of six freehand ultrasound sweeps, were collected by sonographers in the USA and Zambia, and novice operators in Zambia. We developed artificial intelligence (AI) models that used blind sweeps to predict gestational age (GA) and fetal malpresentation. AI GA estimates and standard fetal biometry estimates were compared to a previously established ground truth, and evaluated for difference in absolute error. Fetal malpresentation (non-cephalic vs cephalic) was compared to sonographer assessment. On-device AI model run-times were benchmarked on Android mobile phones. Results Here we show that GA estimation accuracy of the AI model is non-inferior to standard fetal biometry estimates (error difference -1.4 ± 4.5 days, 95% CI -1.8, -0.9, n=406). Non-inferiority is maintained when blind sweeps are acquired by novice operators performing only two of six sweep motion types. Fetal malpresentation AUC-ROC is 0.977 (95% CI, 0.949, 1.00, n=613), sonographers and novices have similar AUC-ROC. Software run-times on mobile phones for both diagnostic models are less than 3 seconds after completion of a sweep. Conclusions The gestational age model is non-inferior to the clinical standard and the fetal malpresentation model has high AUC-ROCs across operators and devices. Our AI models are able to run on-device, without internet connectivity, and provide feedback scores to assist in upleveling the capabilities of lightly trained ultrasound operators in low resource settings. View details
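The non-inferiority claim rests on the paired difference in absolute gestational-age error between the AI and standard biometry. A minimal sketch of a percentile-bootstrap confidence interval for that paired difference, using synthetic error distributions chosen only to echo the reported numbers:

```python
import numpy as np

# Hypothetical paired absolute errors (days) per participant: AI estimate vs
# standard fetal biometry, each against the established ground truth.
rng = np.random.default_rng(42)
ai_err = np.abs(rng.normal(7.0, 4.0, 406))
biometry_err = np.abs(rng.normal(8.4, 4.5, 406))

diff = ai_err - biometry_err  # negative values favor the AI model

# Percentile bootstrap of the mean paired difference; non-inferiority holds
# when the upper CI bound stays below the prespecified margin.
boot = [np.mean(rng.choice(diff, size=diff.size, replace=True))
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean diff = {diff.mean():.1f} days, 95% CI ({lo:.1f}, {hi:.1f})")
```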
    Detection of signs of disease in external photographs of the eyes via deep learning
    Boris Babenko
    Akinori Mitani
    Ilana Traynis
    Naho Kitade
    Preeti Singh
    April Maa
    Jorge Cuadros
    Lily Hao Yi Peng
    Avinash Vaidyanathan Varadarajan
    Naama Hammel
    Nature Biomedical Engineering (2022)
    Preview abstract Retinal fundus photographs can be used to detect a range of retinal conditions. Here we show that deep-learning models trained instead on external photographs of the eyes can be used to detect diabetic retinopathy (DR), diabetic macular oedema and poor blood glucose control. We developed the models using eye photographs from 145,832 patients with diabetes from 301 DR screening sites and evaluated the models on four tasks and four validation datasets with a total of 48,644 patients from 198 additional screening sites. For all four tasks, the predictive performance of the deep-learning models was significantly higher than the performance of logistic regression models using self-reported demographic and medical history data, and the predictions generalized to patients with dilated pupils, to patients from a different DR screening programme and to a general eye care programme that included diabetics and non-diabetics. We also explored the use of the deep-learning models for the detection of elevated lipid levels. The utility of external eye photographs for the diagnosis and management of diseases should be further validated with images from different cameras and patient populations. View details
    Detection of Elusive Polyps via a Large Scale AI System
    Dan Livovsky
    Danny Veikherman
    Tomer Golany
    Amit Aides
    Valentin Dashinsky
    Nadav Rabani
    David Ben Shimol
    Yochai Blau
    Ilan Moshe Shimshoni
    Ori Segol
    Eran Goldin
    Jesse Lachter
    Daniel Freedman
    Gastrointestinal Endoscopy (2021)
Preview abstract Colorectal cancer (CRC) is the second leading cause of cancer death worldwide resulting in an estimated 900,000 deaths per year. Colonoscopy is the gold standard for detection and removal of precancerous lesions, and has been amply shown to reduce mortality. However, the miss rate for polyps during colonoscopies is 22-28%, while 20-24% of the missed lesions are histologically confirmed adenomas. To address this shortcoming, we propose a polyp detection system based on deep learning, which can alert the operator in real-time to the presence and location of polyps during a colonoscopy. We dub the system DEEP^2: DEEP DEtection of Elusive Polyps. The DEEP^2 system was trained on 3,611 hours of colonoscopy videos derived from two sources, and was validated on a set comprising 1,393 hours of video, coming from a third, unrelated source. For the validation set, the ground truth labelling was provided by offline GI annotators, who were able to watch the video in slow-motion and pause/rewind as required; two or three such annotators examined each video. Overall, DEEP^2 achieves a sensitivity of 96.8% at 4.9 false alarms per video, which improves substantially on the current state of the art. These results are attained using a neural network architecture which is designed to provide fast computations, and can therefore run in real-time at greater than 30 frames per second. We further analyze the data by examining its performance on elusive polyps, those polyps which are particularly difficult for endoscopists to detect. First, we show that on fast polyps that are in the field of view for less than 5 seconds, DEEP^2 attains a sensitivity of 88.5%, compared to a sensitivity of 31.7% for the endoscopists performing the procedure. On even shorter duration polyps, those that are in the field of view for less than 2 seconds, the difference is even starker: DEEP^2 attains a sensitivity of 84.9% vs. 18.9% for the endoscopists. Second, we examine procedures which are apparently clean, in that no polyps are detected by either the performing endoscopist or the offline annotators. In these sequences, DEEP^2 is able to detect polyps -- not seen by either live endoscopists or offline annotators -- which were later verified to be real polyps: an average of 0.22 polyps per sequence, of which 0.10 are adenomas. Finally, a preliminary small clinical validation indicates that the system will be useful in practice: on 32 procedures, DEEP^2 discovered an average of 1.06 polyps per procedure that would have otherwise been missed by the GI performing the procedure. Future work will be needed to measure the clinical impact on a larger scale. View details
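For readers unfamiliar with the reporting convention, the operating characteristics above combine per-polyp sensitivity with false alarms per video. A trivial sketch of how such figures derive from raw counts (all counts hypothetical):

```python
def detection_metrics(true_positives, false_negatives, false_alarms, n_videos):
    """Per-polyp sensitivity and false alarms per video from raw counts."""
    sensitivity = true_positives / (true_positives + false_negatives)
    return sensitivity, false_alarms / n_videos

sens, fa = detection_metrics(true_positives=300, false_negatives=10,
                             false_alarms=490, n_videos=100)
print(f"sensitivity = {sens:.1%}, false alarms/video = {fa:.1f}")
```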
    Does Your Dermatology Classifier Know What It Doesn't Know? Detecting the Long-Tail of Unseen Conditions
    Aaron Loh
    Basil Mustafa
    Nick Pawlowski
    Jan Freyberg
    Yuan Liu
    Zach William Beaver
    Nam Vo
    Peggy Bui
    Samantha Winter
    Patricia MacWilliams
    Umesh Telang
    Taylan Cemgil
    Medical Imaging Analysis (2021)
Preview abstract Supervised deep learning models have proven to be highly effective in classification of dermatological conditions. These models rely on the availability of abundant labeled training examples. However, in the real world, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and therefore are clinically significant in aggregate. To avoid models generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect such infrequent conditions. These infrequent ‘outlier’ conditions are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between model train, validation, and test sets. Unlike most traditional OOD benchmarks which detect dataset distribution shift, we aim at detecting semantic differences, often referred to as near-OOD detection which is a more difficult task. We propose a novel hierarchical outlier detection (HOD) approach, which assigns multiple abstention classes for each training outlier class and jointly performs a coarse classification of inliers vs outliers, along with fine-grained classification of the individual classes. We demonstrate that the proposed HOD outperforms existing techniques for outlier exposure based OOD detection. We also use different state-of-the-art representation learning approaches (BiT-JFT, SimCLR, MICLe) to improve OOD performance and demonstrate the effectiveness of HOD loss for them. Further, we explore different ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best result. We also performed a subgroup analysis over conditions of varying risk levels and different skin types to investigate how OOD performance changes over each subgroup and demonstrated the gains of our framework in comparison to baselines. Furthermore, we go beyond traditional performance metrics and introduce a cost metric to approximate downstream clinical impact. We used this cost metric to compare the proposed method against the baseline, thereby making a stronger case for its effectiveness in real-world deployment scenarios. View details
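One simple way to turn an abstention-class design like HOD into an OOD score is to sum the probability mass the model assigns to the outlier classes. A minimal sketch in that spirit (the class layout and values are invented; this is not the paper's implementation):

```python
import numpy as np

def hod_style_ood_score(probs, outlier_class_ids):
    """OOD score in the spirit of hierarchical outlier detection: the total
    softmax probability mass assigned to the abstention/outlier classes.

    probs: (n_examples, n_classes) softmax outputs over inlier + outlier classes.
    """
    return probs[:, outlier_class_ids].sum(axis=1)

# Illustrative: 3 inlier classes (indices 0-2), 2 abstention classes (3-4).
probs = np.array([[0.70, 0.20, 0.05, 0.03, 0.02],   # confident inlier
                  [0.10, 0.10, 0.10, 0.40, 0.30]])  # likely outlier
print(hod_style_ood_score(probs, outlier_class_ids=[3, 4]))  # [0.05, 0.7]
```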
    Determining Breast Cancer Biomarker Status and Associated Morphological Features Using Deep Learning
    Paul Gamble
    Harry Wang
    Fraser Tan
    Melissa Moran
    Trissia Brown
    Isabelle Flament
    Emad A. Rakha
    Michael Toss
    David J. Dabbs
    Peter Regitnig
    Niels Olson
    James H. Wren
    Carrie Robinson
    Lily Peng
    Craig Mermel
    Cameron Chen
    Nature Communications Medicine (2021)
Preview abstract Background: Breast cancer management depends on biomarkers including estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 (ER/PR/HER2). Though existing scoring systems are widely used and well-validated, they can involve costly preparation and variable interpretation. Additionally, discordances between histology and expected biomarker findings can prompt repeat testing to address biological, interpretative, or technical reasons for unexpected results. Methods: We developed three independent deep learning systems (DLS) to directly predict ER/PR/HER2 status for both focal tissue regions (patches) and slides using hematoxylin-and-eosin-stained (H&E) images as input. Models were trained and evaluated using pathologist annotated slides from three data sources. Areas under the receiver operator characteristic curve (AUCs) were calculated for test sets at both a patch-level (>135 million patches, 181 slides) and slide-level (n = 3274 slides, 1249 cases, 37 sites). Interpretability analyses were performed using Testing with Concept Activation Vectors (TCAV), saliency analysis, and pathologist review of clustered patches. Results: The patch-level AUCs are 0.939 (95%CI 0.936–0.941), 0.938 (0.936–0.940), and 0.808 (0.802–0.813) for ER/PR/HER2, respectively. At the slide level, AUCs are 0.86 (95% CI 0.84–0.87), 0.75 (0.73–0.77), and 0.60 (0.56–0.64) for ER/PR/HER2, respectively. Interpretability analyses show known biomarker-histomorphology associations including associations of low-grade and lobular histology with ER/PR positivity, and increased inflammatory infiltrates with triple-negative staining. Conclusions: This study presents rapid breast cancer biomarker estimation from routine H&E slides and builds on prior advances by prioritizing interpretability of computationally learned features in the context of existing pathological knowledge. View details
    Deep learning for distinguishing normal versus abnormal chest radiographs and generalization to two unseen diseases tuberculosis and COVID-19
    Shahar Jamshy
    Charles Lau
    Eddie Santos
    Atilla Peter Kiraly
    Jie Yang
    Rory Pilgrim
    Sahar Kazemzadeh
    Jin Yu
    Lily Hao Yi Peng
    Krish Eswaran
    Neeral Beladia
    Cameron Chen
    Shravya Ramesh Shetty
    Scientific Reports (2021)
Preview abstract Chest radiography (CXR) is the most widely used thoracic clinical imaging modality and is crucial for guiding the management of cardiothoracic conditions. The detection of specific CXR findings has been the main focus of several artificial intelligence (AI) systems. However, the wide range of possible CXR abnormalities makes it impractical to detect every possible condition by building multiple separate systems, each of which detects one or more pre-specified conditions. In this work, we developed and evaluated an AI system to classify CXRs as normal or abnormal. For training and tuning the system, we used a de-identified dataset of 248,445 patients from a multi-city hospital network in India. To assess generalizability, we evaluated our system using 6 international datasets from India, China, and the United States. Of these datasets, 4 focused on diseases that the AI was not trained to detect: 2 datasets with tuberculosis and 2 datasets with coronavirus disease 2019. Our results suggest that the AI system trained using a large dataset containing a diverse array of CXR abnormalities generalizes to new patient populations and unseen diseases. In a simulated workflow where the AI system prioritized abnormal cases, the turnaround time for abnormal cases was reduced by 7–28%. These results represent an important step towards evaluating whether AI can be safely used to flag cases in a general setting where previously unseen abnormalities exist. Lastly, to facilitate the continued development of AI models for CXR, we release our collected labels for the publicly available dataset. View details
    Vaccine Search Patterns Provide Insights into Vaccination Intent
    Sean Malahy
    Keith Spangler
    Jessica Leibler
    Kevin J. Lane
    Shailesh Bavadekar
    Chaitanya Kamath
    Akim Kumok
    Yuantong Sun
    Tague Griffith
    Adam Boulanger
    Mark Young
    Charlotte Stanton
    Yael Mayer
    Karen Lee Smith
    Kat Chou
    Jonathan I. Levy
Adam A. Szpiro
    Evgeniy Gabrilovich
    Gregory A. Wellenius
    arXiv (2021), TBD
    Preview abstract Despite ample supply of COVID-19 vaccines, the proportion of fully vaccinated individuals remains suboptimal across much of the US. Rapid vaccination of additional people will prevent new infections among both the unvaccinated and the vaccinated, thus saving lives. With the rapid rollout of vaccination efforts this year, the internet has become a dominant source of information about COVID-19 vaccines, their safety and efficacy, and their availability. We sought to evaluate whether trends in internet searches related to COVID-19 vaccination - as reflected by Google's Vaccine Search Insights (VSI) index - could be used as a marker of population-level interest in receiving a vaccination. We found that between January and August of 2021: 1) Google's weekly VSI index was associated with the number of new vaccinations administered in the subsequent three weeks, and 2) the average VSI index in earlier months was strongly correlated (up to r = 0.89) with vaccination rates many months later. Given these results, we illustrate an approach by which data on search interest may be combined with other available data to inform local public health outreach and vaccination efforts. These results suggest that the VSI index may be useful as a leading indicator of population-level interest in or intent to obtain a COVID-19 vaccine, especially early in the vaccine deployment efforts. These results may be relevant to current efforts to administer COVID-19 vaccines to unvaccinated individuals, to newly eligible children, and to those eligible to receive a booster shot. More broadly, these results highlight the opportunities for anonymized and aggregated internet search data, available in near real-time, to inform the response to public health emergencies. View details
    A.I.-based Gleason Grading for Stratification of Prostate Cancer Outcomes
    Kunal Nagpal
    Matthew Symonds
    Melissa Moran
    Markus Plass
    Robert Reihs
    Farah Nader
    Fraser Tan
    Yuannan Cai
    Trissia Brown
    Isabelle Flament
    Mahul Amin
    Martin Stumpe
    Heimo Muller
    Peter Regitnig
    Andreas Holzinger
    Lily Hao Yi Peng
    Cameron Chen
    Kurt Zatloukal
    Craig Mermel
    Communications Medicine (2021)
    Preview abstract Background. Gleason grading of prostate cancer is an important prognostic factor, but suffers from poor reproducibility, particularly among non-subspecialist pathologists. Although artificial intelligence (A.I.) tools have demonstrated Gleason grading on-par with expert pathologists, it remains an open question whether and to what extent A.I. grading translates to better prognostication. Methods. In this study, we developed a system to predict prostate cancer-specific mortality via A.I.-based Gleason grading and subsequently evaluated its ability to risk-stratify patients on an independent retrospective cohort of 2807 prostatectomy cases from a single European center with 5–25 years of follow-up (median: 13, interquartile range 9–17). Results. Here, we show that the A.I.’s risk scores produced a C-index of 0.84 (95% CI 0.80–0.87) for prostate cancer-specific mortality. Upon discretizing these risk scores into risk groups analogous to pathologist Grade Groups (GG), the A.I. has a C-index of 0.82 (95% CI 0.78–0.85). On the subset of cases with a GG provided in the original pathology report (n = 1517), the A.I.’s C-indices are 0.87 and 0.85 for continuous and discrete grading, respectively, compared to 0.79 (95% CI 0.71–0.86) for GG obtained from the reports. These represent improvements of 0.08 (95% CI 0.01–0.15) and 0.07 (95% CI 0.00–0.14), respectively. Conclusions. Our results suggest that A.I.-based Gleason grading can lead to effective risk stratification, and warrants further evaluation for improving disease management. View details
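The C-index reported above measures how well the risk scores rank patients by survival time. A minimal sketch using the lifelines library on made-up follow-up data; note that concordance_index treats higher values as predicting longer survival, so risk scores are negated:

```python
from lifelines.utils import concordance_index

# Illustrative follow-up data: survival times (years), AI risk scores, and
# event indicators (1 = prostate-cancer-specific death observed).
times = [5, 12, 9, 20, 15, 3]
risk = [0.9, 0.2, 0.6, 0.1, 0.3, 0.8]
events = [1, 0, 1, 0, 1, 1]

# Negate the risk score so that higher predicted values mean longer survival.
c_index = concordance_index(times, [-r for r in risk], events)
print(f"C-index = {c_index:.2f}")
```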
    Development and Assessment of an Artificial Intelligence–Based Tool for Skin Condition Diagnosis by Primary Care Physicians and Nurse Practitioners in Teledermatology Practices
    Ayush Jain
    David Way
    Vishakha Gupta
    Yi Gao
    Guilherme De Oliveira Marinho
    Jay David Hartford
    Rory Abbott Sayres
    Kimberly Kanada
    Clara Eng
    Kunal Nagpal
    Lily Hao Yi Peng
    Carter Dunn
    Susan Jen Huang
    Peggy Bui
    Yuan Liu
    JAMA Network Open (2021)
Preview abstract Importance: Most dermatologic cases are initially evaluated by nondermatologists such as primary care physicians (PCPs) or nurse practitioners (NPs). Objective: To evaluate an artificial intelligence (AI)–based tool that assists with diagnoses of dermatologic conditions. Design, Setting, and Participants: This multiple-reader, multiple-case diagnostic study developed an AI-based tool and evaluated its utility. Primary care physicians and NPs retrospectively reviewed an enriched set of cases representing 120 different skin conditions. Randomization was used to ensure each clinician reviewed each case either with or without AI assistance; each clinician alternated between batches of 50 cases in each modality. The reviews occurred from February 21 to April 28, 2020. Data were analyzed from May 26, 2020, to January 27, 2021. Exposures: An AI-based assistive tool for interpreting clinical images and associated medical history. Main Outcomes and Measures: The primary analysis evaluated agreement with reference diagnoses provided by a panel of 3 dermatologists for PCPs and NPs. Secondary analyses included diagnostic accuracy for biopsy-confirmed cases, biopsy and referral rates, review time, and diagnostic confidence. Results: Forty board-certified clinicians, including 20 PCPs (14 women [70.0%]; mean experience, 11.3 [range, 2-32] years) and 20 NPs (18 women [90.0%]; mean experience, 13.1 [range, 2-34] years) reviewed 1048 retrospective cases (672 female [64.2%]; median age, 43 [interquartile range, 30-56] years; 41 920 total reviews) from a teledermatology practice serving 11 sites and provided 0 to 5 differential diagnoses per case (mean [SD], 1.6 [0.7]). The PCPs were located across 12 states, and the NPs practiced in primary care without physician supervision across 9 states. Artificial intelligence assistance was significantly associated with higher agreement with reference diagnoses. For PCPs, the increase in diagnostic agreement was 10% (95% CI, 8%-11%; P < .001), from 48% to 58%; for NPs, the increase was 12% (95% CI, 10%-14%; P < .001), from 46% to 58%. In secondary analyses, agreement with biopsy-obtained diagnosis categories of malignant, precancerous, or benign increased by 3% (95% CI, −1% to 7%) for PCPs and by 8% (95% CI, 3%-13%) for NPs. Rates of desire for biopsies decreased by 1% (95% CI, 0-3%) for PCPs and 2% (95% CI, 1%-3%) for NPs; the rate of desire for referrals decreased by 3% (95% CI, 1%-4%) for PCPs and NPs. Diagnostic agreement on cases not indicated for a dermatologist referral increased by 10% (95% CI, 8%-12%) for PCPs and 12% (95% CI, 10%-14%) for NPs, and median review time increased slightly by 5 (95% CI, 0-8) seconds for PCPs and 7 (95% CI, 5-10) seconds for NPs per case. Conclusions and Relevance: Artificial intelligence assistance was associated with improved diagnoses by PCPs and NPs for 1 in every 8 to 10 cases, indicating potential for improving the quality of dermatologic care. View details
    Machine learning for clinical operations improvement via case triaging
    Susan Jen Huang
    Kimberly Kanada
    Lily Hao Yi Peng
    Peggy Bui
    Yuan Liu
    Skin Health and Disease (2021)
    Preview abstract In recent years, an increasing number of machine learning (ML) models have been developed for interpreting images of skin conditions and for risk stratification. Beyond accurate image interpretation, one potential application of these interpretations may be triaging systems to help direct care to the right care provider at the right time. This is a critical need because dermatologist appointment wait times exceed a month in many regions, a trend that can potentially be alleviated by rapidly stratifying patients to clinicians with the appropriate level of training (e.g., board-certified dermatologist, advanced practice provider under dermatologist supervision, non-dermatologist) and the appropriate urgency. To help understand ML's potential for this triaging, we analysed a previously-described deep learning system (DLS) that provides a differential diagnosis of teledermatology cases and that improved the diagnostic accuracy of primary care physicians and nurse practitioners in a randomized study. We reordered the cases within each ‘review batch’ of 500 based on the urgency category of the DLS-predicted skin condition (which is an automated process requiring no human intervention). On average, this caused the review order of urgent cases to be prioritised substantially sooner than that of less urgent cases, with the average rank of ‘immediate intervention cases’ being about 100 (vs. 253 without reordering, p < 0.001), and that of ‘no need to see a doctor’ cases being close to 400 (vs. 252 without reordering, p < 0.001). Our approach has the potential to accelerate triaging and reduce the burden on the limited dermatology workforce to focus on patient management. View details
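The reordering step itself is simple: map each case's DLS-predicted condition to an urgency tier and stable-sort the batch. A minimal sketch with hypothetical tiers and case IDs (the urgency categories are paraphrased from the abstract, not taken from the study's code):

```python
# Lower tier number = more urgent; the tiers and labels are illustrative.
URGENCY_TIER = {"immediate intervention": 0, "see a doctor soon": 1,
                "routine": 2, "no need to see a doctor": 3}

def reorder_batch(cases):
    """Stable-sort a review batch so the most urgent cases are reviewed first.

    cases: list of (case_id, predicted_urgency_category) tuples.
    """
    return sorted(cases, key=lambda case: URGENCY_TIER[case[1]])

batch = [("c1", "routine"), ("c2", "immediate intervention"),
         ("c3", "no need to see a doctor"), ("c4", "see a doctor soon")]
print([cid for cid, _ in reorder_batch(batch)])  # ['c2', 'c4', 'c1', 'c3']
```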
    Interpretable Survival Prediction for Colorectal Cancer using Deep Learning
    Melissa Moran
    Markus Plass
    Robert Reihs
    Fraser Tan
    Isabelle Flament
    Trissia Brown
    Peter Regitnig
    Cameron Chen
    Apaar Sadhwani
    Bob MacDonald
    Benny Ayalew
    Lily Hao Yi Peng
    Heimo Mueller
    Zhaoyang Xu
    Martin Stumpe
    Kurt Zatloukal
    Craig Mermel
    npj Digital Medicine (2021)
    Preview abstract Deriving interpretable prognostic features from deep-learning-based prognostic histopathology models remains a challenge. In this study, we developed a deep learning system (DLS) for predicting disease-specific survival for stage II and III colorectal cancer using 3652 cases (27,300 slides). When evaluated on two validation datasets containing 1239 cases (9340 slides) and 738 cases (7140 slides), respectively, the DLS achieved a 5-year disease-specific survival AUC of 0.70 (95% CI: 0.66–0.73) and 0.69 (95% CI: 0.64–0.72), and added significant predictive value to a set of nine clinicopathologic features. To interpret the DLS, we explored the ability of different human-interpretable features to explain the variance in DLS scores. We observed that clinicopathologic features such as T-category, N-category, and grade explained a small fraction of the variance in DLS scores (R2 = 18% in both validation sets). Next, we generated human-interpretable histologic features by clustering embeddings from a deep-learning-based image-similarity model and showed that they explained the majority of the variance (R2 of 73–80%). Furthermore, the clustering-derived feature most strongly associated with high DLS scores was also highly prognostic in isolation. With a distinct visual appearance (poorly differentiated tumor cell clusters adjacent to adipose tissue), this feature was identified by annotators with 87.0–95.5% accuracy. Our approach can be used to explain predictions from a prognostic deep learning model and uncover potentially-novel prognostic features that can be reliably identified by people for future validation studies. View details
    Early social distancing policies in Europe, changes in mobility & COVID-19 case trajectories: Insights from Spring 2020
    Liana R. Woskie
    Jonathan Hennessy
    Valeria Espinosa
    Thomas Tsai
    Swapnil Vispute
    Ciro Cattuto
    Laetitia Gauvin
    Michele Tizzoni
    Krishna Gadepalli
    Adam Boulanger
    Adam Pearce
    Chaitanya Kamath
    Arran Schlosberg
    Charlotte Stanton
    Shailesh Bavadekar
    Matthew Abueg
    Michael Hogue
    Andrew Oplinger
    Katherine Chou
    Ashish K. Jha
    Greg Wellenius
    Evgeniy Gabrilovich
    PLOS ONE (2021)
Preview abstract Background: Social distancing policies have been widely used to mitigate community spread of SARS-CoV-2. We sought to quantify the impact of COVID-19 social distancing policies across 27 European countries in spring 2020 on population mobility and the subsequent trajectory of disease. Methods: We obtained data on national social distancing policies from the Oxford COVID-19 Government Response Tracker and aggregated and anonymized mobility data from Google. We used a pre-post comparison and two linear mixed-effects models to first assess the relationship between implementation of national policies and observed changes in mobility, and then to assess the relationship between changes in mobility and rates of COVID-19 infections in subsequent weeks. Results: Compared to a pre-COVID baseline, Spain saw the largest decrease in aggregate population mobility (~70%), as measured by the time spent away from residence, while Sweden saw the smallest decrease (~20%). The largest declines in mobility were associated with mandatory stay-at-home orders, followed by mandatory workplace closures, school closures, and non-mandatory workplace closures. While mandatory shelter-in-place orders were associated with 16.7% less mobility (95% CI: -23.7% to -9.7%), non-mandatory orders were only associated with an 8.4% decrease (95% CI: -14.9% to -1.8%). Large-gathering bans were associated with the smallest change in mobility compared with other policy types. Changes in mobility were in turn associated with changes in COVID-19 case growth. For example, a 10% decrease in time spent away from places of residence was associated with 11.8% (95% CI: 3.8%, 19.1%) fewer new COVID-19 cases. Discussion: This comprehensive evaluation across Europe suggests that mandatory stay-at-home orders and workplace closures had the largest impacts on population mobility and subsequent COVID-19 cases at the onset of the pandemic. With a better understanding of policies’ relative performance, countries can more effectively invest in, and target, early nonpharmacological interventions. View details
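The second stage of the analysis relates mobility changes to subsequent case growth with a linear mixed-effects model. A minimal structural sketch using statsmodels on a synthetic country-week panel; the column names, effect sizes, and single fixed effect are illustrative, and the study's actual models included additional terms:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel: 27 countries x 10 weeks, standing in for the real data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "country": np.repeat([f"c{i}" for i in range(27)], 10),
    "mobility_change": rng.normal(-30, 15, 270),
})
df["case_growth"] = 1.2 * df["mobility_change"] + rng.normal(0, 10, 270)

# Linear mixed-effects model with a random intercept per country, analogous
# in structure to relating mobility changes to subsequent case growth.
model = smf.mixedlm("case_growth ~ mobility_change", df, groups=df["country"])
print(model.fit().summary())
```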
    International evaluation of an AI system for breast cancer screening
    Scott Mayer McKinney
    Varun Yatindra Godbole
    Jonathan Godwin
    Natasha Antropova
    Hutan Ashrafian
    Trevor John Back
    Mary Chesus
    Ara Darzi
    Mozziyar Etemadi
    Florencia Garcia-Vicente
    Fiona J Gilbert
    Mark D Halling-Brown
    Demis Hassabis
    Sunny Jansen
    Dominic King
    David Melnick
    Hormuz Mostofi
    Lily Hao Yi Peng
    Joshua Reicher
    Bernardino Romera Paredes
    Richard Sidebottom
    Mustafa Suleyman
    Kenneth C. Young
    Jeffrey De Fauw
    Shravya Ramesh Shetty
    Nature (2020)
    Preview abstract Screening mammography aims to identify breast cancer at earlier stages of the disease, when treatment can be more successful. Despite the existence of screening programmes worldwide, the interpretation of mammograms is affected by high rates of false positives and false negatives. Here we present an artificial intelligence (AI) system that is capable of surpassing human experts in breast cancer prediction. To assess its performance in the clinical setting, we curated a large representative dataset from the UK and a large enriched dataset from the USA. We show an absolute reduction of 5.7% and 1.2% (USA and UK) in false positives and 9.4% and 2.7% in false negatives. We provide evidence of the ability of the system to generalize from the UK to the USA. In an independent study of six radiologists, the AI system outperformed all of the human readers: the area under the receiver operating characteristic curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the average radiologist by an absolute margin of 11.5%. We ran a simulation in which the AI system participated in the double-reading process that is used in the UK, and found that the AI system maintained non-inferior performance and reduced the workload of the second reader by 88%. This robust assessment of the AI system paves the way for clinical trials to improve the accuracy and efficiency of breast cancer screening. View details
    Detecting Deficient Coverage in Colonoscopies
    Amit Aides
    Ariel Gordon
    Daniel Freedman
    Danny Veikherman
    Ilan Moshe Shimshoni
    Tomer Golany
    Yochai Blau
    IEEE Transactions on Medical Imaging (2020)
    Preview abstract Colorectal Cancer (CRC) is a global health problem, resulting in 900K deaths per year. Colonoscopy is the tool of choice for preventing CRC, by detecting polyps before they become cancerous, and removing them. However, colonoscopy is hampered by the fact that endoscopists routinely miss an average of 22-28% of polyps. While some of these missed polyps appear in the endoscopist's field of view, others are missed simply because of substandard coverage of the procedure, i.e. not all of the colon is seen. This paper attempts to rectify the problem of substandard coverage in colonoscopy through the introduction of the C2D2 (Colonoscopy Coverage Deficiency via Depth) algorithm which detects deficient coverage, and can thereby alert the endoscopist to revisit a given area. More specifically, C2D2 consists of two separate algorithms: the first performs depth estimation of the colon given an ordinary RGB video stream; while the second computes coverage given these depth estimates. Rather than compute coverage for the entire colon, our algorithm computes coverage locally, on a segment-by-segment basis; C2D2 can then indicate in real-time whether a particular area of the colon has suffered from deficient coverage, and if so the endoscopist can return to that area. Our coverage algorithm is the first such algorithm to be evaluated in a large-scale way; while our depth estimation technique is the first calibration-free unsupervised method applied to colonoscopies. The C2D2 algorithm achieves state of the art results in the detection of deficient coverage: it is 2.4 times more accurate than human experts. View details
    Development and Validation of a Deep Learning Algorithm for Gleason Grading of Prostate Cancer From Biopsy Specimens
    Kunal Nagpal
    Davis Foote
    Fraser Tan
    Cameron Chen
    Naren Manoj
    Niels Olson
    Jenny Smith
    Arash Mohtashamian
    Brandon Peterson
    Mahul Amin
    Andrew Evans
    Joan Sweet
    Carol Cheung
    Theodorus van der Kwast
    Ankur Sangoi
    Ming Zhou
    Robert W. Allan
    Peter A Humphrey
    Jason Hipp
    Krishna Kumar Gadepalli
    Lily Hao Yi Peng
    Martin Stumpe
    Craig Mermel
    JAMA Oncology (2020)
    Preview abstract Importance: For prostate cancer, Gleason grading of the biopsy specimen plays a pivotal role in determining case management. However, Gleason grading is associated with substantial interobserver variability, resulting in a need for decision support tools to improve the reproducibility of Gleason grading in routine clinical practice. Objective: To evaluate the ability of a deep learning system (DLS) to grade diagnostic prostate biopsy specimens. Design, Setting, and Participants: The DLS was evaluated using 752 deidentified digitized images of formalin-fixed paraffin-embedded prostate needle core biopsy specimens obtained from 3 institutions in the United States, including 1 institution not used for DLS development. To obtain the Gleason grade group (GG), each specimen was first reviewed by 2 expert urologic subspecialists from a multi-institutional panel of 6 individuals (years of experience: mean, 25 years; range, 18-34 years). A third subspecialist reviewed discordant cases to arrive at a majority opinion. To reduce diagnostic uncertainty, all subspecialists had access to an immunohistochemical-stained section and 3 histologic sections for every biopsied specimen. Their review was conducted from December 2018 to June 2019. Main Outcomes and Measures: The frequency of the exact agreement of the DLS with the majority opinion of the subspecialists in categorizing each tumor-containing specimen as 1 of 5 categories: nontumor, GG1, GG2, GG3, or GG4-5. For comparison, the rate of agreement of 19 general pathologists’ opinions with the subspecialists’ majority opinions was also evaluated. Results: For grading tumor-containing biopsy specimens in the validation set (n = 498), the rate of agreement with subspecialists was significantly higher for the DLS (71.7%; 95% CI, 67.9%-75.3%) than for general pathologists (58.0%; 95% CI, 54.5%-61.4%) (P < .001). In subanalyses of biopsy specimens from an external validation set (n = 322), the Gleason grading performance of the DLS remained similar. For distinguishing nontumor from tumor-containing biopsy specimens (n = 752), the rate of agreement with subspecialists was 94.3% (95% CI, 92.4%-95.9%) for the DLS and similar at 94.7% (95% CI, 92.8%-96.3%) for general pathologists (P = .58). Conclusions and Relevance: In this study, the DLS showed higher proficiency than general pathologists at Gleason grading prostate needle core biopsy specimens and generalized to an independent institution. Future research is necessary to evaluate the potential utility of using the DLS as a decision support tool in clinical workflows and to improve the quality of prostate cancer grading for therapy decisions. View details
    Customization Scenarios for De-identification of Clinical Notes
    Danny Vainstein
    Gavin Edward Bee
    Jack Po
    Jutta Williams
    Kat Chou
    Ronit Yael Slyper
    Rony Amira
    Shlomo Hoory
    Tzvika Hartman
    BMC Medical Informatics and Decision Making (2020)
    Preview abstract Background: Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets. Objective: We present practical options for clinical note de-identification, assessing performance of machine learning systems ranging from off-the-shelf to fully customized. Methods: We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the Physionet Gold Standard corpus, and parts of the MIMIC-III dataset. Results: Fully customized systems remove 97-99% of personally identifying information. Performance of off-the-shelf systems varies by dataset, with performance mostly above 90%. Providing a small labeled dataset or large unlabeled dataset allows for fine-tuning that improves performance over off-the-shelf systems. Conclusion: Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level. View details
    Predicting the risk of developing diabetic retinopathy using deep learning
    Ashish Bora
    Siva Balasubramanian
    Boris Babenko
    Sunny Virmani
    Subhashini Venugopalan
    Akinori Mitani
    Guilherme De Oliveira Marinho
    Jorge Cuadros
    Dr. Paisan Raumviboonsuk
    Lily Hao Yi Peng
    Avinash Vaidyanathan Varadarajan
    Naama Hammel
    Lancet Digital Health (2020)
    Preview abstract Background: Diabetic retinopathy screening is instrumental to preventing blindness, but scaling up screening is challenging because of the increasing number of patients with all forms of diabetes. We aimed to create a deep-learning system to predict the risk of patients with diabetes developing diabetic retinopathy within 2 years. Methods: We created and validated two versions of a deep-learning system to predict the development of diabetic retinopathy in patients with diabetes who had had teleretinal diabetic retinopathy screening in a primary care setting. The input for the two versions was either a set of three-field or one-field colour fundus photographs. Of the 575 431 eyes in the development set 28 899 had known outcomes, with the remaining 546 532 eyes used to augment the training process via multitask learning. Validation was done on one eye (selected at random) per patient from two datasets: an internal validation (from EyePACS, a teleretinal screening service in the USA) set of 3678 eyes with known outcomes and an external validation (from Thailand) set of 2345 eyes with known outcomes. Findings: The three-field deep-learning system had an area under the receiver operating characteristic curve (AUC) of 0·79 (95% CI 0·77–0·81) in the internal validation set. Assessment of the external validation set—which contained only one-field colour fundus photographs—with the one-field deep-learning system gave an AUC of 0·70 (0·67–0·74). In the internal validation set, the AUC of available risk factors was 0·72 (0·68–0·76), which improved to 0·81 (0·77–0·84) after combining the deep-learning system with these risk factors (p<0·0001). In the external validation set, the corresponding AUC improved from 0·62 (0·58–0·66) to 0·71 (0·68–0·75; p<0·0001) following the addition of the deep-learning system to available risk factors. Interpretation: The deep-learning systems predicted diabetic retinopathy development using colour fundus photographs, and the systems were independent of and more informative than available risk factors. Such a risk stratification tool might help to optimise screening intervals to reduce costs while improving vision-related outcomes. View details
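Combining a deep-learning score with tabular risk factors, and comparing AUCs with and without it, can be sketched with a simple logistic regression. Everything below is synthetic and illustrates only the structure of that comparison, not the study's actual features or models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: a per-eye DLS score plus two hypothetical risk factors,
# with a simulated 2-year DR-development outcome.
rng = np.random.default_rng(1)
n = 5000
dls = rng.normal(0, 1, n)
hba1c = rng.normal(7.5, 1.2, n)
years_dm = rng.normal(10, 5, n)
logit = 1.2 * dls + 0.4 * (hba1c - 7.5) + 0.05 * years_dm - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([dls, hba1c, years_dm])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Risk factors alone vs risk factors plus the DLS score, mirroring the
# before/after AUC comparison reported above.
for cols, name in [([1, 2], "risk factors only"), ([0, 1, 2], "risk factors + DLS")]:
    clf = LogisticRegression().fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])
    print(f"{name}: AUC = {auc:.2f}")
```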
    Evaluation of the Use of Combined Artificial Intelligence and Pathologist Assessment to Review and Grade Prostate Biopsies
    Kunal Nagpal
    Rory Abbott Sayres
    Davis J. Foote
    Adam Pearce
    Samantha Winter
    Matthew Symonds
    Liron Yatziv
    Trissia Brown
    Isabelle Flament-Auvigne
    Fraser Tan
    Martin C. Stumpe
    Cameron Chen
    Craig Mermel
    JAMA Network Open (2020)
    Preview abstract Importance: Expert-level artificial intelligence (AI) algorithms for prostate biopsy grading have recently been developed. However, the potential impact of integrating such algorithms into pathologist workflows remains largely unexplored. Objective: To evaluate an expert-level AI-based assistive tool when used by pathologists for the grading of prostate biopsies. Design, Setting, and Participants: This diagnostic study used a fully crossed multiple-reader, multiple-case design to evaluate an AI-based assistive tool for prostate biopsy grading. Retrospective grading of prostate core needle biopsies from 2 independent medical laboratories in the US was performed between October 2019 and January 2020. A total of 20 general pathologists reviewed 240 prostate core needle biopsies from 240 patients. Each pathologist was randomized to 1 of 2 study cohorts. The 2 cohorts reviewed every case in the opposite modality (with AI assistance vs without AI assistance) to each other, with the modality switching after every 10 cases. After a minimum 4-week washout period for each batch, the pathologists reviewed the cases for a second time using the opposite modality. The pathologist-provided grade group for each biopsy was compared with the majority opinion of urologic pathology subspecialists. Exposure: An AI-based assistive tool for Gleason grading of prostate biopsies. Main Outcomes and Measures: Agreement between pathologists and subspecialists with and without the use of an AI-based assistive tool for the grading of all prostate biopsies and Gleason grade group 1 biopsies. Results: Biopsies from 240 patients (median age, 67 years; range, 39-91 years) with a median prostate-specific antigen level of 6.5 ng/mL (range, 0.6-97.0 ng/mL) were included in the analyses. Artificial intelligence–assisted review by pathologists was associated with a 5.6% increase (95% CI, 3.2%-7.9%; P < .001) in agreement with subspecialists (from 69.7% for unassisted reviews to 75.3% for assisted reviews) across all biopsies and a 6.2% increase (95% CI, 2.7%-9.8%; P = .001) in agreement with subspecialists (from 72.3% for unassisted reviews to 78.5% for assisted reviews) for grade group 1 biopsies. A secondary analysis indicated that AI assistance was also associated with improvements in tumor detection, mean review time, mean self-reported confidence, and interpathologist agreement. Conclusions and Relevance: In this study, the use of an AI-based assistive tool for the review of prostate biopsies was associated with improvements in the quality, efficiency, and consistency of cancer detection and grading. View details
    Predicting OCT-derived DME grades from fundus photographs using deep learning
    Arunachalam Narayanaswamy
    Avinash Vaidyanathan Varadarajan
    Dr. Paisan Raumviboonsuk
    Dr. Peranut Chotcomwongse
    Jorge Cuadros
    Lily Hao Yi Peng
    Pearse Keane
    Subhashini Venugopalan
    Nature Communications (2020)
Preview abstract Diabetic eye disease is one of the fastest growing causes of preventable blindness. With the advent of anti-VEGF therapies, it has become increasingly important to detect center-involved DME (ci-DME). However, ci-DME is diagnosed using optical coherence tomography (OCT), which is not generally available at screening sites. Instead, screening programs rely on the detection of hard exudates as a proxy for DME on color fundus photographs, but this often results in a fair number of false positive and false negative calls. We trained a deep learning model to use color fundus images to directly predict grades derived from OCT exams for DME. Our OCT-based model had an AUC of 0.89 (95% CI: 0.87-0.91), which corresponds to a sensitivity of 85% at a specificity of 80%. In comparison, the ophthalmology graders had sensitivities ranging from 82%-85% and specificities ranging from 44%-50%. These metrics correspond to a PPV of 61% (95% CI: 56%-66%) for the OCT-based algorithm and a range of 36%-38% (95% CI ranging from 33%-42%) for ophthalmologists. In addition, we used multiple attention techniques to explain how the model is making its prediction. The ability of deep learning algorithms to make, from simple 2D images, clinically relevant predictions that generally require sophisticated 3D-imaging equipment has broad relevance to many other applications in medical imaging. View details
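Quoting a sensitivity "at a specificity of 80%" implies fixing an operating threshold on the model's continuous output. A sketch of how such an operating point could be chosen, with synthetic scores and invented names:

```python
import numpy as np

def operating_point(y_true, scores, target_specificity=0.80):
    """Pick the threshold whose specificity is closest to the target and
    report sensitivity and PPV at that threshold."""
    y_true = np.asarray(y_true, bool)
    best = None
    for t in np.unique(scores):
        pred = scores >= t
        tn = np.sum(~pred & ~y_true); fp = np.sum(pred & ~y_true)
        tp = np.sum(pred & y_true);   fn = np.sum(~pred & y_true)
        spec = tn / (tn + fp)
        if best is None or abs(spec - target_specificity) < abs(best[0] - target_specificity):
            best = (spec, tp / (tp + fn), tp / max(tp + fp, 1), t)
    return best

rng = np.random.default_rng(1)
y = rng.uniform(size=1000) < 0.2                       # toy ci-DME prevalence
scores = np.clip(0.3 * y + rng.normal(0.4, 0.2, 1000), 0, 1)
spec, sens, ppv, t = operating_point(y, scores)
print(f"threshold={t:.2f} specificity={spec:.2f} sensitivity={sens:.2f} PPV={ppv:.2f}")
```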
    Scientific Discovery by Generating Counterfactuals using Image Translation
Arunachalam Narayanaswamy
    Subhashini Venugopalan
    Lily Hao Yi Peng
    Dr. Paisan Raumviboonsuk
    Avinash Vaidyanathan Varadarajan
    Proceedings of MICCAI, International Conference on Medical Image Computing and Computer-Assisted Intervention (2020)
Preview abstract Visual recognition models are increasingly applied to scientific domains such as drug studies and medical diagnosis, and model explanation techniques play a critical role in understanding the source of a model's performance and making its decisions transparent. In this work we investigate whether explanation techniques can also be used as a mechanism for scientific discovery. We make two contributions: first, we propose a framework to convert predictions from explanation techniques into a mechanism of discovery; second, we show how generative models in combination with black-box predictors can be used to generate hypotheses (without human priors) that can be critically examined. With these techniques we study classification models on retinal fundus images predicting diabetic macular edema (DME). Deep convolutional models on 2D retinal fundus images can do nearly as well as ophthalmologists looking at 3D scans, making this an interesting case study of clinical relevance. Our work highlights that while existing explanation tools are useful, they do not necessarily provide a complete answer. With the proposed framework we are able to bridge the gap between a model's performance and human understanding of the underlying mechanism, which is of vital scientific interest. View details
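The discovery loop the abstract describes can be caricatured without any generative model: perturb an input until the black-box prediction flips, then inspect what changed. The paper uses image-to-image translation models; the sketch below substitutes a trivial finite-difference nudge on synthetic feature vectors purely to show the loop, so everything here is invented:

```python
# Toy illustration of the counterfactual discovery loop (not the paper's
# GAN-based image translation). Synthetic data only.
import numpy as np

w = np.zeros(64); w[10] = 2.0; w[40] = -1.5   # hidden truth: two features matter

def classifier(x):                  # black-box predictor, p(disease | input)
    return 1 / (1 + np.exp(-(x @ w)))

def translate(x, step=0.1):         # stand-in "translator": finite-difference nudge
    effects = np.array([classifier(x + step * e) - classifier(x)
                        for e in np.eye(len(x))])
    return x + step * np.sign(effects)

rng = np.random.default_rng(0)
x = 0.1 * rng.normal(size=64)
x_cf = x.copy()
while classifier(x_cf) < 0.9:       # push toward the "diseased" decision
    x_cf = translate(x_cf)

diff = np.abs(x_cf - x)             # the counterfactual's edits form a hypothesis
print("features changed most:", np.argsort(diff)[-2:])   # surfaces 10 and 40
```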
    Deep Learning and Glaucoma Specialists: The Relative Importance of Optic Disc Features to Predict Glaucoma Referral in Fundus Photographs
    Sonia Phene
    Carter Dunn
    Naama Hammel
    Jonathan Krause
    Naho Kitade
    Mike Schaekermann
    Rory Abbott Sayres
    Derek Wu
    Ashish Bora
    Christopher Semturs
    Anita Misra
    Abigail Huang
    Arielle Spitze
    Felipe Medeiros
    April Maa
    Monica Gandhi
    Lily Peng
    Ophthalmology (2019)
    Preview abstract Purpose To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with eye care providers. Design Development and validation of an algorithm. Participants Fundus images from screening programs, studies, and a glaucoma clinic. Methods A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic. Main Outcome Measures The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features. Results The algorithm’s AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929–0.960) in dataset A, 0.855 (95% CI, 0.841–0.870) in dataset B, and 0.881 (95% CI, 0.838–0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels. Conclusions A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup. View details
    Performance of a Deep-Learning Algorithm vs Manual Grading for Detecting Diabetic Retinopathy in India
    Renu P. Rajan
    Derek Wu
    Peter Wubbels
    Tyler Rhodes
    Kira Whitehouse
    Ramasamy Kim
    Rajiv Raman
    Lily Peng
    JAMA Ophthalmology (2019)
Preview abstract Importance More than 60 million people in India have diabetes and are at risk for diabetic retinopathy (DR), a vision-threatening disease. Automated interpretation of retinal fundus photographs can help support and scale a robust screening program to detect DR. Objective To prospectively validate the performance of an automated DR system across 2 sites in India. Design, Setting, and Participants This prospective observational study was conducted at 2 eye care centers in India (Aravind Eye Hospital and Sankara Nethralaya) and included 3049 patients with diabetes. Data collection and patient enrollment took place between April 2016 and July 2016 at Aravind and May 2016 and April 2017 at Sankara Nethralaya. The model was trained and fixed in March 2016. Interventions Automated DR grading system compared with manual grading by 1 trained grader and 1 retina specialist from each site. Adjudication by a panel of 3 retinal specialists served as the reference standard in the cases of disagreement. Main Outcomes and Measures Sensitivity and specificity for moderate or worse DR or referable diabetic macular edema. Results Of 3049 patients, 1091 (35.8%) were women and the mean (SD) age for patients at Aravind and Sankara Nethralaya was 56.6 (9.0) years and 56.0 (10.0) years, respectively. For moderate or worse DR, the sensitivity and specificity for manual grading by individual nonadjudicator graders ranged from 73.4% to 89.8% and from 83.5% to 98.7%, respectively. The automated DR system's performance equaled or exceeded that of manual grading, with an 88.9% sensitivity (95% CI, 85.8-91.5), 92.2% specificity (95% CI, 90.3-93.8), and an area under the curve of 0.963 on the data set from Aravind Eye Hospital and 92.1% sensitivity (95% CI, 90.1-93.8), 95.2% specificity (95% CI, 94.2-96.1), and an area under the curve of 0.980 on the data set from Sankara Nethralaya. Conclusions and Relevance This study shows that the automated DR system generalizes to this population of Indian patients in a prospective setting and demonstrates the feasibility of using an automated DR grading system to expand screening programs. View details
    Preview abstract Machine learning (ML) is increasingly being used in image retrieval systems for medical decision making. One application of ML is to retrieve visually similar medical images from past patients (e.g. tissue from biopsies) to reference when making a medical decision with a new patient. However, no algorithm can perfectly capture an expert's ideal notion of similarity for every case: an image that is algorithmically determined to be similar may not be medically relevant to a doctor's specific diagnostic needs. In this paper, we identified the needs of pathologists when searching for similar images retrieved using a deep learning algorithm, and developed tools that empower users to cope with the search algorithm on-the-fly, communicating what types of similarity are most important at different moments in time. In two evaluations with pathologists, we found that these refinement tools increased the diagnostic utility of images found and increased user trust in the algorithm. The tools were preferred over a traditional interface, without a loss in diagnostic accuracy. We also observed that users adopted new strategies when using refinement tools, re-purposing them to test and understand the underlying algorithm and to disambiguate ML errors from their own errors. Taken together, these findings inform future human-ML collaborative systems for expert decision-making. View details
    Preview abstract Background Artificial intelligence (AI) research in healthcare is accelerating rapidly with potential applications being demonstrated across many different domains of medicine. However, there are currently limited examples of such techniques being successfully deployed into clinical practice. This article explores the main challenges and limitations of AI in healthcare, and considers steps required to translate these potentially transformative technologies from research to clinical practice. Main body Key challenges for the translation of AI systems in healthcare include those intrinsic to the science of machine learning, logistical difficulties in implementation, and consideration of barriers to adoption or necessary sociocultural or pathway change. Robust peer-reviewed clinical evaluation as part of randomised controlled trials should be viewed as the gold standard for evidence generation, but conducting these in practice may not always be appropriate or feasible. Performance metrics should aim to capture real clinical applicability, and be understandable to intended users. Regulation that balances pace of innovation with potential for harm, alongside thoughtful postmarket surveillance, is required to ensure that patients are not exposed to dangerous interventions, nor deprived of access to beneficial innovations. Mechanisms to enable direct comparisons of AI systems must be developed, including the use of independent, local and representative test sets. Developers of AI algorithms must be vigilant to potential dangers including dataset shift, accidental fitting of confounders, unintended discriminatory bias, the challenges of generalisation to new populations, and unintended negative consequences of new algorithms on health outcomes. Conclusions The safe and timely translation of AI research into clinically validated tools that can benefit everyone is challenging. Further work is required to continue developing robust clinical evaluation and regulatory frameworks using metrics that are intuitive to clinicians, identifying themes of algorithmic bias and unfairness while developing mitigations to address this, reducing brittleness and improving generalisability, and developing methods for improved interpretability of machine learning models. If these goals can be achieved, the benefits for patients are likely to be transformational. View details
    An Augmented Reality Microscope with Real-time Artificial Intelligence Integration for Cancer Diagnosis
    Cameron Chen
    Krishna Kumar Gadepalli
    Bob MacDonald
    Shiro Kadowaki
    Kunal Nagpal
    Timo Kohlberger
    Jason Hipp
    Craig Mermel
    Martin Stumpe
    Nature Medicine (2019)
    Preview abstract The microscopic assessment of tissue samples is instrumental for the diagnosis and staging of cancer and thus guides therapy. However, these assessments demonstrate significant variability, and many regions of the world lack access to trained pathologists. Though Artificial Intelligence (AI) promises to improve the access and quality of healthcare, the costs of image digitization in pathology and difficulties in deploying AI solutions remain as barriers to real-world use. Here we propose a cost-effective solution: the Augmented Reality Microscope (ARM). The ARM overlays AI-based information onto the current view of the sample in real-time, enabling seamless integration of AI into routine workflows. We demonstrate the utility of ARM in the detection of metastatic breast cancer and the identification of prostate cancer with latency compatible with real-time use. We anticipate that the ARM will remove barriers towards the use of AI designed to improve the accuracy and efficiency of cancer diagnosis. View details
    End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography
    Diego Ardila
    Atilla Peter Kiraly
    Sujeeth Bharadwaj
    Bokyung Choi
    Joshua Reicher
    Lily Peng
    Mozziyar Etemadi
    David Naidich
    Shravya Ramesh Shetty
    Nature Medicine (2019)
Preview abstract With an estimated 160,000 deaths in 2018, lung cancer is the most common cause of cancer death in the United States. Lung cancer screening using low-dose computed tomography has been shown to reduce mortality by 20–43% and is now included in US screening guidelines. Existing challenges include inter-grader variability and high false-positive and false-negative rates. We propose a deep learning algorithm that uses a patient’s current and prior computed tomography volumes to predict the risk of lung cancer. Our model achieves a state-of-the-art performance (94.4% area under the curve) on 6,716 National Lung Cancer Screening Trial cases, and performs similarly on an independent clinical validation set of 1,139 cases. We conducted two reader studies. When prior computed tomography imaging was not available, our model outperformed all six radiologists with absolute reductions of 11% in false positives and 5% in false negatives. Where prior computed tomography imaging was available, the model performance was on-par with the same radiologists. This creates an opportunity to optimize the screening process via computer assistance and automation. While the vast majority of patients remain unscreened, we show the potential for deep learning models to increase the accuracy, consistency and adoption of lung cancer screening worldwide. View details
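A minimal sketch, assuming TensorFlow/Keras, of a 3-D CNN that ingests both a current and a prior CT volume. This is not the paper's architecture; the volumes here are toy 64-voxel cubes rather than the full-resolution scans, and every layer size is an illustrative assumption:

```python
import tensorflow as tf

def build_model(shape=(64, 64, 64, 1)):
    current = tf.keras.Input(shape=shape, name="current_ct")
    prior = tf.keras.Input(shape=shape, name="prior_ct")
    # Stack the two studies as channels so filters can compare them directly.
    x = tf.keras.layers.Concatenate(axis=-1)([current, prior])
    for filters in (16, 32, 64):
        x = tf.keras.layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.MaxPooling3D(2)(x)
    x = tf.keras.layers.GlobalAveragePooling3D()(x)
    risk = tf.keras.layers.Dense(1, activation="sigmoid", name="cancer_risk")(x)
    return tf.keras.Model([current, prior], risk)

model = build_model()
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.summary()
```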
    Predicting Anemia from Fundus Images
    Akinori Mitani
    Abigail Huang
    Subhashini Venugopalan
    Lily Peng
    Naama Hammel
    Avinash Vaidyanathan Varadarajan
    Nature Biomedical Engineering (2019)
Preview abstract Owing to the invasiveness of diagnostic tests for anaemia and the costs associated with screening for it, the condition is often undetected. Here, we show that anaemia can be detected via machine-learning algorithms trained using retinal fundus images, study participant metadata (including race or ethnicity, age, sex and blood pressure) or the combination of both data types (images and study participant metadata). In a validation dataset of 11,388 study participants from the UK Biobank, the fundus-image-only, metadata-only and combined models predicted haemoglobin concentration (in g dl–1) with mean absolute error values of 0.73 (95% confidence interval: 0.72–0.74), 0.67 (0.66–0.68) and 0.63 (0.62–0.64), respectively, and with areas under the receiver operating characteristic curve (AUC) values of 0.74 (0.71–0.76), 0.87 (0.85–0.89) and 0.88 (0.86–0.89), respectively. For 539 study participants with self-reported diabetes, the combined model predicted haemoglobin concentration with a mean absolute error of 0.73 (0.68–0.78) and anaemia with an AUC of 0.89 (0.85–0.93). Automated anaemia screening on the basis of fundus images could particularly aid patients with diabetes undergoing regular retinal imaging and for whom anaemia can increase morbidity and mortality risks. View details
    Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer
    Kunal Nagpal
    Davis Foote
    Cameron Chen
    Fraser Tan
    Niels Olson
    Jenny Smith
    Arash Mohtashamian
    James H. Wren
    Robert MacDonald
    Lily Peng
    Mahul Amin
    Andrew Evans
    Ankur Sangoi
    Craig Mermel
    Jason Hipp
    Martin Stumpe
    Nature Partner Journal (npj) Digital Medicine, vol. 2 (2019), pp. 48
    Preview abstract For prostate cancer patients, the Gleason score is one of the most important prognostic factors, potentially determining treatment independent of the stage. However, Gleason scoring is based on subjective microscopic examination of tumor morphology and suffers from poor reproducibility. Here we present a deep learning system (DLS) for Gleason scoring whole-slide images of prostatectomies. Our system was developed using 112 million pathologist-annotated image patches from 1226 slides, and evaluated on an independent validation dataset of 331 slides. Compared to a reference standard provided by genitourinary pathology experts, the mean accuracy among 29 general pathologists was 0.61 on the validation set. The DLS achieved a significantly higher diagnostic accuracy of 0.70 (p = 0.002) and trended towards better patient risk stratification in correlations to clinical follow-up data. Our approach could improve the accuracy of Gleason scoring and subsequent therapy decisions, particularly where specialist expertise is unavailable. The DLS also goes beyond the current Gleason system to more finely characterize and quantitate tumor morphology, providing opportunities for refinement of the Gleason system itself. View details
    Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation
    Anna Dagna Majkowska
    Sid Mittal
    Joshua Reicher
    Scott Mayer McKinney
    Gavin Duggan
    Krish Eswaran
    Cameron Chen
    Sreenivasa Raju Kalidindi
    Alexander Ding
    Shravya Ramesh Shetty
    Radiology (2019)
Preview abstract Background Deep learning has the potential to augment the use of chest radiography in clinical radiology, but challenges include poor generalizability, spectrum bias, and difficulty comparing across studies. Purpose To develop and evaluate deep learning models for chest radiograph interpretation by using radiologist-adjudicated reference standards. Materials and Methods Deep learning models were developed to detect four findings (pneumothorax, opacity, nodule or mass, and fracture) on frontal chest radiographs. This retrospective study used two data sets. Data set 1 (DS1) consisted of 759 611 images from a multicity hospital network; ChestX-ray14 is a publicly available data set with 112 120 images. Natural language processing and expert review of a subset of images provided labels for 657 954 training images. Test sets consisted of 1818 and 1962 images from DS1 and ChestX-ray14, respectively. Reference standards were defined by radiologist-adjudicated image review. Performance was evaluated by area under the receiver operating characteristic curve analysis, sensitivity, specificity, and positive predictive value. Four radiologists reviewed test set images for performance comparison. Inverse probability weighting was applied to DS1 to account for positive radiograph enrichment and estimate population-level performance. Results In DS1, population-adjusted areas under the receiver operating characteristic curve for pneumothorax, nodule or mass, airspace opacity, and fracture were, respectively, 0.95 (95% confidence interval [CI]: 0.91, 0.99), 0.72 (95% CI: 0.66, 0.77), 0.91 (95% CI: 0.88, 0.93), and 0.86 (95% CI: 0.79, 0.92). With ChestX-ray14, areas under the receiver operating characteristic curve were 0.94 (95% CI: 0.93, 0.96), 0.91 (95% CI: 0.89, 0.93), 0.94 (95% CI: 0.93, 0.95), and 0.81 (95% CI: 0.75, 0.86), respectively. Conclusion Expert-level models for detecting clinically relevant chest radiograph findings were developed for this study by using adjudicated reference standards and with population-level performance estimation. Radiologist-adjudicated labels for 2412 ChestX-ray14 validation set images and 1962 test set images are provided. View details
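The "population-adjusted" results come from inverse probability weighting: because the test set was enriched for positive radiographs, each case is weighted by the inverse of its probability of being sampled, so the weighted metrics estimate population-level performance. A sketch with invented numbers:

```python
import numpy as np

def ipw_sensitivity_specificity(y_true, y_pred, sampling_prob):
    """Population-adjusted sensitivity/specificity: weight each case by the
    inverse of its probability of entering the enriched test set."""
    y_true = np.asarray(y_true, bool)
    y_pred = np.asarray(y_pred, bool)
    w = 1.0 / np.asarray(sampling_prob)
    tp = np.sum(w[y_pred & y_true]);   fn = np.sum(w[~y_pred & y_true])
    tn = np.sum(w[~y_pred & ~y_true]); fp = np.sum(w[y_pred & ~y_true])
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: positives were over-sampled (probability 0.9),
# negatives under-sampled (probability 0.1).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0], bool)
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 0], bool)
p_sample = np.where(y_true, 0.9, 0.1)
sens, spec = ipw_sensitivity_specificity(y_true, y_pred, p_sample)
print(f"adjusted sensitivity={sens:.2f}, specificity={spec:.2f}")
```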
    Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program
    Dr. Paisan Raumviboonsuk
    Jonathan Krause
    Dr. Peranut Chotcomwongse
    Rory Abbott Sayres
    Rajiv Raman
    Sonia Phene
    Kornwipa Hemarat
    Mongkol Tadarati
    Sukhum Silpa-Archa
    Jirawut Limwattanayingyong
    Chetan Rao
    Oscar Kuruvilla
    Jesse Jung
    Jeffrey Tan
    Surapong Orprayoon
    Chawawat Kangwanwongpaisan
    Ramase Sukumalpaiboon
    Chainarong Luengchaichawang
    Jitumporn Fuangkaew
    Pipat Kongsap
    Lamyong Chualinpha
    Sarawuth Saree
    Srirut Kawinpanitan
    Korntip Mitvongsa
    Siriporn Lawanasakol
    Chaiyasit Thepchatri
    Lalita Wongpichedchai
    Lily Peng
    Nature Partner Journal (npj) Digital Medicine (2019)
    Preview abstract Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. A total of 25,326 gradable retinal images of patients with diabetes from the community-based, nationwide screening program of DR in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Relative to human graders, for detecting referable DR (moderate NPDR or worse), the deep learning algorithm had significantly higher sensitivity (0.97 vs. 0.74, p < 0.001), and a slightly lower specificity (0.96 vs. 0.98, p < 0.001). Higher sensitivity of the algorithm was also observed for each of the categories of severe or worse NPDR, PDR, and DME (p < 0.001 for all comparisons). The quadratic-weighted kappa for determination of DR severity levels by the algorithm and human graders was 0.85 and 0.78 respectively (p < 0.001 for the difference). Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening. View details
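The agreement statistic used here, the quadratic-weighted kappa, penalizes disagreements more the farther apart two severity grades are. A minimal sketch with made-up grades, using scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# 5-point DR severity grades (0 = none ... 4 = proliferative) assigned by the
# algorithm and by a human grader to the same toy set of images; the values
# are illustrative, not from the study.
algorithm = [0, 1, 2, 2, 3, 4, 0, 1, 3, 2]
grader    = [0, 1, 2, 3, 3, 4, 0, 0, 2, 2]

kappa = cohen_kappa_score(algorithm, grader, weights="quadratic")
print(f"quadratic-weighted kappa = {kappa:.3f}")
```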
    Using a deep learning algorithm and integrated gradient explanation to assist grading for diabetic retinopathy
    Ankur Taly
    Anthony Joseph
    Arjun Sood
    Arun Narayanaswamy
    Derek Wu
    Ehsan Rahimy
    Jesse Smith
    Jonathan Krause
    Katy Blumer
    Lily Peng
    Michael Shumski
    Naama Hammel
    Rory Abbott Sayres
    Scott Barb
    Zahra Rastegar
    Ophthalmology (2019)
Preview abstract Background Deep learning methods have recently produced algorithms that can detect disease such as diabetic retinopathy (DR) with doctor-level accuracy. We sought to understand the impact of these models on physician graders in assisted-read settings. Methods We surfaced model predictions and explanation maps ("masks") to 9 ophthalmologists with varying levels of experience to read 1,804 images each for DR severity based on the International Clinical Diabetic Retinopathy (ICDR) disease severity scale. The image sample was representative of the diabetic screening population, and was adjudicated by 3 retina specialists for a reference standard. Doctors read each image in one of 3 conditions: Unassisted, Grades Only, or Grades+Masks. Findings Readers graded DR more accurately with model assistance than without (p < 0.001, logistic regression). Compared to the adjudicated reference standard, for cases with disease, 5-class accuracy was 57.5% for the model. For graders, 5-class accuracy for cases with disease was 47.5 ± 5.6% unassisted, 56.9 ± 5.5% with Grades Only, and 61.5 ± 5.5% with Grades+Mask. Reader performance improved with assistance across all levels of DR, including for severe and proliferative DR. Model assistance increased the accuracy of retina fellows and trainees above that of the unassisted grader or model alone. Doctors’ grading confidence scores and read times both increased overall with assistance. For most cases, Grades + Masks was only as effective as Grades Only, though masks provided additional benefit over grades alone in cases with: some DR and low model certainty; low image quality; and proliferative diabetic retinopathy (PDR) with features that were frequently missed, such as panretinal photocoagulation (PRP) scars. Interpretation Taken together, these results show that deep learning models can improve the accuracy of, and confidence in, DR diagnosis in an assisted read setting. View details
    Similar Image Search for Histopathology: SMILY
    Jason Hipp
    Michael Emmert-Buck
    Daniel Smilkov
    Mahul Amin
    Craig Mermel
    Lily Peng
    Martin Stumpe
    Nature Partner Journal (npj) Digital Medicine (2019)
    Preview abstract The increasing availability of large institutional and public histopathology image datasets is enabling the searching of these datasets for diagnosis, research, and education. Although these datasets typically have associated metadata such as diagnosis or clinical notes, even carefully curated datasets rarely contain annotations of the location of regions of interest on each image. As pathology images are extremely large (up to 100,000 pixels in each dimension), further laborious visual search of each image may be needed to find the feature of interest. In this paper, we introduce a deep-learning-based reverse image search tool for histopathology images: Similar Medical Images Like Yours (SMILY). We assessed SMILY’s ability to retrieve search results in two ways: using pathologist-provided annotations, and via prospective studies where pathologists evaluated the quality of SMILY search results. As a negative control in the second evaluation, pathologists were blinded to whether search results were retrieved by SMILY or randomly. In both types of assessments, SMILY was able to retrieve search results with similar histologic features, organ site, and prostate cancer Gleason grade compared with the original query. SMILY may be a useful general purpose tool in the pathologist’s arsenal, to improve the efficiency of searching large archives of histopathology images, without the need to develop and implement specific tools for each application. View details
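At its core, a reverse image search like the one described reduces to nearest-neighbor lookup in an embedding space. A sketch with random vectors standing in for the patch embeddings a trained network would produce:

```python
import numpy as np

def top_k_similar(query_emb, index_embs, k=5):
    """Cosine-similarity search over a bank of image-patch embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    bank = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = bank @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

rng = np.random.default_rng(0)
index = rng.normal(size=(10_000, 128))   # embeddings of archived patches (toy)
query = rng.normal(size=128)             # embedding of the query patch (toy)
ids, sims = top_k_similar(query, index)
print(ids, np.round(sims, 3))
```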
    TensorFlow.js: Machine Learning for the Web and Beyond
    Daniel Smilkov
    Nikhil Thorat
    Yannick Assogba
    Ann Yuan
    Nick Kreeger
    Ping Yu
    Kangyi Zhang
    Eric Nielsen
    Stan Bileschi
    Charles Nicholson
    Sandeep N. Gupta
    Sarah Sirajuddin
    Rajat Monga
    SysML, Palo Alto, CA, USA (2019)
    Preview abstract TensorFlow.js is a library for building and executing machine learning algorithms in JavaScript. TensorFlow.js models run in a web browser and in the Node.js environment. The library is part of the TensorFlow ecosystem, providing a set of APIs that are compatible with those in Python, allowing models to be ported between the Python and JavaScript ecosystems. TensorFlow.js has empowered a new set of developers from the extensive JavaScript community to build and deploy machine learning models and enabled new classes of on-device computation. This paper describes the design, API, and implementation of TensorFlow.js, and highlights some of the impactful use cases. View details
    Remote Tool-based Adjudication for Grading Diabetic Retinopathy
    Mike Schaekermann
    Naama Hammel
    Tayyeba Ali
    Brian Basham
    Will Chen
    Xiang Ji
    Jonathan Krause
    Lily Peng
    Edith Law
    Rory Abbott Sayres
    Translational Vision Science & Technology (TVST) (2019)
    Preview abstract Purpose: To present and evaluate a remote, tool-based system and structured grading rubric for adjudicating image-based diabetic retinopathy (DR) grades. Methods: We compared three different procedures for adjudicating DR severity assessments among retina specialist panels, including (1) in-person adjudication based on a previously described procedure (Baseline), (2) remote, tool-based adjudication for assessing DR severity alone (TA), and (3) remote, tool-based adjudication using a feature-based rubric (TA-F). We developed a system allowing graders to review images remotely and asynchronously. For both TA and TA-F approaches, images with disagreement were reviewed by all graders in a round-robin fashion until disagreements were resolved. Five panels of three retina specialists each adjudicated a set of 499 retinal fundus images (1 panel using Baseline, 2 using TA, and 2 using TA-F adjudication). Reliability was measured as grade agreement among the panels using Cohen's quadratically weighted kappa. Efficiency was measured as the number of rounds needed to reach a consensus for tool-based adjudication. Results: The grades from remote, tool-based adjudication showed high agreement with the Baseline procedure, with Cohen's kappa scores of 0.948 and 0.943 for the two TA panels, and 0.921 and 0.963 for the two TA-F panels. Cases adjudicated using TA-F were resolved in fewer rounds compared with TA (P < 0.001; standard permutation test). Conclusions: Remote, tool-based adjudication presents a flexible and reliable alternative to in-person adjudication for DR diagnosis. Feature-based rubrics can help accelerate consensus for tool-based adjudication of DR without compromising label quality. Translational Relevance: This approach can generate reference standards to validate automated methods, and resolve ambiguous diagnoses by integrating into existing telemedical workflows. View details
    Predicting Progression of Age-related Macular Degeneration from Fundus Images using Deep Learning
    Boris Babenko
    Siva Balasubramanian
    Katy Blumer
    Lily Peng
    Naama Hammel
    Avinash Vaidyanathan Varadarajan
    arXiv (2019)
    Preview abstract Background: Patients with neovascular age-related macular degeneration (AMD) can avoid vision loss via certain therapy. However, methods to predict the progression to neovascular age-related macular degeneration (nvAMD) are lacking. Purpose: To develop and validate a deep learning (DL) algorithm to predict 1-year progression of eyes with no, early, or intermediate AMD to nvAMD, using color fundus photographs (CFP). Design: Development and validation of a DL algorithm. Methods: We trained a DL algorithm to predict 1-year progression to nvAMD, and used 10-fold cross-validation to evaluate this approach on two groups of eyes in the Age-Related Eye Disease Study (AREDS): none/early/intermediate AMD, and intermediate AMD (iAMD) only. We compared the DL algorithm to the manually graded 4-category and 9-step scales in the AREDS dataset. Main outcome measures: Performance of the DL algorithm was evaluated using the sensitivity at 80% specificity for progression to nvAMD. Results: The DL algorithm's sensitivity for predicting progression to nvAMD from none/early/iAMD (78+/-6%) was higher than manual grades from the 9-step scale (67+/-8%) or the 4-category scale (48+/-3%). For predicting progression specifically from iAMD, the DL algorithm's sensitivity (57+/-6%) was also higher compared to the 9-step grades (36+/-8%) and the 4-category grades (20+/-0%). Conclusions: Our DL algorithm performed better in predicting progression to nvAMD than manual grades. Future investigations are required to test the application of this DL algorithm in a real-world clinical setting. View details
    Scalable and accurate deep learning for electronic health records
    Alvin Rishi Rajkomar
    Eyal Oren
    Kai Chen
    Nissan Hajaj
    Mila Hardt
    Xiaobing Liu
    Jake Marcus
    Patrik Per Sundberg
    Kun Zhang
    Yi Zhang
    Gerardo Flores
    Gavin Duggan
    Jamie Irvine
    Kurt Litsch
    Alex Mossin
    Justin Jesada Tansuwan
    De Wang
    Dana Ludwig
    Samuel Volchenboum
    Kat Chou
    Michael Pearson
    Srinivasan Madabushi
    Nigam Shah
    Atul Butte
    npj Digital Medicine (2018)
    Preview abstract Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting: in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case-study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient’s chart. View details
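The key representational idea is unrolling a patient's raw record into one time-ordered sequence of events that a sequence model can consume. A hypothetical sketch of that unrolling; the event kinds and fields below are invented, not the paper's FHIR schema:

```python
from dataclasses import dataclass

@dataclass
class Event:
    time: float          # hours since admission
    kind: str            # e.g. "lab", "med", "note_token"
    value: str

record = [
    Event(0.5, "lab", "WBC=11.2"),
    Event(2.0, "med", "ceftriaxone"),
    Event(1.0, "note_token", "fever"),
]

# Unroll into the sequential format a sequence model would consume.
sequence = [f"{e.kind}:{e.value}" for e in sorted(record, key=lambda e: e.time)]
print(sequence)   # ['lab:WBC=11.2', 'note_token:fever', 'med:ceftriaxone']
```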
    Ensuring Fairness in Machine Learning to Advance Health Equity
    Alvin Rishi Rajkomar
    Marshall Chin
    Mila Hardt
    Annals of Internal Medicine (2018)
Preview abstract A central promise of machine learning (ML) is to use historical data to project the future trajectories of patients. Will they have a good or bad outcome? What diagnoses will they have? What treatments should they be given? But in many cases, we do not want the future to look like the past, especially when the past contains patterns of human or structural biases against vulnerable populations. This is not an abstract problem. In a model used to predict future crime based on historical records, black defendants who did not re-offend were classified as high-risk at a substantially higher rate than white defendants who did not re-offend.22 Similar biases have been observed in predictive policing,23 social services,24 and technology companies.25 Given known healthcare disparities, this problem will nearly inevitably surface in medical domains where ML could be applied (Table 1): a "protected group" could be systematically excluded from the benefits of an ML system or even harmed. We argue that ML systems should be fair, which is defined in medical ethics as the "moral obligation to act on the basis of fair adjudication between competing claims."26 Recent advances in computer science have offered mathematical and procedural suggestions to make an ML system fairer: ensuring it is equally accurate for patients in a protected class, allocating resources to protected classes in proportion to need, leading to better patient outcomes for all, and building and testing it in ways that protect the privacy, expectations, and trust of patients. To guide clinicians, administrators, policymakers, and regulators in making principled decisions to improve ML fairness, we illustrate the mechanisms by which a model could be unfair. We then review both technical and non-technical solutions to improve fairness. Finally, we make policy recommendations to stakeholders specifying roles, responsibilities and oversight. View details
Preview abstract Purpose To use adjudication to quantify errors in diabetic retinopathy (DR) grading based on individual graders and majority decision, and to train an improved automated algorithm for DR grading. Design Retrospective analysis. Participants Retinal fundus images from DR screening programs. Methods Images were each graded by the algorithm, U.S. board-certified ophthalmologists, and retinal specialists. The adjudicated consensus of the retinal specialists served as the reference standard. Main Outcome Measures For agreement between different graders as well as between the graders and the algorithm, we measured the (quadratic-weighted) kappa score. To compare the performance of different forms of manual grading and the algorithm for various DR severity cutoffs (e.g., mild or worse DR, moderate or worse DR), we measured area under the curve (AUC), sensitivity, and specificity. Results Of the 193 discrepancies between adjudication by retinal specialists and majority decision of ophthalmologists, the most common were missing microaneurysms (MAs) (36%), artifacts (20%), and misclassified hemorrhages (16%). Relative to the reference standard, the kappa was 0.82 to 0.91 for individual retinal specialists, 0.80 to 0.84 for ophthalmologists, and 0.84 for the algorithm. For moderate or worse DR, the majority decision of ophthalmologists had a sensitivity of 0.838 and specificity of 0.981. The algorithm had a sensitivity of 0.971, specificity of 0.923, and AUC of 0.986. For mild or worse DR, the algorithm had a sensitivity of 0.970, specificity of 0.917, and AUC of 0.986. By using a small number of adjudicated consensus grades as a tuning dataset and higher-resolution images as input, the algorithm improved in AUC from 0.934 to 0.986 for moderate or worse DR. Conclusions Adjudication reduces the errors in DR grading. A small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists. View details
    Predicting Cardiovascular Risk Factors in Retinal Fundus Photographs using Deep Learning
    Avinash Vaidyanathan Varadarajan
    Katy Blumer
    Mike McConnell
    Lily Peng
    Nature Biomedical Engineering (2018)
    Preview abstract Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction. View details
Preview abstract Objective: Refractive error, one of the leading causes of visual impairment, can be corrected by simple interventions like prescribing eyeglasses, which often starts with autorefraction to estimate the refractive error. In this study, using deep learning, we trained a network to estimate refractive error from fundus photos only. Design: Retrospective analysis. Subjects, Participants, and/or Controls: Retinal fundus images from participants in the UK Biobank cohort (45-degree field of view) and the AREDS clinical trial (30-degree field of view). Methods, Intervention, or Testing: Refractive error was measured by autorefraction in the UK Biobank dataset and subjective refraction in the AREDS dataset. We trained a deep learning algorithm to predict refractive error from the fundus photographs and compared the algorithm's predictions with the documented refractive error measurements. Our model used attention for identifying features that are predictive for refractive error. Main Outcome Measures: Mean absolute error (MAE) of the algorithm's prediction compared to the refractive error obtained in the AREDS and UK Biobank. Results: The resulting algorithm had a mean absolute error (MAE) of 0.56 diopters (95% CI: 0.55-0.56) for estimating spherical equivalent on the UK Biobank dataset and 0.91 diopters (95% CI: 0.89-0.92) for the AREDS dataset. The baseline expected MAE (obtained by simply predicting the mean of this population) is 1.81 diopters (95% CI: 1.79-1.84) for UK Biobank and 1.63 (95% CI: 1.60-1.67) for AREDS. Attention maps suggest that the foveal region is one of the most important areas used by the algorithm to make this prediction, though other regions also contribute to the prediction. Conclusions: The ability to estimate refractive error with high accuracy from retinal fundus photos has not been previously known and demonstrates that deep learning can be applied to make novel predictions from medical images. In addition, given that several groups have recently shown that it is feasible to obtain retinal fundus photos using mobile phones and inexpensive attachments, this work may be particularly relevant in regions of the world where autorefractors may not be readily available. View details
    Improving the specificity of lung cancer screening CT using deep learning
    Diego Ardila
    Bokyung Choi
    Atilla Peter Kiraly
    Sujeeth Bharadwaj
    Joshua Reicher
    Lily Peng
    Shravya Ramesh Shetty
    RSNA (2018)
Preview abstract PURPOSE Evaluate the utility of deep learning to improve the specificity and sensitivity of lung cancer screening with low-dose helical computed tomography (LDCT), relative to the Lung-RADS guidelines. METHOD AND MATERIALS We analyzed 42,943 CT studies from 14,863 patients, 620 of which developed biopsy-confirmed cancer. All cases were from the National Lung Screening Trial (NLST) study. We randomly split patients into training (70%), tuning (15%), and test (15%) sets. A study was marked "true" if the patient was diagnosed with biopsy-confirmed lung cancer in the same screening year as the study. A deep learning model was trained over 3D CT volumes (400x512x512) as input. We used the 95% specificity operating point based on the tuning set, and evaluated our approach on the test set. To estimate radiologist performance, we retrospectively applied Lung-RADS criteria to each study in the test set. Lung-RADS categories 1 to 2 constitute negative screening results, and categories 3 to 4 constitute positive results. Neither the model nor the Lung-RADS results took into account prior studies, but all screening years were utilized in evaluation. RESULTS The area under the receiver operator curve of the deep learning model was 94.2% (95% CI 91.0, 96.9). Compared to Lung-RADS on the test set, the trained model achieved a statistically significant absolute 9.2% (95% CI 8.4, 10.1) higher specificity and trended toward a 3.4% (95% CI -5.2, 12.6) higher sensitivity (not statistically significant). Radiologists qualitatively reviewed disagreements between the model and Lung-RADS. Preliminary analysis suggests that the model may be superior in distinguishing scarring from early malignancy. CONCLUSION A deep learning based model improved the specificity of lung cancer screening over Lung-RADS on the NLST dataset and could potentially help reduce unnecessary procedures. This research could supplement future versions of Lung-RADS, or support assisted read or second read workflows. CLINICAL RELEVANCE/APPLICATION While the Lung-RADS criteria are recommended for lung cancer screening with LDCT, there is still an opportunity to reduce false-positive rates, which lead to unnecessary invasive procedures. View details
    Detecting Cancer Metastases on Gigapixel Pathology Images
    Krishna Kumar Gadepalli
    Mohammad Norouzi
    George Dahl
    Timo Kohlberger
    Subhashini Venugopalan
    Aleksei Timofeev
    Jason Hipp
    Lily Peng
    Martin Stumpe
    arXiv (2017)
    Preview abstract Each year, the treatment decisions for more than 230,000 breast cancer patients in the U.S. hinge on whether the cancer has metastasized away from the breast. Metastasis detection is currently performed by pathologists reviewing large expanses of biological tissues. This process is labor intensive and error-prone. We present a framework to automatically detect and localize tumors as small as 100 x 100 pixels in gigapixel microscopy images sized 100,000 x 100,000 pixels. Our method leverages a convolutional neural network (CNN) architecture and obtains state-of-the-art results on the Camelyon16 dataset in the challenging lesion-level tumor detection task. At 8 false positives per image, we detect 92.4% of the tumors, relative to 82.7% by the previous best automated approach. For comparison, a human pathologist attempting exhaustive search achieved 73.2% sensitivity. We achieve image-level AUC scores above 97% on both the Camelyon16 test set and an independent set of 110 slides. In addition, we discover that two slides in the Camelyon16 training set were erroneously labeled normal. Our approach could considerably reduce false negative rates in metastasis detection. View details
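Detection in gigapixel slides is typically done patch by patch: score each tile with a CNN, then stitch the scores into a heatmap. A sketch of that sliding-window pattern with a toy "slide" and "model"; the patch and stride sizes are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def predict_heatmap(slide, model, patch=299, stride=128):
    """Slide a window across a large image and stitch per-patch tumor
    probabilities into a coarse heatmap."""
    h, w = slide.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols), np.float32)
    for i in range(rows):
        for j in range(cols):
            tile = slide[i*stride:i*stride+patch, j*stride:j*stride+patch]
            heat[i, j] = model(tile)   # per-patch P(tumor)
    return heat

# Toy stand-ins: a random "slide" and a "model" that scores mean intensity.
slide = np.random.default_rng(0).uniform(size=(2048, 2048))
heat = predict_heatmap(slide, model=lambda t: float(t.mean()))
print(heat.shape)
```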
    Wide & Deep Learning for Recommender Systems
    Levent Koc
    Tal Shaked
    Glen Anderson
    Wei Chai
    Mustafa Ispir
    Rohan Anil
    Lichan Hong
    Vihan Jain
    Xiaobing Liu
    Hemal Shah
    arXiv:1606.07792 (2016)
Preview abstract Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning---jointly trained wide linear models and deep neural networks---to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. View details
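A compact sketch of the architecture in Keras: a linear "wide" path over sparse cross-product features and a "deep" path over learned embeddings, joined in a single logistic output. The paper trains the two paths with separate optimizers (FTRL for wide, AdaGrad for deep); this sketch simplifies to one optimizer, and all feature sizes are invented:

```python
import tensorflow as tf

n_cross_features = 1000       # hashed cross-product features (wide path)
vocab_size, emb_dim = 500, 8  # categorical vocabulary for the deep path

wide_in = tf.keras.Input(shape=(n_cross_features,), name="wide")   # multi-hot
deep_in = tf.keras.Input(shape=(3,), dtype="int32", name="deep")   # 3 categorical ids

emb = tf.keras.layers.Embedding(vocab_size, emb_dim)(deep_in)
deep = tf.keras.layers.Flatten()(emb)
for units in (64, 32):
    deep = tf.keras.layers.Dense(units, activation="relu")(deep)

# Joint training: the final sigmoid sees both paths at once.
both = tf.keras.layers.Concatenate()([wide_in, deep])
out = tf.keras.layers.Dense(1, activation="sigmoid")(both)

model = tf.keras.Model([wide_in, deep_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```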
Preview abstract We propose a simple, elegant solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no change in the model architecture from our base system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes encoder, decoder and attention, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. Our method often improves the translation quality of all involved language pairs, even while keeping the total number of model parameters constant. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English->French and surpasses state-of-the-art results for English->German. Similarly, a single multilingual model surpasses state-of-the-art results for French->English and German->English on WMT'14 and WMT'15 benchmarks respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation are possible for neural translation. Finally, we show analyses that hint at a universal interlingua representation in our models and show some interesting examples when mixing languages. View details
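The entire architectural change is a preprocessing step: prepend an artificial token naming the target language. A minimal sketch of that step (the `<2xx>` token format follows the paper's examples):

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend an artificial token telling the shared model which language
    to translate into, e.g. '<2es>' for Spanish."""
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("How are you?", "es"))   # -> "<2es> How are you?"
print(add_target_token("How are you?", "ja"))   # -> "<2ja> How are you?"
```

Because the encoder, decoder, and attention are shared, the same trained model can then be pointed at any target language, including pairs never seen together in training (the zero-shot case).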
    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
    Mike Schuster
    Mohammad Norouzi
    Maxim Krikun
    Yuan Cao
    Qin Gao
    Apurva Shah
    Xiaobing Liu
    Łukasz Kaiser
    Stephan Gouws
    Taku Kudo
    Keith Stevens
    George Kurian
    Nishant Patil
    Wei Wang
    Cliff Young
    Jason Smith
    Alex Rudnick
    Macduff Hughes
    CoRR, vol. abs/1609.08144 (2016)
    Preview abstract Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system. View details
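The beam search described here scores a candidate translation Y for source X as s(Y, X) = log P(Y|X) / lp(Y) + cp(X; Y), where lp is the length normalization and cp the coverage penalty over the attention matrix. A sketch of those two terms as given in the paper, with illustrative values for the constants alpha and beta:

```python
import math

def length_penalty(length, alpha=0.6):
    """GNMT length normalization: lp(Y) = ((5 + |Y|) / 6) ** alpha."""
    return ((5 + length) / 6) ** alpha

def coverage_penalty(attention, beta=0.2):
    """Coverage penalty over an attention matrix attention[i][j]
    (target step i attending to source word j): reward covering
    every source word at least once."""
    n_src = len(attention[0])
    total = 0.0
    for j in range(n_src):
        covered = sum(row[j] for row in attention)
        total += math.log(min(covered, 1.0))
    return beta * total

def beam_score(log_prob, length, attention):
    return log_prob / length_penalty(length) + coverage_penalty(attention)

attn = [[0.9, 0.1], [0.2, 0.8]]   # toy 2-step attention over 2 source words
print(beam_score(log_prob=-3.2, length=2, attention=attn))
```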
    Smart Reply: Automated Response Suggestion for Email
    Karol Kurach
    Sujith Ravi
    Tobias Kaufman
    Laszlo Lukacs
    Peter Young
    Vivek Ramavajjala
Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2016)
    Preview abstract In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning. We describe the architecture of the system as well as the challenges that we faced while building it, like response diversity and scalability. We also introduce a new method for semantic clustering of user-generated content that requires only a modest amount of explicitly labeled data. View details
    BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
    Stephan Gouws
    Yoshua Bengio
    Proceedings of the 32nd International Conference on Machine Learning (2015)
Preview abstract We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data. View details
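The cross-lingual objective can be sketched as pulling together the mean word vectors of an aligned sentence pair. This is a simplification of the paper's sampled, noise-contrastive training (no subsampling, no monolingual terms), and all data below is random:

```python
import numpy as np

def bilbowa_loss(emb_src, emb_tgt, src_ids, tgt_ids):
    """Cross-lingual regularizer (a sketch of the BilBOWA idea): squared
    distance between the bag-of-words means of an aligned sentence pair."""
    m_src = emb_src[src_ids].mean(axis=0)
    m_tgt = emb_tgt[tgt_ids].mean(axis=0)
    return float(np.sum((m_src - m_tgt) ** 2))

rng = np.random.default_rng(0)
E_en = rng.normal(scale=0.1, size=(1000, 64))   # English embedding table (toy)
E_fr = rng.normal(scale=0.1, size=(1000, 64))   # French embedding table (toy)
print(bilbowa_loss(E_en, E_fr, src_ids=[3, 17, 42], tgt_ids=[5, 99]))
```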
    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
    Ashish Agarwal
    Eugene Brevdo
    Craig Citro
    Matthieu Devin
    Ian Goodfellow
    Andrew Harp
    Geoffrey Irving
    Yangqing Jia
    Rafal Jozefowicz
    Lukasz Kaiser
    Manjunath Kudlur
    Dan Mané
    Rajat Monga
    Chris Olah
    Mike Schuster
    Jonathon Shlens
    Benoit Steiner
    Ilya Sutskever
    Kunal Talwar
    Paul Tucker
    Vijay Vasudevan
    Pete Warden
    Yuan Yu
    Xiaoqiang Zheng
    tensorflow.org (2015)
    Preview abstract TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org. View details
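A minimal example of expressing and differentiating a computation with TensorFlow. This is written in today's eager style; the 2015 paper describes the original graph-based API, but the portability across devices is the same:

```python
import tensorflow as tf

# Express a computation as TensorFlow ops; the same program can run on
# CPU, GPU, or other devices without change.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.Variable(tf.random.normal([2, 1]))

with tf.GradientTape() as tape:
    y = tf.matmul(x, w)          # a tiny linear model
    loss = tf.reduce_mean(y**2)

grad = tape.gradient(loss, w)    # automatic differentiation
print(loss.numpy(), grad.numpy())
```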
    Zero-Shot Learning by Convex Combination of Semantic Embeddings
    Mohammad Norouzi
    Tomas Mikolov
    Samy Bengio
    Yoram Singer
    Jonathon Shlens
    Andrea Frome
    International Conference on Learning Representations (2014)
Preview abstract Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional n-way classification framing of image understanding, particularly in terms of the promise for zero-shot learning -- the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing n-way image classifier and a semantic word embedding model that contains the n class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of the advantages of more complex image embedding systems, and indeed outperforms state of the art methods on the ImageNet zero-shot learning task. View details
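The convex-combination mapping is simple enough to show end to end: weight the embeddings of the classifier's top predicted seen classes by their (renormalized) probabilities, then find the nearest unseen label in the embedding space. A sketch with random stand-ins for the classifier and the word embeddings:

```python
import numpy as np

def conse_embedding(probs, class_embeddings, top_k=3):
    """Map an image into semantic space as a convex combination of the
    embeddings of the classifier's top-k predicted (seen) classes."""
    top = np.argsort(-probs)[:top_k]
    weights = probs[top] / probs[top].sum()   # renormalize to a convex combination
    return weights @ class_embeddings[top]

rng = np.random.default_rng(0)
seen_emb = rng.normal(size=(10, 50))      # word embeddings of 10 seen classes
unseen_emb = rng.normal(size=(5, 50))     # embeddings of 5 never-seen classes
probs = rng.dirichlet(np.ones(10))        # softmax output of an n-way classifier

z = conse_embedding(probs, seen_emb)
sims = (unseen_emb @ z) / (np.linalg.norm(unseen_emb, axis=1) * np.linalg.norm(z))
print("zero-shot label:", int(np.argmax(sims)))   # nearest unseen class
```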
    Distributed Representations of Words and Phrases and their Compositionality
    Tomas Mikolov
    Ilya Sutskever
    Kai Chen
    Neural Information Processing Systems (NIPS) (2013)
    Preview abstract The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling the frequent words we obtain a significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible. View details
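    Two of the extensions above are easy to state precisely. The sketch below shows one negative-sampling update (only k+1 output vectors are touched, versus the whole vocabulary for a full softmax) and the subsampling keep-probability; the vocabulary size, dimensions, and indices are toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(center, context, negatives, W_in, W_out, lr=0.025):
    """One skip-gram update with negative sampling: push the true
    (center, context) pair together and k sampled noise words apart.
    (A real implementation would resample negatives that collide
    with the context word.)"""
    v = W_in[center]
    ids = np.concatenate(([context], negatives))
    labels = np.zeros(len(ids)); labels[0] = 1.0   # 1 for the true context
    scores = sigmoid(W_out[ids] @ v)
    err = scores - labels                          # logistic-loss gradient
    grad_v = err @ W_out[ids]
    W_out[ids] -= lr * np.outer(err, v)            # update k+1 output rows only
    W_in[center] -= lr * grad_v

def keep_prob(freq, t=1e-5):
    """Subsampling of frequent words: the paper discards a word of corpus
    frequency f with probability 1 - sqrt(t/f), so we keep with sqrt(t/f)."""
    return min(1.0, np.sqrt(t / freq))

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
V, D = 5000, 100
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = np.zeros((V, D))
negative_sampling_step(center=42, context=7, negatives=rng.integers(0, V, 5),
                       W_in=W_in, W_out=W_out)
print("keep prob. of a word with frequency 1e-3:", keep_prob(1e-3))
```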
    Efficient Estimation of Word Representations in Vector Space
    Tomas Mikolov
    Kai Chen
    International Conference on Learning Representations (2013)
    Preview abstract We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a data set of 1.6 billion words. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities. View details
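    The syntactic and semantic regularities the paper measures are typically probed with vector arithmetic: "a is to b as c is to ?" is answered by the word nearest to v(b) - v(a) + v(c). A self-contained numpy sketch (the random stand-in vectors will not produce meaningful analogies; trained word vectors would):

```python
import numpy as np

def analogy(emb, vocab, a, b, c, topn=1):
    """Answer 'a is to b as c is to ?': return the word(s) whose vector is
    nearest (by cosine) to v(b) - v(a) + v(c), excluding the query words."""
    ids = {w: i for i, w in enumerate(vocab)}
    q = emb[ids[b]] - emb[ids[a]] + emb[ids[c]]
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-12)
    for w in (a, b, c):
        sims[ids[w]] = -np.inf
    return [vocab[i] for i in np.argsort(sims)[::-1][:topn]]

# Random stand-in vectors; with trained embeddings one would expect
# analogy(emb, vocab, "man", "king", "woman") -> ["queen"].
rng = np.random.default_rng(0)
vocab = ["man", "king", "woman", "queen", "apple"]
emb = rng.normal(size=(len(vocab), 8))
print(analogy(emb, vocab, "man", "king", "woman"))
```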
    DeViSE: A Deep Visual-Semantic Embedding Model
    Andrea Frome
    Jonathon Shlens
    Samy Bengio
    Marc’Aurelio Ranzato
    Tomas Mikolov
    Neural Information Processing Systems (NIPS) (2013)
    Preview abstract Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. View details
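    The core training signal in DeViSE is a hinge rank loss: the image's visual features, linearly projected into the word-embedding space, should score higher against the true label's embedding than against any other label's, by a margin. A numpy sketch with hypothetical dimensions (4096-d visual features, 500-d embeddings, 1000 labels):

```python
import numpy as np

def devise_hinge_rank_loss(img_vec, M, label_emb, label_idx, margin=0.1):
    """Hinge rank loss: penalize every wrong label whose dot-product score
    against the projected image comes within `margin` of the true label's."""
    proj = M @ img_vec                 # map visual features into word space
    scores = label_emb @ proj          # similarity to every label embedding
    viol = np.maximum(0.0, margin - scores[label_idx] + scores)
    viol[label_idx] = 0.0              # the true label incurs no penalty
    return viol.sum()

# Toy shapes and random stand-ins for a trained visual model and word vectors.
rng = np.random.default_rng(0)
M = rng.normal(scale=0.01, size=(500, 4096))   # learned linear projection
label_emb = rng.normal(size=(1000, 500))
img_vec = rng.normal(size=4096)
print(devise_hinge_rank_loss(img_vec, M, label_emb, label_idx=3))
```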
    Building high-level features using large scale unsupervised learning
    Marc'Aurelio Ranzato
    Rajat Monga
    Matthieu Devin
    Kai Chen
    Andrew Ng
    International Conference on Machine Learning (2012)
    Preview abstract We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art. View details
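    Of the preprocessing pieces named in the abstract, local contrast normalization is simple to illustrate. The sketch below shows one plausible form (subtract a local mean, divide by a local standard deviation); the neighbourhood size and epsilon are assumptions, not values from the paper:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_contrast_normalize(img, size=9, eps=1e-4):
    """Subtract each pixel's local mean and divide by its local standard
    deviation, both computed over a size x size neighbourhood."""
    mean = uniform_filter(img, size=size)
    centered = img - mean
    var = uniform_filter(centered ** 2, size=size)
    return centered / np.sqrt(var + eps)

# Toy 200x200 grayscale frame, matching the paper's input resolution.
rng = np.random.default_rng(0)
frame = rng.random((200, 200))
out = local_contrast_normalize(frame)
print(out.mean(), out.std())
```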
    Large Scale Distributed Deep Networks
    Rajat Monga
    Kai Chen
    Matthieu Devin
    Mark Z. Mao
    Marc’Aurelio Ranzato
    Paul Tucker
    Ke Yang
    Andrew Y. Ng
    NIPS (2012)
    Preview abstract Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly-sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm. View details
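    Downpour SGD's essential move is that model replicas fetch possibly stale parameters, compute gradients on their own data shard, and push updates back asynchronously. Below is a single-machine toy sketch, with threads standing in for replicas and one in-process object standing in for the (in reality sharded) parameter server; everything here is illustrative, not the paper's implementation:

```python
import threading
import numpy as np

class ParameterServer:
    """Toy centralized parameter store: replicas read and write without
    coordinating with each other, so updates interleave asynchronously."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()   # protects only the apply step

    def fetch(self):
        return self.w.copy()

    def push_gradient(self, grad):
        with self.lock:
            self.w -= self.lr * grad

def replica(ps, X, y, steps=200):
    rng = np.random.default_rng()
    for _ in range(steps):
        w = ps.fetch()                      # possibly stale parameters
        i = rng.integers(len(X))
        grad = (X[i] @ w - y[i]) * X[i]     # least-squares gradient, one example
        ps.push_gradient(grad)              # asynchronous update

# Hypothetical setup: 4 model replicas jointly fitting a shared linear model.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(1000, 10)), rng.normal(size=10)
y = X @ true_w
ps = ParameterServer(dim=10)
threads = [threading.Thread(target=replica, args=(ps, X, y)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("parameter error:", np.linalg.norm(ps.w - true_w))
```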
    Three Controversial Hypotheses Concerning Computation in the Primate Cortex
    Thomas Dean
    Jonathon Shlens
    Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI Press (2012)
    Preview abstract We consider three hypotheses concerning the primate neocortex which have influenced computational neuroscience in recent years. Is the mind modular in terms of its being profitably described as a collection of relatively independent functional units? Does the regular structure of the cortex imply a single algorithm at work, operating on many different inputs in parallel? Can the cognitive differences between humans and our closest primate relatives be explained in terms of a scalable cortical architecture? We bring to bear diverse sources of evidence to argue that the answers to each of these questions - with some judicious qualifications - are in the affirmative. In particular, we argue that while our higher cognitive functions may interact in a complicated fashion, many of the component functions operate through well-defined interfaces and, perhaps more important, are built on a neural substrate that scales easily under the control of a modular genetic architecture. Processing in the primary sensory cortices seems amenable to similar algorithmic principles, and, even for those cases where alternative principles are at play, the regular structure of cortex allows the same or greater advantages as the architecture scales. Similar genetic machinery to that used by nature to scale body plans has apparently been applied to scale cortical computations. The resulting replicated computing units can be used to build larger working memory and support deeper recursions needed to qualitatively improve our abilities to handle language, abstraction and social interaction. View details
    Preview abstract We present a new approach to learning sparse, spatiotemporal features and demonstrate the utility of the approach by applying the resulting sparse codes to the problem of activity recognition. Learning features that discriminate among human activities in video is difficult in part because the stable space-time events that reliably characterize the relevant motions are rare. To overcome this problem, we adopt a multi-stage approach to activity recognition. In the initial preprocessing stage, we first whiten and apply local contrast normalization to each frame of the video. We then apply an additional set of filters to identify and extract salient space-time volumes that exhibit smooth periodic motion. We collect a large corpus of these space-time volumes as training data for the unsupervised learning of a sparse, over-complete basis using a variant of the two-phase analysis-synthesis algorithm of Olshausen and Field [1997]. The synthesis phase, which consists of reconstructing the input as a sparse combination of basis vectors with mostly zero coefficients, dominates the time required for reconstruction in subsequent production use; we therefore adapted existing algorithms to exploit potential parallelism through the use of readily-available SIMD hardware. To obtain better codes, we developed a new approach to learning sparse, spatiotemporal codes in which the number of basis vectors, their orientations, velocities and the size of their receptive fields change over the duration of unsupervised training. The algorithm starts with a relatively small, initial basis with minimal temporal extent. This initial basis is obtained through conventional sparse coding techniques and is expanded over time by recursively constructing a new basis consisting of basis vectors with larger temporal extent that proportionally conserve regions of previously trained weights. These proportionally conserved weights are combined with the result of adjusting newly added weights to represent a greater range of primitive motion features. The size of the current basis is determined probabilistically by sampling from existing basis vectors according to their activation on the training set. The resulting algorithm produces bases consisting of filters that are bandpass, spatially oriented and temporally diverse in terms of their transformations and velocities. We demonstrate the utility of our approach by using it to recognize human activity in video. View details
    Recursive Sparse Spatiotemporal Coding
    Thomas Dean
    Proceedings of the Fifth IEEE International Workshop on Multimedia Information Processing and Retrieval, IEEE Computer Society (2009)
    Preview abstract We present a new approach to learning sparse, spatiotemporal codes in which the number of basis vectors, their orientations, velocities and the size of their receptive fields change over the duration of unsupervised training. The algorithm starts with a relatively small, initial basis with minimal temporal extent. This initial basis is obtained through conventional sparse coding techniques and is expanded over time by recursively constructing a new basis consisting of basis vectors with larger temporal extent that proportionally conserve regions of previously trained weights. These proportionally conserved weights are combined with the result of adjusting newly added weights to represent a greater range of primitive motion features. The size of the current basis is determined probabilistically by sampling from existing basis vectors according to their activation on the training set. The resulting algorithm produces bases consisting of filters that are bandpass, spatially oriented and temporally diverse in terms of their transformations and velocities. The basic methodology borrows inspiration from the layer-by-layer learning of multiple-layer restricted Boltzmann machines developed by Geoff Hinton and his students. Indeed, we can learn multiple-layer sparse codes by training a stack of denoising autoencoders, but we have had greater success using L1 regularized regression in a variation on Olshausen and Field's original SPARSENET. To accelerate learning and focus attention, we apply a space-time interest-point operator that selects for periodic motion. This attentional mechanism enables us to efficiently compute and compactly represent a broad range of interesting motion. We demonstrate the utility of our approach by using it to recognize human activity in video. Our algorithm meets or exceeds the performance of state-of-the-art activity-recognition methods. View details
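    The synthesis phase both of these abstracts refer to is an L1-regularized reconstruction: find coefficients a minimizing 0.5 * ||x - B a||^2 + lam * ||a||_1 over a learned basis B. A numpy sketch using iterative shrinkage-thresholding (ISTA) as one standard solver; the basis size and hyperparameters are toy values, not those of the paper:

```python
import numpy as np

def sparse_code_ista(x, B, lam=0.1, lr=0.1, iters=100):
    """Synthesis step of L1-regularized sparse coding (in the spirit of
    SPARSENET): gradient step on the reconstruction term, followed by a
    soft-threshold that drives most coefficients exactly to zero."""
    a = np.zeros(B.shape[1])
    for _ in range(iters):
        grad = B.T @ (B @ a - x)        # gradient of 0.5 * ||x - B a||^2
        a = a - lr * grad
        a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0.0)
    return a

# Toy basis over flattened space-time volumes (hypothetical sizes).
rng = np.random.default_rng(0)
B = rng.normal(size=(256, 512))
B /= np.linalg.norm(B, axis=0)          # unit-norm basis vectors
x = rng.normal(size=256)
a = sparse_code_ista(x, B)
print("nonzero coefficients:", np.count_nonzero(a))
```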