Greg Corrado

Greg Corrado

Greg Corrado is a senior research scientist interested in biological neuroscience, artificial intelligence, and scalable machine learning. He has published in fields ranging across behavioral economics, neuromorphic device physics, systems neuroscience, and deep learning. At Google he has worked for some time on brain inspired computing, and most recently has served as one of the founding members and the co-technical lead of Google's large scale deep neural networks project.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Towards Generalist Biomedical AI
    Danny Driess
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Aakanksha Chowdhery
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    Karan Singhal
    Pete Florence
    NEJM AI (2024)
    Preview abstract BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery. METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports. RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems. View details
    Towards Conversational Diagnostic AI
    Anil Palepu
    Khaled Saab
    Jan Freyberg
    Ryutaro Tanno
    Amy Wang
    Brenna Li
    Nenad Tomašev
    Karan Singhal
    Le Hou
    Albert Webson
    Kavita Kulkarni
    Sara Mahdavi
    Juro Gottweis
    Joelle Barral
    Kat Chou
    Arxiv (2024) (to appear)
    Preview abstract At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI. View details
    Preview abstract Background: Skin conditions are extremely common worldwide, and are an important cause of both anxiety and morbidity. Since the advent of the internet, individuals have used text-based search (eg, “red rash on arm”) to learn more about concerns on their skin, but this process is often hindered by the inability to accurately describe the lesion’s morphology. In the study, we surveyed respondents’ experiences with an image-based search, compared to the traditional text-based search experience. Methods: An internet-based survey was conducted to evaluate the experience of text-based vs image-based search for skin conditions. We recruited respondents from an existing cohort of volunteers in a commercial survey panel; survey respondents that met inclusion/exclusion criteria, including willingness to take photos of a visible concern on their body, were enrolled. Respondents were asked to use the Google mobile app to conduct both regular text-based search (Google Search) and image-based search (Google Lens) for their concern, with the order of text vs. image search randomized. Satisfaction for each search experience along six different dimensions were recorded and compared, and respondents’ preferences for the different search types along these same six dimensions were recorded. Results: 372 respondents were enrolled in the study, with 44% self-identifying as women, 86% as White and 41% over age 45. The rate of respondents who were at least moderately familiar with searching for skin conditions using text-based search versus image-based search were 81.5% and 63.5%, respectively. After using both search modalities, respondents were highly satisfied with both image-based and text-based search, with >90% at least somewhat satisfied in each dimension and no significant differences seen between text-based and image-based search when examining the responses on an absolute scale per search modality. When asked to directly rate their preferences in a comparative way, survey respondents preferred image-based search over text-based search in 5 out of 6 dimensions, with an absolute 9.9% more preferring image-based search over text-based search overall (p=0.004). 82.5% (95% CI 78.2 - 86.3) reported a preference to leverage image-based search (alone or in combination with text-based search) in future searches. Of those who would prefer to use a combination of both, 64% indicated they would like to start with image-based search, indicating that image-based search may be the preferred entry point for skin-related searches. Conclusion: Despite being less familiar with image-based search upon study inception, survey respondents generally preferred image-based search to text-based search and overwhelmingly wanted to include this in future searches. These results suggest the potential for image-based search to play a key role in people searching for information regarding skin concerns. View details
    Preview abstract Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision–language modeling raise new oppor- tunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image–text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision–language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image–text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings. View details
    Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the United States and Japan
    Atilla Kiraly
    Corbin Cunningham
    Ryan Najafi
    Jie Yang
    Chuck Lau
    Diego Ardila
    Scott Mayer McKinney
    Rory Pilgrim
    Mozziyar Etemadi
    Sunny Jansen
    Lily Peng
    Shravya Shetty
    Neeral Beladia
    Krish Eswaran
    Radiology: Artificial Intelligence (2024)
    Preview abstract Lung cancer is the leading cause of cancer death world-wide with 1.8 million deaths in 20201. Studies have concluded that low-dose computed tomography lung cancer screening can reduce mortality by up to 61%2 and updated 2021 US guidelines expanded eligibility. As screening efforts rise, AI can play an important role, but must be unobtrusively integrated into existing clinical workflows. In this work, we introduce a state-of-the-art, cloud-based AI system providing lung cancer risk assessments without requiring any user input. We demonstrate its efficacy in assisting lung cancer screening under both US and Japanese screening settings using different patient populations and screening protocols. Technical improvements over a previously described system include a focus on earlier cancer detection for improved accuracy, introduction of an effective assistive user interface, and a system designed to integrate into typical clinical workflows. The stand-alone AI system was evaluated on 3085 individuals achieving area under the curve (AUC) scores of 91.7% (95%CI [89.6, 95.2]), 93.3% (95%CI [90.2, 95.7]), and 89.1% (95%CI [77.7, 97.3]) on three datasets (two from US and one from Japan), respectively. To evaluate the system’s assistive ability, we conducted two retrospective multi-reader multi-case studies on 627 cases read by experienced board certified radiologists (average 20 years of experience [7,40]) using local PACS systems in the respective US and Japanese screening settings. The studies measured the reader’s level of suspicion (LoS) and categorical responses for scores and management recommendations under country-specific screening protocols. The radiologists’ AUC for LoS increased with AI assistance by 2.3% (95%CI [0.1-4.5], p=0.022) for the US study and by 2.3% (95%CI [-3.5-8.1], p=0.179) for the Japan study. Specificity for recalls increased by 5.5% (95%CI [2.7-8.5], p<0.0001) for the US and 6.7% (95%CI [4.7-8.7], p<0.0001) for the Japan study. No significant reduction in other metrics occured. This work advances the state-of-the-art in lung cancer detection, introduces generalizable interface concepts that can be applicable to similar AI applications, and demonstrates its potential impact on diagnostic AI in global lung cancer screening with results suggesting a substantial drop in unnecessary follow-up procedures without impacting sensitivity. View details
    Preview abstract Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare. View details
    Prospective Multi-Site Validation of AI to Detect Tuberculosis and Chest X-Ray Abnormalities
    Sahar Kazemzadeh
    Atilla Kiraly
    Nsala Sanjase
    Minyoi Maimbolwa
    Brian Shuma
    Shahar Jamshy
    Christina Chen
    Arnav Agharwal
    Chuck Lau
    Daniel Golden
    Jin Yu
    Eric Wu
    Kat Chou
    Shravya Shetty
    Krish Eswaran
    Rory Pilgrim
    Monde Muyoyeta
    NEJM AI (2024)
    Preview abstract Background Using artificial intelligence (AI) to interpret chest X-rays (CXRs) could support accessible triage tests for active pulmonary tuberculosis (TB) in resource-constrained settings. Methods The performance of two cloud-based CXR AI systems — one to detect TB and the other to detect CXR abnormalities — in a population with a high TB and human immunodeficiency virus (HIV) burden was evaluated. We recruited 1978 adults who had TB symptoms, were close contacts of known TB patients, or were newly diagnosed with HIV at three clinical sites. The TB-detecting AI (TB AI) scores were converted to binary using two thresholds: a high-sensitivity threshold and an exploratory threshold designed to resemble radiologist performance. Ten radiologists reviewed images for signs of TB, blinded to the reference standard. Primary analysis measured AI detection noninferiority to radiologist performance. Secondary analysis evaluated AI detection as compared with the World Health Organization (WHO) targets (90% sensitivity, 70% specificity). Both used an absolute margin of 5%. The abnormality-detecting AI (abnormality AI) was evaluated for noninferiority to a high-sensitivity target suitable for triaging (90% sensitivity, 50% specificity). Results Of the 1910 patients analyzed, 1827 (96%) had conclusive TB status, of which 649 (36%) were HIV positive and 192 (11%) were TB positive. The TB AI’s sensitivity and specificity were 87% and 70%, respectively, at the high-sensitivity threshold and 78% and 82%, respectively, at the balanced threshold. Radiologists’ mean sensitivity was 76% and mean specificity was 82%. At the high-sensitivity threshold, the TB AI was noninferior to average radiologist sensitivity (P<0.001) but not to average radiologist specificity (P=0.99) and was higher than the WHO target for specificity but not sensitivity. At the balanced threshold, the TB AI was comparable to radiologists. The abnormality AI’s sensitivity and specificity were 97% and 79%, respectively, with both meeting the prespecified targets. Conclusions The CXR TB AI was noninferior to radiologists for active pulmonary TB triaging in a population with a high TB and HIV burden. Neither the TB AI nor the radiologists met WHO recommendations for sensitivity in the study population. AI can also be used to detect other CXR abnormalities in the same population. View details
    Preview abstract Background Skin conditions are extremely common worldwide, and are an important cause of both anxiety and morbidity. Since the advent of the internet, individuals have used text-based search (eg, “red rash on arm”) to learn more about concerns on their skin, but this process is often hindered by the inability to accurately describe the lesion’s morphology. In the study, we surveyed respondents’ experiences with an image-based search, compared to the traditional text-based search experience. Methods An internet-based survey was conducted to evaluate the experience of text-based vs image-based search for skin conditions. We recruited respondents from an existing cohort of volunteers in a commercial survey panel; survey respondents that met inclusion/exclusion criteria, including willingness to take photos of a visible concern on their body, were enrolled. Respondents were asked to use the Google mobile app to conduct both regular text-based search (Google Search) and image-based search (Google Lens) for their concern, with the order of text vs. image search randomized. Satisfaction for each search experience along six different dimensions were recorded and compared, and respondents’ preferences for the different search types along these same six dimensions were recorded. Results 372 respondents were enrolled in the study, with 44% self-identifying as women, 86% as White and 41% over age 45. The rate of respondents who were at least moderately familiar with searching for skin conditions using text-based search versus image-based search were 81.5% and 63.5%, respectively. After using both search modalities, respondents were highly satisfied with both image-based and text-based search, with >90% at least somewhat satisfied in each dimension and no significant differences seen between text-based and image-based search when examining the responses on an absolute scale per search modality. When asked to directly rate their preferences in a comparative way, survey respondents preferred image-based search over text-based search in 5 out of 6 dimensions, with an absolute 9.9% more preferring image-based search over text-based search overall (p=0.004). 82.5% (95% CI 78.2 - 86.3) reported a preference to leverage image-based search (alone or in combination with text-based search) in future searches. Of those who would prefer to use a combination of both, 64% indicated they would like to start with image-based search, indicating that image-based search may be the preferred entry point for skin-related searches. Conclusion Despite being less familiar with image-based search upon study inception, survey respondents generally preferred image-based search to text-based search and overwhelmingly wanted to include this in future searches. These results suggest the potential for image-based search to play a key role in people searching for information regarding skin concerns. View details
    Preview abstract Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how to best design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components—GPPEs—from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended. View details
    Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements
    Abbi Ward
    Jimmy Li
    Julie Wang
    Sriram Lakshminarasimhan
    Ashley Carrick
    Jay Hartford
    Pradeep Kumar S
    Sunny Virmani
    Renee Wong
    Margaret Ann Smith
    Dawn Siegel
    Steven Lin
    Justin Ko
    JAMA Network Open (2024)
    Preview abstract Importance: Health datasets from clinical sources do not reflect the breadth and diversity of disease, impacting research, medical education, and artificial intelligence tool development. Assessments of novel crowdsourcing methods to create health datasets are needed. Objective: To evaluate if web search advertisements (ads) are effective at creating a diverse and representative dermatology image dataset. Design, Setting, and Participants: This prospective observational survey study, conducted from March to November 2023, used Google Search ads to invite internet users in the US to contribute images of dermatology conditions with demographic and symptom information to the Skin Condition Image Network (SCIN) open access dataset. Ads were displayed against dermatology-related search queries on mobile devices, inviting contributions from adults after a digital informed consent process. Contributions were filtered for image safety and measures were taken to protect privacy. Data analysis occurred January to February 2024. Exposure: Dermatologist condition labels as well as estimated Fitzpatrick Skin Type (eFST) and estimated Monk Skin Tone (eMST) labels. Main Outcomes and Measures: The primary metrics of interest were the number, quality, demographic diversity, and distribution of clinical conditions in the crowdsourced contributions. Spearman rank order correlation was used for all correlation analyses, and the χ2 test was used to analyze differences between SCIN contributor demographics and the US census. Results: In total, 5749 submissions were received, with a median of 22 (14-30) per day. Of these, 5631 (97.9%) were genuine images of dermatological conditions. Among contributors with self-reported demographic information, female contributors (1732 of 2596 contributors [66.7%]) and younger contributors (1329 of 2556 contributors [52.0%] aged <40 years) had a higher representation in the dataset compared with the US population. Of 2614 contributors who reported race and ethnicity, 852 (32.6%) reported a racial or ethnic identity other than White. Dermatologist confidence in assigning a differential diagnosis increased with the number of self-reported demographic and skin-condition–related variables (Spearman R = 0.1537; P < .001). Of 4019 contributions reporting duration since onset, 2170 (54.0%) reported onset within less than 7 days of submission. Of the 2835 contributions that could be assigned a dermatological differential diagnosis, 2523 (89.0%) were allergic, infectious, or inflammatory conditions. eFST and eMST distributions reflected the geographical origin of the dataset. Conclusions and Relevance: The findings of this survey study suggest that search ads are effective at crowdsourcing dermatology images and could therefore be a useful method to create health datasets. The SCIN dataset bridges important gaps in the availability of images of common, short-duration skin conditions. View details