Mike Schaekermann
Authored Publications
    Towards Generalist Biomedical AI
    Danny Driess
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    NEJM AI (2024)
    BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery.
    METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports.
    RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility.
    CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems. (Funded by Alphabet Inc. and/or a subsidiary thereof.)
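    To make the side-by-side evaluation concrete, the sketch below computes a pairwise preference rate with an approximate 95% Wilson interval. This is a minimal illustration, not code from the paper; the comparison counts and the `preference_rate_with_ci` helper are hypothetical.

```python
from math import sqrt

def preference_rate_with_ci(preferred_model, z=1.96):
    """Fraction of side-by-side comparisons in which the model-generated report
    was preferred, with an approximate 95% Wilson score interval."""
    n = len(preferred_model)
    p = sum(preferred_model) / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return p, (centre - half, centre + half)

# Hypothetical outcomes for 246 side-by-side chest x-ray comparisons:
# True means the clinician preferred the model report over the radiologist's.
rate, (lo, hi) = preference_rate_with_ci([True] * 100 + [False] * 146)
print(f"preference rate: {rate:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```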
    Data Excellence for AI: Why Should You Care
    Matt Lease
    Praveen Kumar Paritosh
    ACM IX Interactions (2022)
    The efficacy of machine learning (ML) models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which empirical progress is measured. Benchmark datasets such as SQuAD, GLUE, and ImageNet define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the models, e.g., via shared-task challenges or Kaggle contests, rather than critiquing and improving the data environment in which our models operate. Research and community challenges focused on improving the data itself are relatively rare. If “data is the new oil,” our use of data remains crude today, and we are missing work on the refineries by which the data itself could be optimized for more effective use. Important scientific opportunities and value are being neglected [Schaekermann et al., 2020].
    Data is potentially the most under-valued and de-glamorised aspect of today’s AI ecosystem. Data issues are often perceived and characterized as mundane and rote, the “pre-processing” that has to be done before the real (modeling) work can begin. For example, Kandel et al. (2012) emphasize that ML practitioners view data wrangling as tedious and time-consuming. However, Sambasivan et al. (2021) provide examples of how data quality is crucial to ensure that AI systems can accurately represent and predict the phenomena they claim to measure. They introduce four classes of Data Cascades: compounding events causing negative, downstream effects from data issues triggered by conventional AI/ML practices that undervalue data quality. This emphasizes the significance of data through its downstream impact on user wellbeing and societal effects. Real-world datasets are often “dirty,” with various data quality problems (Northcutt et al., 2021) and the attendant risk of “garbage in = garbage out” for the downstream AI systems we train and test on such data. This has inspired a steadily growing body of work on understanding and improving data quality (Chu et al., 2013; Krishnan et al., 2016; Redman et al., 2018; Raman et al., 2001). It also highlights the importance of rigorously managing data quality using mechanisms specific to data validation, instead of relying on model performance as a proxy for data quality (Thomas et al., 2020). Just as we rigorously test our code for software defects before deployment, we should test for data defects with the same degree of rigor, so that we can detect, prevent, or mitigate weaknesses in ML models caused by underlying issues in data quality.
    The “Crowdsourcing Adverse Test Sets for Machine Learning” (CATS4ML) Data Challenge (Aroyo and Paritosh, 2021) aims to raise the bar for ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. Similarly to Vandenhof (2019), CATS4ML relies on people’s abilities and intuition to spot new data examples that machine learning models classify confidently but incorrectly. This research is inspired by Attenberg et al. (2015) and follows the claim by Ipeirotis (2016) that “humans should always be part of machine learning solutions, as they can guide machine learning systems to learn about things that the systems don't yet know — the ‘unknown unknowns.’”
    Many benchmark datasets contain instances that are relatively easy (e.g., photos with a subject that is easy to identify) and, in so doing, miss the natural ambiguity of the real world in which our models are actually applied. Data instances with annotator disagreement are often aggregated to eliminate disagreement (obscuring uncertainty) or filtered out of datasets entirely. Excluding difficult and/or ambiguous real-world examples from evaluation risks producing “toy dataset” benchmarks that diverge from the real data encountered in practice. Successful benchmark models then fail to generalize to real data, and inflated benchmark results mislead our assessment of state-of-the-art capabilities. ML models become prone to developing “weak spots,” i.e., classes of examples that are difficult or impossible for a model to evaluate accurately because that class of examples is missing from the evaluation set.
    Measuring data quality is challenging, nebulous, and often circularly defined, with annotated data defining the “ground truth” on which models are trained and tested [Riezler, 2014]. When dataset quality is considered, the ways in which it is measured in practice are often poorly understood and sometimes simply wrong. Challenges identified include fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019; Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018; Gunderson and Kjensmo, 2018], and lack of documentation and replication of data [Katsuno et al., 2019].
    Measurement of AI success and progress today is often metrics-driven, with emphasis on rigorous measurement and A/B testing. However, measuring the goodness of fit of a model to a dataset completely ignores how well the dataset fits the real-world problem to be solved and its data. Goodness-of-fit metrics, such as F1, accuracy, and AUC, do not tell us much about data fidelity (i.e., how well the dataset represents reality) and validity (how well the data explains things related to the phenomena captured by the data). No standardised metrics exist today for characterising the goodness-of-data [11,13]. Research on metrics is emerging [15,91] but is not yet widely known, accepted, or applied in the AI ecosystem. As a result, there is an overreliance on goodness-of-fit metrics and post-deployment product metrics. Focusing on the fidelity and validity of data will further increase its scientific value and reusability. Such research is necessary for creating better incentives for data, as it is hard to improve something we cannot measure.
    Researchers in human computation (HCOMP) and various ML-related fields have demonstrated a longstanding interest in applying crowdsourcing approaches to generate human-annotated data for model training and testing [25,128]. A series of workshops (Meta-Eval 2020 @ AAAI, REAIS 2019 @ HCOMP, SAD 2019 @ TheWebConf (WWW), SAD 2018 @ HCOMP) have helped raise awareness of data quality issues in ML evaluation and provide a venue for scholarship on this subject. Because human-annotated data represents the compass that the entire ML community relies on, data-focused research, by the HCOMP community and others, can have a multiplicative effect on accelerating progress in ML more broadly. Optimizing the cost, size, and speed of data collection has attracted significant attention in the first-to-market rush with data, but the maintainability, reliability, validity, and fidelity of datasets have often been overlooked.
    We argue that the field of ML has now reached an inflection point at which attention to neglected data quality is poised to significantly accelerate progress. Toward this end, we advocate for research defining and creating processes to achieve data excellence, and we highlight examples, case studies, and methodologies. This will enable the necessary change in our research culture to value excellence in data practices, a critical milestone on the road to the next generation of breakthroughs in ML and AI.
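    As an illustration of the point about aggregation obscuring uncertainty, here is a minimal sketch (not from the article) that keeps per-item annotator disagreement visible instead of collapsing labels to a majority vote; the label data, label names, and entropy threshold are illustrative assumptions.

```python
from collections import Counter
from math import log2

def label_distribution(labels):
    """Per-item label distribution and entropy, kept instead of a single
    majority-vote label so that annotator disagreement stays visible."""
    counts = Counter(labels)
    n = len(labels)
    dist = {label: c / n for label, c in counts.items()}
    entropy = -sum(p * log2(p) for p in dist.values())
    return dist, entropy

# Hypothetical annotations from five raters per image.
items = {
    "img_001": ["referable", "referable", "referable", "referable", "referable"],
    "img_002": ["referable", "non-referable", "referable", "non-referable", "ungradable"],
}
for item_id, labels in items.items():
    dist, h = label_distribution(labels)
    status = "send to adjudication" if h > 0.9 else "consensus"  # arbitrary threshold
    print(item_id, dist, f"entropy={h:.2f}", status)
```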
    Real-time diabetic retinopathy screening by deep learning in a multisite national screening programme: a prospective interventional cohort study
    Dr. Paisan Raumviboonsuk
    Variya Nganthavee
    Kornwipa Hemarat
    Apinpat Kongprayoon
    Rajiv Raman
    Brian Levinstein
    Roy Lee
    Sunny Virmani
    John Chambers
    Fred Hersch
    Lily Hao Yi Peng
    The Lancet Digital Health (2022)
    Background: Diabetic retinopathy is a leading cause of preventable blindness, especially in low-income and middle-income countries (LMICs). Deep-learning systems have the potential to enhance diabetic retinopathy screening in these settings, yet prospective studies assessing their usability and performance are scarce.
    Methods: We did a prospective interventional cohort study to evaluate the real-world performance and feasibility of deploying a deep-learning system into the health-care system of Thailand. Patients with diabetes and listed on the national diabetes registry, aged 18 years or older, able to have their fundus photograph taken for at least one eye, and due for screening as per the Thai Ministry of Public Health guidelines were eligible for inclusion. Eligible patients were screened with the deep-learning system at nine primary care sites under Thailand's national diabetic retinopathy screening programme. Patients with a previous diagnosis of diabetic macular oedema, severe non-proliferative diabetic retinopathy, or proliferative diabetic retinopathy; previous laser treatment of the retina or retinal surgery; other non-diabetic retinopathy eye disease requiring referral to an ophthalmologist; or inability to have a fundus photograph taken of both eyes for any reason were excluded. Deep-learning system-based interpretations of patient fundus images and referral recommendations were provided in real time. As a safety mechanism, regional retina specialists over-read each image. Performance of the deep-learning system (accuracy, sensitivity, specificity, positive predictive value [PPV], and negative predictive value [NPV]) was measured against an adjudicated reference standard provided by fellowship-trained retina specialists. This study is registered with the Thai national clinical trials registry, TCRT20190902002.
    Findings: Between Dec 12, 2018, and March 29, 2020, 7940 patients were screened for inclusion. 7651 (96·3%) patients were eligible for study analysis, and 2412 (31·5%) patients were referred for diabetic retinopathy, diabetic macular oedema, ungradable images, or low visual acuity. For vision-threatening diabetic retinopathy, the deep-learning system had an accuracy of 94·7% (95% CI 93·0–96·2), sensitivity of 91·4% (87·1–95·0), and specificity of 95·4% (94·1–96·7). The retina specialist over-readers had an accuracy of 93·5% (91·7–95·0; p=0·17), a sensitivity of 84·8% (79·4–90·0; p=0·024), and a specificity of 95·5% (94·1–96·7; p=0·98). The PPV for the deep-learning system was 79·2% (95% CI 73·8–84·3) compared with 75·6% (69·8–81·1) for the over-readers. The NPV for the deep-learning system was 95·5% (92·8–97·9) compared with 92·4% (89·3–95·5) for the over-readers.
    Interpretation: A deep-learning system can deliver real-time diabetic retinopathy detection capability similar to that of retina specialists in community-based screening settings. Socioenvironmental factors and workflows must be taken into consideration when implementing a deep-learning system within a large-scale screening programme in LMICs.
    Funding: Google and Rajavithi Hospital, Bangkok, Thailand.
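    For readers unfamiliar with the reported metrics, the sketch below computes accuracy, sensitivity, specificity, PPV, and NPV from binary referral decisions against a reference standard. It is a generic illustration under assumed toy labels, not the study's analysis code.

```python
def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV, and NPV for binary referral
    decisions, with 1 denoting vision-threatening disease in the reference."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical toy labels (1 = vision-threatening DR per the reference standard).
reference = [1, 1, 0, 0, 0, 1, 0, 0]
model     = [1, 1, 0, 0, 1, 1, 0, 0]
print(screening_metrics(reference, model))
```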
    Expert Discussions Improve Comprehension of Difficult Cases in Medical Image Assessment
    Abigail E. Huang
    ACM CHI Conference on Human Factors in Computing Systems (CHI 2020)
    Medical data labeling workflows critically depend on accurate assessments from human experts, yet human assessments can vary markedly, even among medical experts. Prior research has demonstrated benefits of labeler training on performance. Here we utilized two types of labeler training feedback: highlighting incorrect labels for difficult cases ("individual performance" feedback), and expert discussions from adjudication of these cases. We presented ten non-specialist eye care professionals with either individual performance alone, or individual performance and expert discussions. Compared to performance feedback alone, seeing expert discussions significantly improved non-specialists' understanding of the rationale behind the correct diagnosis while motivating changes in their own labeling approach; it also significantly improved average accuracy on one of four pathologies in a held-out test set. This work suggests that image adjudication may provide benefits beyond developing trusted consensus labels, and that exposure to specialist discussions can be an effective training intervention for medical diagnosis.
    Longitudinal Screening for Diabetic Retinopathy in a Nationwide Screening Program: Comparing Deep Learning and Human Graders
    Jirawut Limwattanayingyong
    Variya Nganthavee
    Kasem Seresirikachorn
    Tassapol Singalavanija
    Ngamphol Soonthornworasiri
    Varis Ruamviboonsuk
    Chetan Rao
    Rajiv Raman
    Andrzej Grzybowski
    Lily Hao Yi Peng
    Fred Hersch
    Richa Tiwari, PhD
    Dr. Paisan Raumviboonsuk
    Journal of Diabetes Research (2020)
    Objective. To evaluate diabetic retinopathy (DR) screening via deep learning (DL) and trained human graders (HG) in a longitudinal cohort, as the case spectrum shifts based on treatment referral and new-onset DR.
    Methods. We randomly selected patients with diabetes screened twice, two years apart, within a nationwide screening program. The reference standard was established via adjudication by retina specialists. Each patient's color fundus photographs were graded, and a patient was considered as having sight-threatening DR (STDR) if the worse eye had severe nonproliferative DR, proliferative DR, or diabetic macular edema. We compared DR screening via two modalities: DL and HG. For each modality, we simulated treatment referral by excluding patients with detected STDR from the second screening using that modality.
    Results. There were 5,738 patients (12.3% STDR) in the first screening. DL and HG captured different numbers of STDR cases, and after simulated referral and exclusion of ungradable cases, 4,148 and 4,263 patients remained in the second screening, respectively. The STDR prevalence at the second screening was 5.1% and 6.8% for DL- and HG-based screening, respectively. Along with the decrease in prevalence, the sensitivity of both modalities decreased from the first to the second screening (DL: from 95% to 90%, p=0.008; HG: from 74% to 57%, p<0.001). At both the first and second screenings, the false negative rate for DL was a fifth that of HG (0.5–0.6% vs. 2.9–3.2%).
    Conclusion. On 2-year longitudinal follow-up of a DR screening cohort, STDR prevalence decreased for both DL- and HG-based screening. Follow-up screenings in longitudinal DR screening can be more difficult and yield lower sensitivity for both DL and HG, though the false negative rate was substantially lower for DL. Our data may be useful for health-economics analyses of longitudinal screening settings.
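    A minimal sketch of the simulated-referral step described above: patients flagged as STDR (or ungradable) by a given modality in round one are excluded from that modality's second-round cohort. The record structure and field names are hypothetical, not taken from the study.

```python
def second_round_cohort(first_round, modality):
    """Return patient IDs retained for the second screening under one modality:
    anyone that modality flagged as STDR or ungradable in round 1 is excluded."""
    return [
        p["id"]
        for p in first_round
        if not p["stdr_detected"][modality] and not p["ungradable"][modality]
    ]

# Hypothetical round-1 records for three patients.
first_round = [
    {"id": 1, "stdr_detected": {"dl": True,  "hg": True},  "ungradable": {"dl": False, "hg": False}},
    {"id": 2, "stdr_detected": {"dl": False, "hg": True},  "ungradable": {"dl": False, "hg": False}},
    {"id": 3, "stdr_detected": {"dl": False, "hg": False}, "ungradable": {"dl": False, "hg": False}},
]
print(second_round_cohort(first_round, "dl"))  # [2, 3]
print(second_round_cohort(first_round, "hg"))  # [3]
```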
    Purpose: To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with eye care providers.
    Design: Development and validation of an algorithm.
    Participants: Fundus images from screening programs, studies, and a glaucoma clinic.
    Methods: A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic.
    Main Outcome Measures: The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features.
    Results: The algorithm’s AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929–0.960) in dataset A, 0.855 (95% CI, 0.841–0.870) in dataset B, and 0.881 (95% CI, 0.838–0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels.
    Conclusions: A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup.
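    For the headline metric above, AUC for referable GON can be computed from model scores and reference labels as in this minimal sketch; the score and label arrays are hypothetical, and scikit-learn's `roc_auc_score` is used as a stand-in for whatever tooling the study employed.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical referable-GON reference labels (1 = referable) and model scores.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.92, 0.15, 0.40, 0.78, 0.05, 0.66, 0.30, 0.12]

print("AUC for referable GON:", roc_auc_score(y_true, y_score))
```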
    Purpose: To present and evaluate a remote, tool-based system and structured grading rubric for adjudicating image-based diabetic retinopathy (DR) grades.
    Methods: We compared three different procedures for adjudicating DR severity assessments among retina specialist panels, including (1) in-person adjudication based on a previously described procedure (Baseline), (2) remote, tool-based adjudication for assessing DR severity alone (TA), and (3) remote, tool-based adjudication using a feature-based rubric (TA-F). We developed a system allowing graders to review images remotely and asynchronously. For both TA and TA-F approaches, images with disagreement were reviewed by all graders in a round-robin fashion until disagreements were resolved. Five panels of three retina specialists each adjudicated a set of 499 retinal fundus images (1 panel using Baseline, 2 using TA, and 2 using TA-F adjudication). Reliability was measured as grade agreement among the panels using Cohen's quadratically weighted kappa. Efficiency was measured as the number of rounds needed to reach a consensus for tool-based adjudication.
    Results: The grades from remote, tool-based adjudication showed high agreement with the Baseline procedure, with Cohen's kappa scores of 0.948 and 0.943 for the two TA panels, and 0.921 and 0.963 for the two TA-F panels. Cases adjudicated using TA-F were resolved in fewer rounds compared with TA (P < 0.001; standard permutation test).
    Conclusions: Remote, tool-based adjudication presents a flexible and reliable alternative to in-person adjudication for DR diagnosis. Feature-based rubrics can help accelerate consensus for tool-based adjudication of DR without compromising label quality.
    Translational Relevance: This approach can generate reference standards to validate automated methods, and resolve ambiguous diagnoses by integrating into existing telemedical workflows.
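    The agreement statistic above, Cohen's quadratically weighted kappa, can be computed with scikit-learn as sketched below; the two grade sequences are hypothetical DR severity grades on a 5-point scale, not data from the study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical DR severity grades (0 = none ... 4 = proliferative DR)
# assigned to the same ten images by two adjudication panels.
panel_baseline = [0, 1, 2, 2, 3, 4, 0, 1, 2, 3]
panel_remote   = [0, 1, 2, 3, 3, 4, 0, 1, 1, 3]

kappa = cohen_kappa_score(panel_baseline, panel_remote, weights="quadratic")
print(f"Quadratically weighted kappa: {kappa:.3f}")
```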