David Steiner
Health/Pathology team. Integration of molecular, laboratory, and imaging data. Background in molecular biology and molecular diagnostic test development, including analytical and clinical validation.
Authored Publications
Health AI Developer Foundations
Atilla Kiraly
Sebastien Baur
Kenneth Philbrick
Fereshteh Mahvar
Liron Yatziv
Tiffany Chen
Bram Sterling
Nick George
Fayaz Jamil
Jing Tang
Kai Bailey
Faruk Ahmed
Akshay Goel
Abbi Ward
Lin Yang
Shravya Shetty
Daniel Golden
Tim Thelin
Rory Pilgrim
Can "John" Kirmizi
arXiv (2024)
Robust medical Machine Learning (ML) models have the potential to revolutionize healthcare by accelerating clinical research, improving workflows and outcomes, and producing novel insights or capabilities. Developing such ML models from scratch is cost-prohibitive and requires substantial compute, data, and time (e.g., expert labeling). To address these challenges, we introduce Health AI Developer Foundations (HAI-DEF), a suite of pre-trained, domain-specific foundation models, tools, and recipes to accelerate building ML for health applications. The models cover various modalities and domains, including radiology (X-rays and computed tomography), histopathology, dermatological imaging, and audio. These models provide domain-specific embeddings that facilitate AI development with less labeled data, shorter training times, and reduced computational costs compared to traditional approaches. In addition, we utilize a common interface and style across these models, and prioritize usability to enable developers to integrate HAI-DEF efficiently. We present model evaluations across various tasks and conclude with a discussion of their application and evaluation, covering the importance of ensuring efficacy, fairness, and equity. Finally, while HAI-DEF and specifically the foundation models lower the barrier to entry for ML in healthcare, we emphasize the importance of validation with problem- and population-specific data for each desired usage setting. This technical report will be updated over time as more modalities and features are added.
Domain-specific optimization and diverse evaluation of self-supervised models for histopathology
Jeremy Lai
Faruk Ahmed
Supriya Vijay
Jessica Loo
Saurabh Vyawahare
Saloni Agarwal
Fayaz Jamil
Cameron Chen
arXiv (2023)
Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods, followed by further evaluation on held-out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high-quality foundation models to enable further research across diverse applications.
Predicting lymph node metastasis from primary tumor histology and clinicopathologic factors in colorectal cancer using deep learning
Fraser Tan
Isabelle Flament-Auvigne
Trissia Brown
Markus Plass
Robert Reihs
Heimo Mueller
Kurt Zatloukal
Pema Richeson
Lily Peng
Craig Mermel
Cameron Chen
Saurabh Gombar
Thomas Montine
Jeanne Shen
Nature Communications Medicine, 3 (2023), pp. 59
Background: Presence of lymph node metastasis (LNM) influences prognosis and clinical decision-making in colorectal cancer. However, detection of LNM is variable and depends on a number of external factors. Deep learning has shown success in computational pathology, but has struggled to boost performance when combined with known predictors.
Methods: Machine-learned features are created by clustering deep learning embeddings of small patches of tumor in colorectal cancer via k-means, and then selecting the top clusters that add predictive value to a logistic regression model when combined with known baseline clinicopathological variables. We then analyze performance of logistic regression models trained with and without these machine-learned features in combination with the baseline variables.
Results: The machine-learned extracted features provide independent signal for the presence of LNM (AUROC: 0.638, 95% CI: [0.590, 0.683]). Furthermore, the machine-learned features add predictive value to the set of 6 clinicopathologic variables in an external validation set (likelihood ratio test, p < 0.00032; AUROC: 0.740, 95% CI: [0.701, 0.780]). A model incorporating these features can also further risk-stratify patients with and without identified metastasis (p < 0.001 for both stage II and stage III).
Conclusion: This work demonstrates an effective approach to combine deep learning with established clinicopathologic factors in order to identify independently informative features associated with LNM. Further work building on these specific results may have important impact in prognostication and therapeutic decision making for LNM. Additionally, this general computational approach may prove useful in other contexts.
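The feature-construction step described in the Methods above can be illustrated with a minimal sketch. It assumes patch embeddings are already available as plain lists of floats; the function names (`kmeans`, `cluster_fractions`), the toy data, and the choice of "fraction of patches per cluster" as the per-case feature are all hypothetical and are not taken from the paper. The sketch uses a plain Lloyd's-algorithm k-means and stops before the logistic regression and cluster-selection steps.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids (illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its members;
        # keep the previous centroid if a cluster ends up empty.
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids

def cluster_fractions(case_patches, centroids):
    """Per-case feature vector: fraction of the case's patches in each cluster."""
    counts = [0] * len(centroids)
    for p in case_patches:
        nearest = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        counts[nearest] += 1
    return [c / len(case_patches) for c in counts]

# Toy 2-D "embeddings" for two hypothetical cases.
cases = {
    "case_a": [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1]],
    "case_b": [[5.1, 5.0], [4.9, 5.2], [5.0, 4.9]],
}
all_patches = [p for patches in cases.values() for p in patches]
centroids = kmeans(all_patches, k=2)
features = {name: cluster_fractions(p, centroids) for name, p in cases.items()}
```

In a pipeline of this kind, the resulting per-case vectors would be appended to the baseline clinicopathologic variables before fitting a logistic regression; how clusters are ranked and selected is specific to the study and is not reproduced here.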
Deep learning models for histologic grading of breast cancer and association with disease prognosis
Trissia Brown
Isabelle Flament
Fraser Tan
Yuannan Cai
Kunal Nagpal
Emad Rakha
David J. Dabbs
Niels Olson
James H. Wren
Elaine E. Thompson
Erik Seetao
Carrie Robinson
Melissa Miao
Fabien Beckers
Lily Hao Yi Peng
Craig Mermel
Cameron Chen
npj Breast Cancer (2022)
Histologic grading of breast cancer involves review and scoring of three well-established morphologic features: mitotic count, nuclear pleomorphism, and tubule formation. Taken together, these features form the basis of the Nottingham Grading System, which is used to inform breast cancer characterization and prognosis. In this study, we developed deep learning models to perform histologic scoring of all three components using digitized hematoxylin and eosin-stained slides containing invasive breast carcinoma. We then evaluated the prognostic potential of these models using an external test set and progression-free interval as the primary outcome. The individual component models performed at or above published benchmarks for algorithm-based grading approaches and achieved high concordance rates in comparison to pathologist grading. Prognostic performance of histologic scoring provided by the deep learning-based grading was on par with that of pathologists performing review of matched slides. Additionally, by providing scores for each component feature, the deep learning-based approach provided the potential to identify the grading components contributing most to prognostic value. This may enable optimized prognostic models as well as opportunities to improve access to consistent grading and better understand the links between histologic features and clinical outcomes in breast cancer.
Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge
Wouter Bulten
Kimmo Kartasalo
Po-Hsuan Cameron Chen
Peter Ström
Hans Pinckaers
Kunal Nagpal
Yuannan Cai
Hester van Boven
Robert Vink
Christina Hulsbergen-van de Kaa
Jeroen van der Laak
Mahul B. Amin
Andrew J. Evans
Theodorus van der Kwast
Robert Allan
Peter A. Humphrey
Henrik Grönberg
Hemamali Samaratunga
Brett Delahunt
Toyonori Tsuzuki
Tomi Häkkinen
Lars Egevad
Maggie Demkin
Sohier Dane
Fraser Tan
Masi Valkonen
Lily Peng
Craig H. Mermel
Pekka Ruusuvuori
Geert Litjens
Martin Eklund
the PANDA challenge consortium
Nature Medicine, 28 (2022), pp. 154-163
Artificial intelligence (AI) has shown promise for diagnosing prostate cancer in biopsies. However, results have been limited to individual studies, lacking validation in multinational settings. Competitions have been shown to be accelerators for medical imaging innovations, but their impact is hindered by lack of reproducibility and independent validation. With this in mind, we organized the PANDA challenge—the largest histopathology competition to date, joined by 1,290 developers—to catalyze development of reproducible AI algorithms for Gleason grading using 10,616 digitized prostate biopsies. We validated that a diverse set of submitted algorithms reached pathologist-level performance on independent cross-continental cohorts, fully blinded to the algorithm developers. On United States and European external validation sets, the algorithms achieved agreements of 0.862 (quadratically weighted κ, 95% confidence interval (CI), 0.840–0.884) and 0.868 (95% CI, 0.835–0.900) with expert uropathologists. Successful generalization across different patient populations, laboratories and reference standards, achieved by a variety of algorithmic approaches, warrants evaluating AI-based Gleason grading in prospective clinical trials.
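The agreement metric quoted above, quadratically weighted κ, can be computed from two lists of ordinal grades. The following is a minimal sketch of the standard formula, not the challenge's evaluation code; the function name and the toy grade lists are hypothetical, and grades are assumed to be integers in `0..num_classes-1`.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """Cohen's kappa with quadratic disagreement weights for ordinal labels."""
    n = len(rater_a)
    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_classes))
              for j in range(num_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Penalty grows with the squared distance between the two grades.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # count expected under independence
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    # Undefined (division by zero) if both raters use only one identical grade.
    return 1.0 - weighted_observed / weighted_expected

perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], num_classes=4)
chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], num_classes=2)
```

Here `perfect` evaluates to 1.0 (identical grades) and `chance` to 0.0 (agreement no better than expected from the marginals). Because the weights are quadratic, a disagreement of two grade levels is penalized four times as heavily as a disagreement of one.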
Onboarding Materials as Cross-functional Boundary Objects for Developing AI Assistants
Lauren Wilcox
Samantha Winter
Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, ACM (2021) (to appear)
Deep neural networks (DNNs) routinely achieve state-of-the-art performance in a wide range of tasks. This case study reports on the development of onboarding (i.e., training) materials for a DNN-based medical AI Assistant to aid in the grading of prostate cancer. Specifically, we describe how the process of developing these materials deepened the team's understanding of end-user requirements, leading to changes in the development and assessment of the underlying machine learning model. In this sense, the onboarding materials served as a useful boundary object for a cross-functional team. We also present evidence of the utility of the subsequent onboarding materials by describing which information was found useful by participants in an experimental study.
Determining Breast Cancer Biomarker Status and Associated Morphological Features Using Deep Learning
Paul Gamble
Harry Wang
Fraser Tan
Melissa Moran
Trissia Brown
Isabelle Flament
Emad A. Rakha
Michael Toss
David J. Dabbs
Peter Regitnig
Niels Olson
James H. Wren
Carrie Robinson
Lily Peng
Craig Mermel
Cameron Chen
Nature Communications Medicine (2021)
Background: Breast cancer management depends on biomarkers including estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 (ER/PR/HER2). Though existing scoring systems are widely used and well-validated, they can involve costly preparation and variable interpretation. Additionally, discordances between histology and expected biomarker findings can prompt repeat testing to address biological, interpretative, or technical reasons for unexpected results.
Methods: We developed three independent deep learning systems (DLS) to directly predict ER/PR/HER2 status for both focal tissue regions (patches) and slides using hematoxylin-and-eosin-stained (H&E) images as input. Models were trained and evaluated using pathologist-annotated slides from three data sources. Areas under the receiver operator characteristic curve (AUCs) were calculated for test sets at both a patch-level (>135 million patches, 181 slides) and slide-level (n = 3274 slides, 1249 cases, 37 sites). Interpretability analyses were performed using Testing with Concept Activation Vectors (TCAV), saliency analysis, and pathologist review of clustered patches.
Results: The patch-level AUCs are 0.939 (95%CI 0.936–0.941), 0.938 (0.936–0.940), and 0.808 (0.802–0.813) for ER/PR/HER2, respectively. At the slide level, AUCs are 0.86 (95% CI 0.84–0.87), 0.75 (0.73–0.77), and 0.60 (0.56–0.64) for ER/PR/HER2, respectively. Interpretability analyses show known biomarker-histomorphology associations including associations of low-grade and lobular histology with ER/PR positivity, and increased inflammatory infiltrates with triple-negative staining.
Conclusions: This study presents rapid breast cancer biomarker estimation from routine H&E slides and builds on prior advances by prioritizing interpretability of computationally learned features in the context of existing pathological knowledge.
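The AUC values reported above summarize how well a score separates positive from negative examples. As a minimal sketch, the AUC equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name and toy data below are hypothetical, and the confidence intervals reported in the abstract are not reproduced.

```python
def auroc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (O(P*N))."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive correctly ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy example: binary biomarker status and model scores for four slides.
example_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

For this toy input, three of the four positive-negative pairs are ranked correctly, so `example_auc` is 0.75. The quadratic pairwise loop is chosen for clarity; a rank-based implementation would be preferable for the patch counts mentioned in the abstract.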
Predicting prostate cancer specific-mortality with artificial intelligence-based Gleason grading
Kunal Nagpal
Matthew Symonds
Melissa Moran
Markus Plass
Robert Reihs
Farah Nader
Fraser Tan
Yuannan Cai
Trissia Brown
Isabelle Flament
Mahul Amin
Martin Stumpe
Heimo Muller
Peter Regitnig
Andreas Holzinger
Lily Hao Yi Peng
Cameron Chen
Kurt Zatloukal
Craig Mermel
Communications Medicine (2021)
Background. Gleason grading of prostate cancer is an important prognostic factor, but suffers from poor reproducibility, particularly among non-subspecialist pathologists. Although artificial intelligence (A.I.) tools have demonstrated Gleason grading on-par with expert pathologists, it remains an open question whether and to what extent A.I. grading translates to better prognostication.
Methods. In this study, we developed a system to predict prostate cancer-specific mortality via A.I.-based Gleason grading and subsequently evaluated its ability to risk-stratify patients on an independent retrospective cohort of 2807 prostatectomy cases from a single European center with 5–25 years of follow-up (median: 13, interquartile range 9–17).
Results. Here, we show that the A.I.’s risk scores produced a C-index of 0.84 (95% CI 0.80–0.87) for prostate cancer-specific mortality. Upon discretizing these risk scores into risk groups analogous to pathologist Grade Groups (GG), the A.I. has a C-index of 0.82 (95% CI 0.78–0.85). On the subset of cases with a GG provided in the original pathology report (n = 1517), the A.I.’s C-indices are 0.87 and 0.85 for continuous and discrete grading, respectively, compared to 0.79 (95% CI 0.71–0.86) for GG obtained from the reports. These represent improvements of 0.08 (95% CI 0.01–0.15) and 0.07 (95% CI 0.00–0.14), respectively.
Conclusions. Our results suggest that A.I.-based Gleason grading can lead to effective risk stratification, and warrants further evaluation for improving disease management.
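The C-index used in the Results above measures how often a higher risk score goes with an earlier event. The sketch below implements Harrell's concordance index under simple assumptions: a pair is comparable only when the earlier of the two follow-up times ends in an observed event, and tied risk scores count as one half. The function name and toy cohort are hypothetical and do not come from the study.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable when i had an observed event
            # strictly before j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # higher risk, earlier event
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied risk scores
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: follow-up in years, event flag (1 = death from disease, 0 = censored).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_good = concordance_index(times, events, [0.9, 0.7, 0.2, 0.1])
c_mixed = concordance_index(times, events, [0.9, 0.1, 0.2, 0.7])
```

With five comparable pairs in this toy cohort, `c_good` is 1.0 (every pair ordered correctly) and `c_mixed` is 0.6 (three of five). A value of 0.5 corresponds to random ordering.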
Interpretable Survival Prediction for Colorectal Cancer using Deep Learning
Melissa Moran
Markus Plass
Robert Reihs
Fraser Tan
Isabelle Flament
Trissia Brown
Peter Regitnig
Cameron Chen
Apaar Sadhwani
Bob MacDonald
Benny Ayalew
Lily Hao Yi Peng
Heimo Mueller
Zhaoyang Xu
Martin Stumpe
Kurt Zatloukal
Craig Mermel
npj Digital Medicine (2021)
Deriving interpretable prognostic features from deep-learning-based prognostic histopathology models remains a challenge. In this study, we developed a deep learning system (DLS) for predicting disease-specific survival for stage II and III colorectal cancer using 3652 cases (27,300 slides). When evaluated on two validation datasets containing 1239 cases (9340 slides) and 738 cases (7140 slides), respectively, the DLS achieved a 5-year disease-specific survival AUC of 0.70 (95% CI: 0.66–0.73) and 0.69 (95% CI: 0.64–0.72), and added significant predictive value to a set of nine clinicopathologic features. To interpret the DLS, we explored the ability of different human-interpretable features to explain the variance in DLS scores. We observed that clinicopathologic features such as T-category, N-category, and grade explained a small fraction of the variance in DLS scores (R² = 18% in both validation sets). Next, we generated human-interpretable histologic features by clustering embeddings from a deep-learning-based image-similarity model and showed that they explained the majority of the variance (R² of 73–80%). Furthermore, the clustering-derived feature most strongly associated with high DLS scores was also highly prognostic in isolation. With a distinct visual appearance (poorly differentiated tumor cell clusters adjacent to adipose tissue), this feature was identified by annotators with 87.0–95.5% accuracy. Our approach can be used to explain predictions from a prognostic deep learning model and uncover potentially-novel prognostic features that can be reliably identified by people for future validation studies.
Comparative analysis of machine learning approaches to classify tumor mutation burden in lung adenocarcinoma using histopathology images
Apaar Sadhwani
Huang-Wei Chang
Ali Behrooz
Trissia Brown
Isabelle Flament
Hardik Patel
Robert Findlater
Vanessa Velez
Fraser Tan
Kamilla Marta Tekiela
Eunhee Yi
Craig Mermel
Debra Hanks
Cameron Chen
Kimary Kulig
Cory Batenchuk
Peter Cimermancic
Scientific Reports (2021)
Both histologic subtype and tumor mutation burden (TMB) represent important biomarkers in lung cancer, with implications for patient prognosis as well as treatment decisions. Typically, TMB is evaluated by comprehensive genomic profiling, but this requires use of finite tissue specimens as well as costly and time-consuming laboratory processes. Histologic subtype classification represents an established component of lung adenocarcinoma histopathology, but it can be a challenging task with substantial inter-pathologist variability. Here we developed a deep learning system to both classify histologic patterns in lung adenocarcinoma and predict TMB status using Hematoxylin and Eosin (H&E) stained whole slide images. We first trained a convolutional neural network to comprehensively infer histologic subtypes across whole slide images of lung cancer resection specimens. This model achieved a patch-level area under the receiver operating characteristic curve (AUROC) of 0.78-0.98 for the individual features on a test set including TCGA slides and 50 external dataset slides. We then integrated the output of this model with clinico-demographic data to develop an interpretable model for TMB classification and evaluated the end-to-end system on 172 held-out cases from TCGA, achieving an AUROC of 0.71 [95% CI 0.62-0.79]. Finally, we also developed a weakly supervised model for TMB classification, finding that our histologic subtype-based approach achieves similar performance (AUROC of 0.72, 95% CI XXX) to the weakly supervised approach. These results suggest that interpretable approaches for molecular biomarker prediction based on established histologic patterns are feasible and comparable to more difficult to explain deep learning approaches.