David Steiner

Health/Pathology team. Integration of molecular, laboratory, and imaging data. Background in molecular biology and molecular diagnostic test development, including analytical and clinical validation.
Authored Publications
    Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods followed by further evaluation on held out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high quality foundation models to enable further research across diverse applications.
    Predicting lymph node metastasis from primary tumor histology and clinicopathologic factors in colorectal cancer using deep learning
    Fraser Tan
    Isabelle Flament-Auvigne
    Trissia Brown
    Markus Plass
    Robert Reihs
    Heimo Mueller
    Kurt Zatloukal
    Pema Richeson
    Lily Peng
    Craig Mermel
    Cameron Chen
    Saurabh Gombar
    Thomas Montine
    Jeanne Shen
    Communications Medicine, 3 (2023), pp. 59
    Background: Presence of lymph node metastasis (LNM) influences prognosis and clinical decision-making in colorectal cancer. However, detection of LNM is variable and depends on a number of external factors. Deep learning has shown success in computational pathology, but has struggled to boost performance when combined with known predictors. Methods: Machine-learned features are created by clustering deep learning embeddings of small patches of tumor in colorectal cancer via k-means, and then selecting the top clusters that add predictive value to a logistic regression model when combined with known baseline clinicopathological variables. We then analyze performance of logistic regression models trained with and without these machine-learned features in combination with the baseline variables. Results: The machine-learned extracted features provide independent signal for the presence of LNM (AUROC: 0.638, 95% CI: [0.590, 0.683]). Furthermore, the machine-learned features add predictive value to the set of 6 clinicopathologic variables in an external validation set (likelihood ratio test, p < 0.00032; AUROC: 0.740, 95% CI: [0.701, 0.780]). A model incorporating these features can also further risk-stratify patients with and without identified metastasis (p < 0.001 for both stage II and stage III). Conclusion: This work demonstrates an effective approach to combine deep learning with established clinicopathologic factors in order to identify independently informative features associated with LNM. Further work building on these specific results may have important impact in prognostication and therapeutic decision making for LNM. Additionally, this general computational approach may prove useful in other contexts.
    Deep learning models for histologic grading of breast cancer and association with disease prognosis
    Trissia Brown
    Isabelle Flament
    Fraser Tan
    Yuannan Cai
    Kunal Nagpal
    Emad Rakha
    David J. Dabbs
    Niels Olson
    James H. Wren
    Elaine E. Thompson
    Erik Seetao
    Carrie Robinson
    Melissa Miao
    Fabien Beckers
    Lily Hao Yi Peng
    Craig Mermel
    Cameron Chen
    npj Breast Cancer (2022)
    Histologic grading of breast cancer involves review and scoring of three well-established morphologic features: mitotic count, nuclear pleomorphism, and tubule formation. Taken together, these features form the basis of the Nottingham Grading System which is used to inform breast cancer characterization and prognosis. In this study, we developed deep learning models to perform histologic scoring of all three components using digitized hematoxylin and eosin-stained slides containing invasive breast carcinoma. We then evaluated the prognostic potential of these models using an external test set and progression free interval as the primary outcome. The individual component models performed at or above published benchmarks for algorithm-based grading approaches and achieved high concordance rates in comparison to pathologist grading. Prognostic performance of histologic scoring provided by the deep learning-based grading was on par with that of pathologists performing review of matched slides. Additionally, by providing scores for each component feature, the deep-learning based approach provided the potential to identify the grading components contributing most to prognostic value. This may enable optimized prognostic models as well as opportunities to improve access to consistent grading and better understand the links between histologic features and clinical outcomes in breast cancer.
    Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge
    Wouter Bulten
    Kimmo Kartasalo
    Po-Hsuan Cameron Chen
    Peter Ström
    Hans Pinckaers
    Kunal Nagpal
    Yuannan Cai
    Hester van Boven
    Robert Vink
    Christina Hulsbergen-van de Kaa
    Jeroen van der Laak
    Mahul B. Amin
    Andrew J. Evans
    Theodorus van der Kwast
    Robert Allan
    Peter A. Humphrey
    Henrik Grönberg
    Hemamali Samaratunga
    Brett Delahunt
    Toyonori Tsuzuki
    Tomi Häkkinen
    Lars Egevad
    Maggie Demkin
    Sohier Dane
    Fraser Tan
    Masi Valkonen
    Lily Peng
    Craig H. Mermel
    Pekka Ruusuvuori
    Geert Litjens
    Martin Eklund
    the PANDA challenge consortium
    Nature Medicine, 28 (2022), pp. 154-163
    Artificial intelligence (AI) has shown promise for diagnosing prostate cancer in biopsies. However, results have been limited to individual studies, lacking validation in multinational settings. Competitions have been shown to be accelerators for medical imaging innovations, but their impact is hindered by lack of reproducibility and independent validation. With this in mind, we organized the PANDA challenge (the largest histopathology competition to date, joined by 1,290 developers) to catalyze development of reproducible AI algorithms for Gleason grading using 10,616 digitized prostate biopsies. We validated that a diverse set of submitted algorithms reached pathologist-level performance on independent cross-continental cohorts, fully blinded to the algorithm developers. On United States and European external validation sets, the algorithms achieved agreements of 0.862 (quadratically weighted κ, 95% confidence interval (CI), 0.840–0.884) and 0.868 (95% CI, 0.835–0.900) with expert uropathologists. Successful generalization across different patient populations, laboratories and reference standards, achieved by a variety of algorithmic approaches, warrants evaluating AI-based Gleason grading in prospective clinical trials.
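    The agreement statistic quoted in this abstract, quadratically weighted κ, can be illustrated with a short sketch. This is not the challenge's evaluation code; the function name `quadratic_weighted_kappa` and the integer label encoding (0 to `num_categories - 1`) are assumptions made for illustration only.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Cohen's kappa with quadratic weights for ordinal labels.

    Labels are assumed to be integers in 0..num_categories-1 (an
    illustrative encoding, not one taken from the paper).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same, non-zero number of labels")
    if num_categories < 2:
        raise ValueError("need at least two categories")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts
    hist_a = Counter(rater_a)                  # marginal counts, rater A
    hist_b = Counter(rater_b)                  # marginal counts, rater B
    max_dist = (num_categories - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_categories):
        for j in range(num_categories):
            # Disagreement weight grows with the squared distance.
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - weighted_observed / weighted_expected


# Identical gradings give kappa = 1; chance-level gradings give kappa = 0.
perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2)
```

    Quadratic weights penalize a two-step grading disagreement four times as heavily as a one-step disagreement, which is why the statistic suits ordinal scales such as grade groups.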
    Onboarding Materials as Cross-functional Boundary Objects for Developing AI Assistants
    Lauren Wilcox
    Samantha Winter
    Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, ACM (2021) (to appear)
    Deep neural networks (DNNs) routinely achieve state-of-the-art performance in a wide range of tasks. This case study reports on the development of onboarding (i.e., training) materials for a DNN-based medical AI Assistant to aid in the grading of prostate cancer. Specifically, we describe how the process of developing these materials deepened the team's understanding of end-user requirements, leading to changes in the development and assessment of the underlying machine learning model. In this sense, the onboarding materials served as a useful boundary object for a cross-functional team. We also present evidence of the utility of the subsequent onboarding materials by describing which information was found useful by participants in an experimental study.
    Determining Breast Cancer Biomarker Status and Associated Morphological Features Using Deep Learning
    Paul Gamble
    Harry Wang
    Fraser Tan
    Melissa Moran
    Trissia Brown
    Isabelle Flament
    Emad A. Rakha
    Michael Toss
    David J. Dabbs
    Peter Regitnig
    Niels Olson
    James H. Wren
    Carrie Robinson
    Lily Peng
    Craig Mermel
    Cameron Chen
    Communications Medicine (2021)
    Background: Breast cancer management depends on biomarkers including estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 (ER/PR/HER2). Though existing scoring systems are widely used and well-validated, they can involve costly preparation and variable interpretation. Additionally, discordances between histology and expected biomarker findings can prompt repeat testing to address biological, interpretative, or technical reasons for unexpected results. Methods: We developed three independent deep learning systems (DLS) to directly predict ER/PR/HER2 status for both focal tissue regions (patches) and slides using hematoxylin-and-eosin-stained (H&E) images as input. Models were trained and evaluated using pathologist annotated slides from three data sources. Areas under the receiver operator characteristic curve (AUCs) were calculated for test sets at both a patch-level (>135 million patches, 181 slides) and slide-level (n = 3274 slides, 1249 cases, 37 sites). Interpretability analyses were performed using Testing with Concept Activation Vectors (TCAV), saliency analysis, and pathologist review of clustered patches. Results: The patch-level AUCs are 0.939 (95% CI 0.936–0.941), 0.938 (0.936–0.940), and 0.808 (0.802–0.813) for ER/PR/HER2, respectively. At the slide level, AUCs are 0.86 (95% CI 0.84–0.87), 0.75 (0.73–0.77), and 0.60 (0.56–0.64) for ER/PR/HER2, respectively. Interpretability analyses show known biomarker-histomorphology associations including associations of low-grade and lobular histology with ER/PR positivity, and increased inflammatory infiltrates with triple-negative staining. Conclusions: This study presents rapid breast cancer biomarker estimation from routine H&E slides and builds on prior advances by prioritizing interpretability of computationally learned features in the context of existing pathological knowledge.
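    The AUCs reported in this abstract measure how well a score separates positive from negative examples. A minimal sketch of the metric follows, using its pairwise (Mann-Whitney) interpretation. It is not the study's evaluation code; the function name `auroc` and the 0/1 label encoding are assumptions for illustration, and the quadratic-time loop is only suitable for small inputs.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one, with ties
    counted as half. Labels are assumed to be 0 (negative) or 1
    (positive).
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ranked correctly.
example_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

    An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the patch-level and slide-level figures.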
    Predicting prostate cancer specific-mortality with artificial intelligence-based Gleason grading
    Kunal Nagpal
    Matthew Symonds
    Melissa Moran
    Markus Plass
    Robert Reihs
    Farah Nader
    Fraser Tan
    Yuannan Cai
    Trissia Brown
    Isabelle Flament
    Mahul Amin
    Martin Stumpe
    Heimo Muller
    Peter Regitnig
    Andreas Holzinger
    Lily Hao Yi Peng
    Cameron Chen
    Kurt Zatloukal
    Craig Mermel
    Communications Medicine (2021)
    Background. Gleason grading of prostate cancer is an important prognostic factor, but suffers from poor reproducibility, particularly among non-subspecialist pathologists. Although artificial intelligence (A.I.) tools have demonstrated Gleason grading on-par with expert pathologists, it remains an open question whether and to what extent A.I. grading translates to better prognostication. Methods. In this study, we developed a system to predict prostate cancer-specific mortality via A.I.-based Gleason grading and subsequently evaluated its ability to risk-stratify patients on an independent retrospective cohort of 2807 prostatectomy cases from a single European center with 5–25 years of follow-up (median: 13, interquartile range 9–17). Results. Here, we show that the A.I.’s risk scores produced a C-index of 0.84 (95% CI 0.80–0.87) for prostate cancer-specific mortality. Upon discretizing these risk scores into risk groups analogous to pathologist Grade Groups (GG), the A.I. has a C-index of 0.82 (95% CI 0.78–0.85). On the subset of cases with a GG provided in the original pathology report (n = 1517), the A.I.’s C-indices are 0.87 and 0.85 for continuous and discrete grading, respectively, compared to 0.79 (95% CI 0.71–0.86) for GG obtained from the reports. These represent improvements of 0.08 (95% CI 0.01–0.15) and 0.07 (95% CI 0.00–0.14), respectively. Conclusions. Our results suggest that A.I.-based Gleason grading can lead to effective risk stratification, and warrants further evaluation for improving disease management.
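    The C-index used in this abstract summarizes how well risk scores order patients by time to event. The sketch below shows one common formulation (Harrell's concordance index) and is not the study's code. The function name `concordance_index`, the argument layout, and the choice to skip pairs with tied times are simplifying assumptions for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  follow-up time per patient
    events: 1 if the event was observed, 0 if the patient was censored
    risks:  predicted risk score (higher means earlier expected event)

    A pair is comparable when the patient with the shorter follow-up
    had an observed event. Pairs with tied times are skipped here,
    which is a simplification.
    """
    if not (len(times) == len(events) == len(risks)):
        raise ValueError("times, events and risks must have the same length")

    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Four patients, one censored at t=10; three of four comparable pairs agree.
example_c = concordance_index(
    times=[5, 10, 15, 20],
    events=[1, 0, 1, 1],
    risks=[0.9, 0.3, 0.05, 0.1],
)  # 0.75
```

    As with AUC, 0.5 indicates chance-level ordering and 1.0 a perfect ordering of patients by risk.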
    Interpretable Survival Prediction for Colorectal Cancer using Deep Learning
    Melissa Moran
    Markus Plass
    Robert Reihs
    Fraser Tan
    Isabelle Flament
    Trissia Brown
    Peter Regitnig
    Cameron Chen
    Apaar Sadhwani
    Bob MacDonald
    Benny Ayalew
    Lily Hao Yi Peng
    Heimo Mueller
    Zhaoyang Xu
    Martin Stumpe
    Kurt Zatloukal
    Craig Mermel
    npj Digital Medicine (2021)
    Deriving interpretable prognostic features from deep-learning-based prognostic histopathology models remains a challenge. In this study, we developed a deep learning system (DLS) for predicting disease-specific survival for stage II and III colorectal cancer using 3652 cases (27,300 slides). When evaluated on two validation datasets containing 1239 cases (9340 slides) and 738 cases (7140 slides), respectively, the DLS achieved a 5-year disease-specific survival AUC of 0.70 (95% CI: 0.66–0.73) and 0.69 (95% CI: 0.64–0.72), and added significant predictive value to a set of nine clinicopathologic features. To interpret the DLS, we explored the ability of different human-interpretable features to explain the variance in DLS scores. We observed that clinicopathologic features such as T-category, N-category, and grade explained a small fraction of the variance in DLS scores (R² = 18% in both validation sets). Next, we generated human-interpretable histologic features by clustering embeddings from a deep-learning-based image-similarity model and showed that they explained the majority of the variance (R² of 73–80%). Furthermore, the clustering-derived feature most strongly associated with high DLS scores was also highly prognostic in isolation. With a distinct visual appearance (poorly differentiated tumor cell clusters adjacent to adipose tissue), this feature was identified by annotators with 87.0–95.5% accuracy. Our approach can be used to explain predictions from a prognostic deep learning model and uncover potentially-novel prognostic features that can be reliably identified by people for future validation studies.
    Comparative analysis of machine learning approaches to classify tumor mutation burden in lung adenocarcinoma using histopathology images
    Apaar Sadhwani
    Huang-Wei Chang
    Ali Behrooz
    Trissia Brown
    Isabelle Flament
    Hardik Patel
    Robert Findlater
    Vanessa Velez
    Fraser Tan
    Kamilla Marta Tekiela
    Eunhee Yi
    Craig Mermel
    Debra Hanks
    Cameron Chen
    Kimary Kulig
    Cory Batenchuk
    Peter Cimermancic
    Scientific Reports (2021)
    Both histologic subtype and tumor mutation burden (TMB) represent important biomarkers in lung cancer, with implications for patient prognosis as well as treatment decisions. Typically, TMB is evaluated by comprehensive genomic profiling but this requires use of finite tissue specimens as well as costly and time consuming laboratory processes. Histologic subtype classification represents an established component of lung adenocarcinoma histopathology, but it can be a challenging task with substantial inter-pathologist variability. Here we developed a deep learning system to both classify histologic patterns in lung adenocarcinoma and predict TMB status using Hematoxylin and Eosin (H&E) stained whole slide images. We first trained a convolutional neural network to comprehensively infer histologic subtypes across whole slide images of lung cancer resection specimens. This model achieved a patch-level area under the receiver operating characteristic curve (AUROC) of 0.78-0.98 for the individual features on a test set including TCGA slides and 50 external dataset slides. We then integrated the output of this model with clinico-demographic data to develop an interpretable model for TMB classification and evaluated the end-to-end system on 172 held out cases from TCGA, achieving an AUROC of 0.71 [95% CI 0.62-0.79]. Finally we also developed a weakly supervised model for TMB classification, finding that our histologic subtype-based approach achieves similar performance (AUROC of 0.72 95% CI XXX) to the weakly supervised approach. These results suggest interpretable approaches for molecular biomarker prediction based on established histologic patterns are feasible and comparable to more difficult to explain deep learning approaches.
    Development and Validation of a Deep Learning Algorithm for Gleason Grading of Prostate Cancer From Biopsy Specimens
    Kunal Nagpal
    Davis Foote
    Fraser Tan
    Cameron Chen
    Naren Manoj
    Niels Olson
    Jenny Smith
    Arash Mohtashamian
    Brandon Peterson
    Mahul Amin
    Andrew Evans
    Joan Sweet
    Carol Cheung
    Theodorus van der Kwast
    Ankur Sangoi
    Ming Zhou
    Robert W. Allan
    Peter A Humphrey
    Jason Hipp
    Krishna Kumar Gadepalli
    Lily Hao Yi Peng
    Martin Stumpe
    Craig Mermel
    JAMA Oncology (2020)
    Importance: For prostate cancer, Gleason grading of the biopsy specimen plays a pivotal role in determining case management. However, Gleason grading is associated with substantial interobserver variability, resulting in a need for decision support tools to improve the reproducibility of Gleason grading in routine clinical practice. Objective: To evaluate the ability of a deep learning system (DLS) to grade diagnostic prostate biopsy specimens. Design, Setting, and Participants: The DLS was evaluated using 752 deidentified digitized images of formalin-fixed paraffin-embedded prostate needle core biopsy specimens obtained from 3 institutions in the United States, including 1 institution not used for DLS development. To obtain the Gleason grade group (GG), each specimen was first reviewed by 2 expert urologic subspecialists from a multi-institutional panel of 6 individuals (years of experience: mean, 25 years; range, 18-34 years). A third subspecialist reviewed discordant cases to arrive at a majority opinion. To reduce diagnostic uncertainty, all subspecialists had access to an immunohistochemical-stained section and 3 histologic sections for every biopsied specimen. Their review was conducted from December 2018 to June 2019. Main Outcomes and Measures: The frequency of the exact agreement of the DLS with the majority opinion of the subspecialists in categorizing each tumor-containing specimen as 1 of 5 categories: nontumor, GG1, GG2, GG3, or GG4-5. For comparison, the rate of agreement of 19 general pathologists’ opinions with the subspecialists’ majority opinions was also evaluated. Results: For grading tumor-containing biopsy specimens in the validation set (n = 498), the rate of agreement with subspecialists was significantly higher for the DLS (71.7%; 95% CI, 67.9%-75.3%) than for general pathologists (58.0%; 95% CI, 54.5%-61.4%) (P < .001). In subanalyses of biopsy specimens from an external validation set (n = 322), the Gleason grading performance of the DLS remained similar. For distinguishing nontumor from tumor-containing biopsy specimens (n = 752), the rate of agreement with subspecialists was 94.3% (95% CI, 92.4%-95.9%) for the DLS and similar at 94.7% (95% CI, 92.8%-96.3%) for general pathologists (P = .58). Conclusions and Relevance: In this study, the DLS showed higher proficiency than general pathologists at Gleason grading prostate needle core biopsy specimens and generalized to an independent institution. Future research is necessary to evaluate the potential utility of using the DLS as a decision support tool in clinical workflows and to improve the quality of prostate cancer grading for therapy decisions.