Daniel Tse

Authored Publications
    Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the United States and Japan
    Atilla Kiraly
    Corbin Cunningham
    Ryan Najafi
    Jie Yang
    Chuck Lau
    Diego Ardila
    Scott Mayer McKinney
    Rory Pilgrim
    Mozziyar Etemadi
    Sunny Jansen
    Lily Peng
    Shravya Shetty
    Neeral Beladia
    Krish Eswaran
    Radiology: Artificial Intelligence (2024)
    Lung cancer is the leading cause of cancer death worldwide, with 1.8 million deaths in 2020. Studies have concluded that low-dose computed tomography lung cancer screening can reduce mortality by up to 61%, and updated 2021 US guidelines expanded eligibility. As screening efforts rise, AI can play an important role, but must be unobtrusively integrated into existing clinical workflows. In this work, we introduce a state-of-the-art, cloud-based AI system providing lung cancer risk assessments without requiring any user input. We demonstrate its efficacy in assisting lung cancer screening under both US and Japanese screening settings using different patient populations and screening protocols. Technical improvements over a previously described system include a focus on earlier cancer detection for improved accuracy, introduction of an effective assistive user interface, and a system designed to integrate into typical clinical workflows. The stand-alone AI system was evaluated on 3085 individuals achieving area under the curve (AUC) scores of 91.7% (95%CI [89.6, 95.2]), 93.3% (95%CI [90.2, 95.7]), and 89.1% (95%CI [77.7, 97.3]) on three datasets (two from US and one from Japan), respectively. To evaluate the system’s assistive ability, we conducted two retrospective multi-reader multi-case studies on 627 cases read by experienced board-certified radiologists (average 20 years of experience [7,40]) using local PACS systems in the respective US and Japanese screening settings. The studies measured the reader’s level of suspicion (LoS) and categorical responses for scores and management recommendations under country-specific screening protocols. The radiologists’ AUC for LoS increased with AI assistance by 2.3% (95%CI [0.1-4.5], p=0.022) for the US study and by 2.3% (95%CI [-3.5-8.1], p=0.179) for the Japan study. Specificity for recalls increased by 5.5% (95%CI [2.7-8.5], p<0.0001) for the US and 6.7% (95%CI [4.7-8.7], p<0.0001) for the Japan study. No significant reduction in other metrics occurred. This work advances the state-of-the-art in lung cancer detection, introduces generalizable interface concepts that can be applicable to similar AI applications, and demonstrates its potential impact on diagnostic AI in global lung cancer screening with results suggesting a substantial drop in unnecessary follow-up procedures without impacting sensitivity.
    Development of a Machine Learning Model for Sonographic Assessment of Gestational Age
    Chace Lee
    Angelica Willis
    Christina Chen
    Amber Watters
    Bethany Stetson
    Akib Uddin
    Jonny Wong
    Rory Pilgrim
    Kat Chou
    Shravya Ramesh Shetty
    Ryan Gomes
    JAMA Network Open (2023)
    Importance: Fetal ultrasonography is essential for confirmation of gestational age (GA), and accurate GA assessment is important for providing appropriate care throughout pregnancy and for identifying complications, including fetal growth disorders. Derivation of GA from manual fetal biometry measurements (ie, head, abdomen, and femur) is operator dependent and time-consuming. Objective: To develop artificial intelligence (AI) models to estimate GA with higher accuracy and reliability, leveraging standard biometry images and fly-to ultrasonography videos. Design, Setting, and Participants: To improve GA estimates, this diagnostic study used AI to interpret standard plane ultrasonography images and fly-to ultrasonography videos, which are 5- to 10-second videos that can be automatically recorded as part of the standard of care before the still image is captured. Three AI models were developed and validated: (1) an image model using standard plane images, (2) a video model using fly-to videos, and (3) an ensemble model (combining both image and video models). The models were trained and evaluated on data from the Fetal Age Machine Learning Initiative (FAMLI) cohort, which included participants from 2 study sites at Chapel Hill, North Carolina (US), and Lusaka, Zambia. Participants were eligible to be part of this study if they received routine antenatal care at 1 of these sites, were aged 18 years or older, had a viable intrauterine singleton pregnancy, and could provide written consent. They were not eligible if they had known uterine or fetal abnormality, or had any other conditions that would make participation unsafe or complicate interpretation. Data analysis was performed from January to July 2022. Main Outcomes and Measures: The primary analysis outcome for GA was the mean difference in absolute error between the GA model estimate and the clinical standard estimate, with the ground truth GA extrapolated from the initial GA estimated at an initial examination. Results: Of the total cohort of 3842 participants, data were calculated for a test set of 404 participants with a mean (SD) age of 28.8 (5.6) years at enrollment. All models were statistically superior to standard fetal biometry–based GA estimates derived from images captured by expert sonographers. The ensemble model had the lowest mean absolute error compared with the clinical standard fetal biometry (mean [SD] difference, −1.51 [3.96] days; 95% CI, −1.90 to −1.10 days). All 3 models outperformed standard biometry by a more substantial margin on fetuses that were predicted to be small for their GA. Conclusions and Relevance: These findings suggest that AI models have the potential to empower trained operators to estimate GA with higher accuracy.
    ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders
    Shawn Xu
    Lin Yang
    Timo Kohlberger
    Martin Ma
    Atilla Kiraly
    Sahar Kazemzadeh
    Zakkai Melamed
    Jungyeon Park
    Patricia MacWilliams
    Chuck Lau
    Preeti Singh
    Christina Chen
    Mozziyar Etemadi
    Sreenivasa Raju Kalidindi
    Kat Chou
    Shravya Shetty
    Daniel Golden
    Rory Pilgrim
    Krish Eswaran
    arXiv (2023)
    Our approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, leverages a language-aligned image encoder combined or grafted onto a fixed LLM, PaLM 2, to perform a broad range of tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.
    A mobile-optimized artificial intelligence system for gestational age and fetal malpresentation assessment
    Ryan Gomes
    Bellington Vwalika
    Chace Lee
    Angelica Willis
    Joan T. Price
    Christina Chen
    Margaret P. Kasaro
    James A. Taylor
    Elizabeth M. Stringer
    Scott Mayer McKinney
    Ntazana Sindano
    George Edward Dahl
    William Goodnight, III
    Justin Gilmer
    Benjamin H. Chi
    Charles Lau
    Terry Spitz
    Kris Liu
    Jonny Wong
    Rory Pilgrim
    Akib Uddin
    Lily Hao Yi Peng
    Kat Chou
    Jeffrey S. A. Stringer
    Shravya Ramesh Shetty
    Communications Medicine (2022)
    Background: Fetal ultrasound is an important component of antenatal care, but shortage of adequately trained healthcare workers has limited its adoption in low-to-middle-income countries. This study investigated the use of artificial intelligence for fetal ultrasound in under-resourced settings. Methods: Blind sweep ultrasounds, consisting of six freehand ultrasound sweeps, were collected by sonographers in the USA and Zambia, and novice operators in Zambia. We developed artificial intelligence (AI) models that used blind sweeps to predict gestational age (GA) and fetal malpresentation. AI GA estimates and standard fetal biometry estimates were compared to a previously established ground truth, and evaluated for difference in absolute error. Fetal malpresentation (non-cephalic vs cephalic) was compared to sonographer assessment. On-device AI model run-times were benchmarked on Android mobile phones. Results: Here we show that GA estimation accuracy of the AI model is non-inferior to standard fetal biometry estimates (error difference -1.4 ± 4.5 days, 95% CI -1.8, -0.9, n=406). Non-inferiority is maintained when blind sweeps are acquired by novice operators performing only two of six sweep motion types. Fetal malpresentation AUC-ROC is 0.977 (95% CI, 0.949, 1.00, n=613); sonographers and novices have similar AUC-ROC. Software run-times on mobile phones for both diagnostic models are less than 3 seconds after completion of a sweep. Conclusions: The gestational age model is non-inferior to the clinical standard and the fetal malpresentation model has high AUC-ROCs across operators and devices. Our AI models are able to run on-device, without internet connectivity, and provide feedback scores to assist in upleveling the capabilities of lightly trained ultrasound operators in low resource settings.
    Simplified Transfer Learning for Chest X-ray Models using Less Data
    Christina Chen
    AJ Maschinot
    Jenny Huang
    Chuck Lau
    Sreenivasa Raju Kalidindi
    Mozziyar Etemadi
    Florencia Garcia-Vicente
    David Melnick
    Krish Eswaran
    Neeral Beladia
    Dilip Krishnan
    Shravya Ramesh Shetty
    Radiology (2022)
    Background: Developing deep learning models for radiology requires large data sets and substantial computational resources. Data set size limitations can be further exacerbated by distribution shifts, such as rapid changes in patient populations and standard of care during the COVID-19 pandemic. A common partial mitigation is transfer learning by pretraining a “generic network” on a large nonmedical data set and then fine-tuning on a task-specific radiology data set. Purpose: To reduce data set size requirements for chest radiography deep learning models by using an advanced machine learning approach (supervised contrastive [SupCon] learning) to generate chest radiography networks. Materials and Methods: SupCon helped generate chest radiography networks from 821 544 chest radiographs from India and the United States. The chest radiography networks were used as a starting point for further machine learning model development for 10 prediction tasks (eg, airspace opacity, fracture, tuberculosis, and COVID-19 outcomes) by using five data sets comprising 684 955 chest radiographs from India, the United States, and China. Three model development setups were tested (linear classifier, nonlinear classifier, and fine-tuning the full network) with different data set sizes from eight to 85. Results: Across a majority of tasks, compared with transfer learning from a nonmedical data set, SupCon reduced label requirements up to 688-fold and improved the area under the receiver operating characteristic curve (AUC) at matching data set sizes. At the extreme low-data regimen, training small nonlinear models by using only 45 chest radiographs yielded an AUC of 0.95 (noninferior to radiologist performance) in classifying microbiology-confirmed tuberculosis in external validation. At a more moderate data regimen, training small nonlinear models by using only 528 chest radiographs yielded an AUC of 0.75 in predicting severe COVID-19 outcomes. Conclusion: Supervised contrastive learning enabled performance comparable to state-of-the-art deep learning models in multiple clinical tasks by using as few as 45 images and is a promising method for predictive modeling with use of small data sets and for predicting outcomes in shifting patient populations.
    Deep Learning Detection of Active Pulmonary Tuberculosis at Chest Radiography Matched the Clinical Performance of Radiologists
    Sahar Kazemzadeh
    Jin Yu
    Shahar Jamshy
    Rory Pilgrim
    Christina Chen
    Neeral Beladia
    Chuck Lau
    Scott Mayer McKinney
    Thad Hughes
    Atilla Peter Kiraly
    Sreenivasa Raju Kalidindi
    Monde Muyoyeta
    Jameson Malemela
    Ting Shih
    Lily Hao Yi Peng
    Kat Chou
    Cameron Chen
    Krish Eswaran
    Shravya Ramesh Shetty
    Radiology (2022)
    Background: The World Health Organization (WHO) recommends chest radiography to facilitate tuberculosis (TB) screening. However, chest radiograph interpretation expertise remains limited in many regions. Purpose: To develop a deep learning system (DLS) to detect active pulmonary TB on chest radiographs and compare its performance to that of radiologists. Materials and Methods: A DLS was trained and tested using retrospective chest radiographs (acquired between 1996 and 2020) from 10 countries. To improve generalization, large-scale chest radiograph pretraining, attention pooling, and semisupervised learning (“noisy-student”) were incorporated. The DLS was evaluated in a four-country test set (China, India, the United States, and Zambia) and in a mining population in South Africa, with positive TB confirmed with microbiological tests or nucleic acid amplification testing (NAAT). The performance of the DLS was compared with that of 14 radiologists. The authors studied the efficacy of the DLS compared with that of nine radiologists using the Obuchowski-Rockette-Hillis procedure. Given WHO targets of 90% sensitivity and 70% specificity, the operating point of the DLS (0.45) was prespecified to favor sensitivity. Results: A total of 165 754 images in 22 284 subjects (mean age, 45 years; 21% female) were used for model development and testing. In the four-country test set (1236 subjects, 17% with active TB), the receiver operating characteristic (ROC) curve of the DLS was higher than those for all nine India-based radiologists, with an area under the ROC curve of 0.89 (95% CI: 0.87, 0.91). Compared with these radiologists, at the prespecified operating point, the DLS sensitivity was higher (88% vs 75%, P < .001) and specificity was noninferior (79% vs 84%, P = .004). Trends were similar within other patient subgroups, in the South Africa data set, and across various TB-specific chest radiograph findings. In simulations, the use of the DLS to identify likely TB-positive chest radiographs for NAAT confirmation reduced the cost by 40%–80% per TB-positive patient detected. Conclusion: A deep learning method was found to be noninferior to radiologists for the determination of active tuberculosis on digital chest radiographs.
    Improving reference standards for validation of AI-based radiography
    Gavin Duggan
    Joshua Reicher
    Shravya Shetty
    British Journal of Radiology (2021)
    Objective: Demonstrate the importance of combining multiple readers' opinions, in a context-aware manner, when establishing the reference standard for validation of artificial intelligence (AI) applications for, e.g., chest radiographs. By comparing individual readers, majority vote of a panel, and panel-based discussion, we identify methods which maximize interobserver agreement and label reproducibility. Methods: 1100 frontal chest radiographs were evaluated for 6 findings: airspace opacity, cardiomegaly, pulmonary edema, fracture, nodules, and pneumothorax. Each image was reviewed by six radiologists, first individually and then via asynchronous adjudication (web-based discussion) in two panels of three readers to resolve disagreements within each panel. We quantified the reproducibility of each method by measuring interreader agreement. Results: Panel-based majority vote improved agreement relative to individual readers for all findings. Most disagreements were resolved with two rounds of adjudication, which further improved reproducibility for some findings, particularly reducing misses. Improvements varied across finding categories, with adjudication improving agreement for cardiomegaly, fractures, and pneumothorax. Conclusion: The likelihood of interreader agreement, even within panels of US board-certified radiologists, must be considered before reads can be used as a reference standard for validation of proposed AI tools. Agreement and, by extension, reproducibility can be improved by applying majority vote, maximum sensitivity, or asynchronous adjudication for different findings, which supports the development of higher quality clinical research.
    Deep learning for distinguishing normal versus abnormal chest radiographs and generalization to two unseen diseases tuberculosis and COVID-19
    Shahar Jamshy
    Charles Lau
    Eddie Santos
    Atilla Peter Kiraly
    Jie Yang
    Rory Pilgrim
    Sahar Kazemzadeh
    Jin Yu
    Lily Hao Yi Peng
    Krish Eswaran
    Neeral Beladia
    Cameron Chen
    Shravya Ramesh Shetty
    Scientific Reports (2021)
    Chest radiography (CXR) is the most widely used thoracic clinical imaging modality and is crucial for guiding the management of cardiothoracic conditions. The detection of specific CXR findings has been the main focus of several artificial intelligence (AI) systems. However, the wide range of possible CXR abnormalities makes it impractical to detect every possible condition by building multiple separate systems, each of which detects one or more pre-specified conditions. In this work, we developed and evaluated an AI system to classify CXRs as normal or abnormal. For training and tuning the system, we used a de-identified dataset of 248,445 patients from a multi-city hospital network in India. To assess generalizability, we evaluated our system using 6 international datasets from India, China, and the United States. Of these datasets, 4 focused on diseases that the AI was not trained to detect: 2 datasets with tuberculosis and 2 datasets with coronavirus disease 2019. Our results suggest that the AI system trained using a large dataset containing a diverse array of CXR abnormalities generalizes to new patient populations and unseen diseases. In a simulated workflow where the AI system prioritized abnormal cases, the turnaround time for abnormal cases was reduced by 7–28%. These results represent an important step towards evaluating whether AI can be safely used to flag cases in a general setting where previously unseen abnormalities exist. Lastly, to facilitate the continued development of AI models for CXR, we release our collected labels for the publicly available dataset.
    Interpretable Survival Prediction for Colorectal Cancer using Deep Learning
    Melissa Moran
    Markus Plass
    Robert Reihs
    Fraser Tan
    Isabelle Flament
    Trissia Brown
    Peter Regitnig
    Cameron Chen
    Apaar Sadhwani
    Bob MacDonald
    Benny Ayalew
    Lily Hao Yi Peng
    Heimo Mueller
    Zhaoyang Xu
    Martin Stumpe
    Kurt Zatloukal
    Craig Mermel
    npj Digital Medicine (2021)
    Deriving interpretable prognostic features from deep-learning-based prognostic histopathology models remains a challenge. In this study, we developed a deep learning system (DLS) for predicting disease-specific survival for stage II and III colorectal cancer using 3652 cases (27,300 slides). When evaluated on two validation datasets containing 1239 cases (9340 slides) and 738 cases (7140 slides), respectively, the DLS achieved a 5-year disease-specific survival AUC of 0.70 (95% CI: 0.66–0.73) and 0.69 (95% CI: 0.64–0.72), and added significant predictive value to a set of nine clinicopathologic features. To interpret the DLS, we explored the ability of different human-interpretable features to explain the variance in DLS scores. We observed that clinicopathologic features such as T-category, N-category, and grade explained a small fraction of the variance in DLS scores (R² = 18% in both validation sets). Next, we generated human-interpretable histologic features by clustering embeddings from a deep-learning-based image-similarity model and showed that they explained the majority of the variance (R² of 73–80%). Furthermore, the clustering-derived feature most strongly associated with high DLS scores was also highly prognostic in isolation. With a distinct visual appearance (poorly differentiated tumor cell clusters adjacent to adipose tissue), this feature was identified by annotators with 87.0–95.5% accuracy. Our approach can be used to explain predictions from a prognostic deep learning model and uncover potentially novel prognostic features that can be reliably identified by people for future validation studies.
    International evaluation of an AI system for breast cancer screening
    Scott Mayer McKinney
    Varun Yatindra Godbole
    Jonathan Godwin
    Natasha Antropova
    Hutan Ashrafian
    Trevor John Back
    Mary Chesus
    Ara Darzi
    Mozziyar Etemadi
    Florencia Garcia-Vicente
    Fiona J Gilbert
    Mark D Halling-Brown
    Demis Hassabis
    Sunny Jansen
    Dominic King
    David Melnick
    Hormuz Mostofi
    Lily Hao Yi Peng
    Joshua Reicher
    Bernardino Romera Paredes
    Richard Sidebottom
    Mustafa Suleyman
    Kenneth C. Young
    Jeffrey De Fauw
    Shravya Ramesh Shetty
    Nature (2020)
    Screening mammography aims to identify breast cancer at earlier stages of the disease, when treatment can be more successful. Despite the existence of screening programmes worldwide, the interpretation of mammograms is affected by high rates of false positives and false negatives. Here we present an artificial intelligence (AI) system that is capable of surpassing human experts in breast cancer prediction. To assess its performance in the clinical setting, we curated a large representative dataset from the UK and a large enriched dataset from the USA. We show an absolute reduction of 5.7% and 1.2% (USA and UK) in false positives and 9.4% and 2.7% in false negatives. We provide evidence of the ability of the system to generalize from the UK to the USA. In an independent study of six radiologists, the AI system outperformed all of the human readers: the area under the receiver operating characteristic curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the average radiologist by an absolute margin of 11.5%. We ran a simulation in which the AI system participated in the double-reading process that is used in the UK, and found that the AI system maintained non-inferior performance and reduced the workload of the second reader by 88%. This robust assessment of the AI system paves the way for clinical trials to improve the accuracy and efficiency of breast cancer screening.
    End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography
    Diego Ardila
    Atilla Peter Kiraly
    Sujeeth Bharadwaj
    Bokyung Choi
    Joshua Reicher
    Lily Peng
    Mozziyar Etemadi
    David Naidich
    Shravya Ramesh Shetty
    Nature Medicine (2019)
    With an estimated 160,000 deaths in 2018, lung cancer is the most common cause of cancer death in the United States. Lung cancer screening using low-dose computed tomography has been shown to reduce mortality by 20–43% and is now included in US screening guidelines. Existing challenges include inter-grader variability and high false-positive and false-negative rates. We propose a deep learning algorithm that uses a patient’s current and prior computed tomography volumes to predict the risk of lung cancer. Our model achieves a state-of-the-art performance (94.4% area under the curve) on 6,716 National Lung Cancer Screening Trial cases, and performs similarly on an independent clinical validation set of 1,139 cases. We conducted two reader studies. When prior computed tomography imaging was not available, our model outperformed all six radiologists with absolute reductions of 11% in false positives and 5% in false negatives. Where prior computed tomography imaging was available, the model performance was on par with the same radiologists. This creates an opportunity to optimize the screening process via computer assistance and automation. While the vast majority of patients remain unscreened, we show the potential for deep learning models to increase the accuracy, consistency and adoption of lung cancer screening worldwide.
    Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation
    Anna Dagna Majkowska
    Sid Mittal
    Joshua Reicher
    Scott Mayer McKinney
    Gavin Duggan
    Krish Eswaran
    Cameron Chen
    Sreenivasa Raju Kalidindi
    Alexander Ding
    Shravya Ramesh Shetty
    Radiology (2019)
    Background: Deep learning has the potential to augment the use of chest radiography in clinical radiology, but challenges include poor generalizability, spectrum bias, and difficulty comparing across studies. Purpose: To develop and evaluate deep learning models for chest radiograph interpretation by using radiologist-adjudicated reference standards. Materials and Methods: Deep learning models were developed to detect four findings (pneumothorax, opacity, nodule or mass, and fracture) on frontal chest radiographs. This retrospective study used two data sets. Data set 1 (DS1) consisted of 759 611 images from a multicity hospital network, and ChestX-ray14 is a publicly available data set with 112 120 images. Natural language processing and expert review of a subset of images provided labels for 657 954 training images. Test sets consisted of 1818 and 1962 images from DS1 and ChestX-ray14, respectively. Reference standards were defined by radiologist-adjudicated image review. Performance was evaluated by area under the receiver operating characteristic curve analysis, sensitivity, specificity, and positive predictive value. Four radiologists reviewed test set images for performance comparison. Inverse probability weighting was applied to DS1 to account for positive radiograph enrichment and estimate population-level performance. Results: In DS1, population-adjusted areas under the receiver operating characteristic curve for pneumothorax, nodule or mass, airspace opacity, and fracture were, respectively, 0.95 (95% confidence interval [CI]: 0.91, 0.99), 0.72 (95% CI: 0.66, 0.77), 0.91 (95% CI: 0.88, 0.93), and 0.86 (95% CI: 0.79, 0.92). With ChestX-ray14, areas under the receiver operating characteristic curve were 0.94 (95% CI: 0.93, 0.96), 0.91 (95% CI: 0.89, 0.93), 0.94 (95% CI: 0.93, 0.95), and 0.81 (95% CI: 0.75, 0.86), respectively. Conclusion: Expert-level models for detecting clinically relevant chest radiograph findings were developed for this study by using adjudicated reference standards and with population-level performance estimation. Radiologist-adjudicated labels for 2412 ChestX-ray14 validation set images and 1962 test set images are provided.
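    The abstract above mentions inverse probability weighting to correct for a test set enriched with positive radiographs. The following is a minimal sketch of that general idea, not the authors' implementation: each case is weighted by the inverse of its probability of having been sampled into the test set. The function name `ipw_metrics`, the toy labels, and the sampling probabilities are all hypothetical and chosen only for illustration.

```python
def ipw_metrics(labels, preds, sampling_probs):
    """Sensitivity, specificity and PPV, with each case weighted by
    1 / (probability that the case was sampled into the test set)."""
    tp = fp = tn = fn = 0.0
    for y, p, q in zip(labels, preds, sampling_probs):
        w = 1.0 / q  # inverse probability weight
        if y and p:
            tp += w
        elif y and not p:
            fn += w
        elif p:
            fp += w
        else:
            tn += w
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Toy enriched test set (hypothetical numbers): every positive case was kept,
# but only 1 in 10 negative cases was sampled from the population.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [1.0] * 4 + [0.1] * 4

unweighted = ipw_metrics(labels, preds, [1.0] * 8)
adjusted = ipw_metrics(labels, preds, probs)
print(unweighted["ppv"], adjusted["ppv"])
```

    In this toy case, sampling depends only on the true label, so sensitivity and specificity are unchanged by the weighting, while the positive predictive value drops once the under-sampled negatives are weighted back up to their population share.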
    Improving the specificity of lung cancer screening CT using deep learning
    Diego Ardila
    Bokyung Choi
    Atilla Peter Kiraly
    Sujeeth Bharadwaj
    Joshua Reicher
    Lily Peng
    Shravya Ramesh Shetty
    RSNA (2018)
    PURPOSE Evaluate the utility of deep learning to improve the specificity and sensitivity of lung cancer screening with low-dose helical computed tomography (LDCT), relative to the Lung-RADS guidelines. METHOD AND MATERIALS We analyzed 42,943 CT studies from 14,863 patients, 620 of whom developed biopsy-confirmed cancer. All cases were from the National Lung Screening Trial (NLST) study. We randomly split patients into training (70%), tuning (15%), and test (15%) sets. A study was marked "true" if the patient was diagnosed with biopsy-confirmed lung cancer in the same screening year as the study. A deep learning model was trained over 3D CT volumes (400x512x512) as input. We used the 95% specificity operating point based on the tuning set, and evaluated our approach on the test set. To estimate radiologist performance, we retrospectively applied Lung-RADS criteria to each study in the test set. Lung-RADS categories 1 to 2 constitute negative screening results, and categories 3 to 4 constitute positive results. Neither the model nor the Lung-RADS results took into account prior studies, but all screening years were utilized in evaluation. RESULTS The area under the receiver operating characteristic curve of the deep learning model was 94.2% (95% CI 91.0, 96.9). Compared to Lung-RADS on the test set, the trained model achieved a statistically significant absolute 9.2% (95% CI 8.4, 10.1) higher specificity and trended toward a 3.4% (95% CI -5.2, 12.6) higher sensitivity (not statistically significant). Radiologists qualitatively reviewed disagreements between the model and Lung-RADS. Preliminary analysis suggests that the model may be superior in distinguishing scarring from early malignancy. CONCLUSION A deep learning based model improved the specificity of lung cancer screening over Lung-RADS on the NLST dataset and could potentially help reduce unnecessary procedures. This research could supplement future versions of Lung-RADS or support assisted read or second read workflows. CLINICAL RELEVANCE/APPLICATION While Lung-RADS criteria are recommended for lung cancer screening with LDCT, there is still an opportunity to reduce false-positive rates, which lead to unnecessary invasive procedures.
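    The abstract above describes fixing an operating point at 95% specificity on a tuning set and then evaluating on a separate test set. The sketch below illustrates that general procedure under stated assumptions; it is not the authors' code. The function names, the "score > threshold means positive" convention, and the toy scores are hypothetical, and the toy example uses a 90% target so the numbers stay small.

```python
import math


def threshold_at_specificity(scores, labels, target=0.95):
    """Lowest threshold whose specificity on the given (tuning) data is at
    least `target`, where a case is called positive if score > threshold."""
    negatives = sorted(s for s, y in zip(scores, labels) if not y)
    if not negatives:
        raise ValueError("need at least one negative case")
    # Number of negatives that must fall at or below the threshold.
    # The small epsilon guards against floating-point round-up in target * n.
    k = max(1, math.ceil(target * len(negatives) - 1e-9))
    return negatives[k - 1]


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when positives are scores > threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s > threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s <= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s <= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical tuning set: 10 negatives and 2 positives.
tune_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90, 0.60, 0.80]
tune_labels = [0] * 10 + [1] * 2
threshold = threshold_at_specificity(tune_scores, tune_labels, target=0.90)

# The threshold is frozen on the tuning set, then applied unchanged to the test set.
test_scores = [0.2, 0.5, 0.7, 0.3]
test_labels = [0, 0, 1, 1]
test_sens, test_spec = sensitivity_specificity(test_scores, test_labels, threshold)
print(threshold, test_sens, test_spec)
```

    Choosing the threshold on the tuning set and never adjusting it on the test set is what keeps the reported test-set specificity and sensitivity an unbiased estimate of how the fixed operating point would behave on new cases.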