Yun Liu

Yun is a senior staff research scientist at Google Research. In this role he focuses on developing and validating machine learning for medical applications across multiple fields, including pathology, ophthalmology, radiology, and dermatology. Yun completed his PhD at Harvard-MIT Health Sciences and Technology, where he worked on predictive risk modeling using biomedical signals, medical text, and billing codes. He has also worked on predictive modeling for nucleic acid sequences and protein structures. Yun completed a B.S. in Molecular and Cellular Biology and Computer Science at Johns Hopkins University.
Authored Publications
    Performance of a Deep Learning Diabetic Retinopathy Algorithm in India
    Arthur Brant
    Xiang Yin
    Lu Yang
    Jay Nayar
    Divleen Jeji
    Sunny Virmani
    Anchintha Meenu
    Naresh Babu Kannan
    Florence Thng
    Lily Peng
    Ramasamy Kim
    JAMA Network Open (2025)
    Importance: While prospective studies have investigated the accuracy of artificial intelligence (AI) for detection of diabetic retinopathy (DR) and diabetic macular edema (DME), to date, little published data exist on the clinical performance of these algorithms. Objective: To evaluate the clinical performance of an automated retinal disease assessment (ARDA) algorithm in the postdeployment setting at Aravind Eye Hospital in India. Design, Setting, and Participants: This cross-sectional analysis involved an approximately 1% sample of fundus photographs from patients screened using ARDA. Images were graded via adjudication by US ophthalmologists for DR and DME, and ARDA’s output was compared against the adjudicated grades at 45 sites in Southern India. Patients were randomly selected between January 1, 2019, and July 31, 2023. Main Outcomes and Measures: Primary analyses were the sensitivity and specificity of ARDA for severe nonproliferative DR (NPDR) or proliferative DR (PDR). Secondary analyses focused on sensitivity and specificity for sight-threatening DR (STDR) (DME or severe NPDR or PDR). Results: Among the 4537 patients with 4537 images with adjudicated grades, mean (SD) age was 55.2 (11.9) years and 2272 (50.1%) were male. Among the 3941 patients with gradable photographs, 683 (17.3%) had any DR, 146 (3.7%) had severe NPDR or PDR, 109 (2.8%) had PDR, and 398 (10.1%) had STDR. ARDA’s sensitivity and specificity for severe NPDR or PDR were 97.0% (95% CI, 92.6%-99.2%) and 96.4% (95% CI, 95.7%-97.0%), respectively. Positive predictive value (PPV) was 50.7% and negative predictive value (NPV) was 99.9%. The clinically important miss rate for severe NPDR or PDR was 0% (eg, some patients with severe NPDR or PDR were interpreted as having moderate DR and referred to clinic). ARDA’s sensitivity for STDR was 95.9% (95% CI, 93.0%-97.4%) and specificity was 94.9% (95% CI, 94.1%-95.7%); PPV and NPV were 67.9% and 99.5%, respectively.
Conclusions and Relevance: In this cross-sectional study investigating the clinical performance of ARDA, sensitivity and specificity for severe NPDR and PDR exceeded 96%, and ARDA caught 100% of patients with severe NPDR and PDR for ophthalmology referral. This preliminary large-scale postmarketing report of the performance of ARDA after screening 600 000 patients in India underscores the importance of monitoring and publishing an algorithm's clinical performance, consistent with recommendations by regulatory bodies.
    Oculomics: Current Concepts and Evidence
    Zhuoting Zhu
    Yueye Wang
    Ziyi Qi
    Wenyi Hu
    Xiayin Zhang
    Siegfried Wagner
    Yujie Wang
    An Ran Ran
    Joshua Ong
    Ethan Waisberg
    Mouayad Masalkhi
    Alex Suh
    Yih Chung Tham
    Carol Y. Cheung
    Xiaohong Yang
    Honghua Yu
    Zongyuan Ge
    Wei Wang
    Bin Sheng
    Andrew G. Lee
    Alastair Denniston
    Peter van Wijngaarden
    Pearse Keane
    Ching-Yu Cheng
    Mingguang He
    Tien Yin Wong
    Progress in Retinal and Eye Research (2025)
    The eye provides novel insights into general health, as well as pathogenesis and development of systemic diseases. In the past decade, growing evidence has demonstrated that the eye's structure and function mirror multiple systemic health conditions, especially in cardiovascular diseases, neurodegenerative disorders, and kidney impairments. This has given rise to the field of oculomics: the application of ophthalmic biomarkers to understand mechanisms and to detect and predict disease. The development of this field has been accelerated by three major advances: 1) the availability and widespread clinical adoption of high-resolution and non-invasive ophthalmic imaging (“hardware”); 2) the availability of large studies to interrogate associations (“big data”); 3) the development of novel analytical methods, including artificial intelligence (AI) (“software”). Oculomics offers an opportunity to enhance our understanding of the interplay between the eye and the body, while supporting development of innovative diagnostic, prognostic, and therapeutic tools. These advances have been further accelerated by developments in AI, coupled with large-scale linkage datasets linking ocular imaging data with systemic health data. Oculomics also enables the detection, screening, diagnosis, and monitoring of many systemic health conditions. Furthermore, oculomics with AI allows prediction of the risk of systemic diseases, enabling risk stratification, opening up new avenues for prevention and individualized risk prediction, and facilitating personalized medicine. In this review, we summarise current concepts and evidence in the field of oculomics, highlighting the progress that has been made, remaining challenges, and the opportunities for future research.
    Validation of a Deep Learning Model for Diabetic Retinopathy on Patients with Young-Onset Diabetes
    Tony Tan-Torres
    Pradeep Praveen
    Divleen Jeji
    Arthur Brant
    Xiang Yin
    Lu Yang
    Tayyeba Ali
    Ilana Traynis
    Dushyantsinh Jadeja
    Rajroshan Sawhney
    Sunny Virmani
    Pradeep Venkatesh
    Nikhil Tandon
    Ophthalmology and Therapy (2025)
    Introduction: While many deep learning systems (DLSs) for diabetic retinopathy (DR) have been developed and validated on cohorts with an average age in the 50s or older, fewer studies have examined younger individuals. This study aimed to understand DLS performance for younger individuals, who tend to display anatomic differences, such as prominent retinal sheen. This sheen can be mistaken for exudates or cotton wool spots, and potentially confound DLSs. Methods: This was a prospective cross-sectional cohort study in a “Diabetes of young” clinic in India, enrolling 321 individuals between ages 18 and 45 (98.8% with type 1 diabetes). Participants had fundus photographs taken and the photos were adjudicated by experienced graders to obtain reference DR grades. We defined a younger cohort (age 18–25) and an older cohort (age 26–45) and examined differences in DLS performance between the two cohorts. The main outcome measures were sensitivity and specificity for DR. Results: Eye-level sensitivity for moderate-or-worse DR was 97.6% [95% confidence interval (CI) 91.2, 98.2] for the younger cohort and 94.0% [88.8, 98.1] for the older cohort (p = 0.418 for difference). The specificity for moderate-or-worse DR significantly differed between the younger and older cohorts, 97.9% [95.9, 99.3] and 92.1% [87.6, 96.0], respectively (p = 0.008). Similar trends were observed for diabetic macular edema (DME); sensitivity was 79.0% [57.9, 93.6] for the younger cohort and 77.5% [60.8, 90.6] for the older cohort (p = 0.893), whereas specificity was 97.0% [94.5, 99.0] and 92.0% [88.2, 95.5] (p = 0.018). Retinal sheen presence (94% of images) was associated with DME presence (p < 0.0001). Image review suggested that sheen presence confounded reference DME status, increasing noise in the labels and depressing measured sensitivity. The gradability rate for both DR and DME was near-perfect (99% for both).
Conclusion: DLS-based DR screening performed well in younger individuals aged 18–25, with comparable sensitivity and higher specificity compared to individuals aged 26–45. Sheen presence in this cohort made identification of DME difficult for graders and depressed measured DLS sensitivity; additional studies incorporating optical coherence tomography may improve accuracy of measuring DLS DME sensitivity.
    Passive Heart Rate Monitoring During Smartphone Use in Everyday Life
    Shun Liao
    Paolo Di Achille
    Jiang Wu
    Jonathan Wang
    Eric Teasley
    Lawrence Cai
    Daniel McDuff
    Hao-Wei Su
    Brent Winslow
    Anupam Pathak
    Shwetak Patel
    Jim Taylor
    Jamie Rogers
    (2025)
    Resting heart rate (RHR) is an important biomarker of cardiovascular health and mortality, but tracking it longitudinally generally requires a wearable device, limiting its availability. We present PHRM, a deep learning system for passive heart rate (HR) and RHR measurements during ordinary smartphone use, using facial video-based photoplethysmography. Our system was developed using 225,773 videos from 495 participants and validated on 185,970 videos from 205 participants in laboratory and free-living conditions, the largest validation study of its kind. Compared to reference electrocardiogram, PHRM achieved a mean absolute percentage error (MAPE) <10% for HR measurements across three skin tone groups of light, medium and dark pigmentation; MAPE for each skin tone group was non-inferior versus the others. Daily RHR measured by PHRM had a mean absolute error <5 bpm compared to a wearable HR tracker, and was associated with known risk factors. These results highlight the potential of smartphones to enable passive and equitable heart health monitoring.
    Prospective Multi-Site Validation of AI to Detect Tuberculosis and Chest X-Ray Abnormalities
    Sahar Kazemzadeh
    Atilla Kiraly
    Nsala Sanjase
    Minyoi Maimbolwa
    Brian Shuma
    Shahar Jamshy
    Christina Chen
    Arnav Agharwal
    Chuck Lau
    Daniel Golden
    Jin Yu
    Eric Wu
    Kat Chou
    Shravya Shetty
    Krish Eswaran
    Rory Pilgrim
    Monde Muyoyeta
    NEJM AI (2024)
    Background: Using artificial intelligence (AI) to interpret chest X-rays (CXRs) could support accessible triage tests for active pulmonary tuberculosis (TB) in resource-constrained settings. Methods: The performance of two cloud-based CXR AI systems (one to detect TB and the other to detect CXR abnormalities) in a population with a high TB and human immunodeficiency virus (HIV) burden was evaluated. We recruited 1978 adults who had TB symptoms, were close contacts of known TB patients, or were newly diagnosed with HIV at three clinical sites. The TB-detecting AI (TB AI) scores were converted to binary using two thresholds: a high-sensitivity threshold and an exploratory threshold designed to resemble radiologist performance. Ten radiologists reviewed images for signs of TB, blinded to the reference standard. Primary analysis measured AI detection noninferiority to radiologist performance. Secondary analysis evaluated AI detection as compared with the World Health Organization (WHO) targets (90% sensitivity, 70% specificity). Both used an absolute margin of 5%. The abnormality-detecting AI (abnormality AI) was evaluated for noninferiority to a high-sensitivity target suitable for triaging (90% sensitivity, 50% specificity). Results: Of the 1910 patients analyzed, 1827 (96%) had conclusive TB status, of whom 649 (36%) were HIV positive and 192 (11%) were TB positive. The TB AI’s sensitivity and specificity were 87% and 70%, respectively, at the high-sensitivity threshold and 78% and 82%, respectively, at the balanced threshold. Radiologists’ mean sensitivity was 76% and mean specificity was 82%. At the high-sensitivity threshold, the TB AI was noninferior to average radiologist sensitivity (P<0.001) but not to average radiologist specificity (P=0.99) and was higher than the WHO target for specificity but not sensitivity. At the balanced threshold, the TB AI was comparable to radiologists.
The abnormality AI’s sensitivity and specificity were 97% and 79%, respectively, with both meeting the prespecified targets. Conclusions: The CXR TB AI was noninferior to radiologists for active pulmonary TB triaging in a population with a high TB and HIV burden. Neither the TB AI nor the radiologists met WHO recommendations for sensitivity in the study population. AI can also be used to detect other CXR abnormalities in the same population.
    Differences between Patient and Clinician Submitted Images: Implications for Virtual Care of Skin Conditions
    Rajeev Rikhye
    Grace Eunhae Hong
    Margaret Ann Smith
    Aaron Loh
    Vijaytha Muralidharan
    Doris Wong
    Michelle Phung
    Nicolas Betancourt
    Bradley Fong
    Rachna Sahasrabudhe
    Khoban Nasim
    Alec Eschholz
    Kat Chou
    Peggy Bui
    Justin Ko
    Steven Lin
    Mayo Clinic Proceedings: Digital Health (2024)
    Objective: To understand and highlight the differences in clinical, demographic, and image quality characteristics between patient-taken (PAT) and clinic-taken (CLIN) photographs of skin conditions. Patients and Methods: This retrospective study applied logistic regression to data from 2500 deidentified cases in Stanford Health Care’s eConsult system, from November 2015 to January 2021. Cases with undiagnosable or multiple conditions or cases with both patient and clinician image sources were excluded, leaving 628 PAT cases and 1719 CLIN cases. Demographic characteristic factors, such as age and sex, were self-reported, whereas anatomic location, estimated skin type, clinical signs and symptoms, condition duration, and condition frequency were summarized from patient health records. Image quality variables such as blur, lighting issues, and whether the image contained skin, hair, or nails were estimated through a deep learning model. Results: Factors that were positively associated with CLIN photographs post-2020 were as follows: age 60 years or older, darker skin types (eFST V/VI), and presence of skin growths. By contrast, factors that were positively associated with PAT photographs include conditions appearing intermittently, cases with blurry photographs, photographs with substantial nonskin (or nail/hair) regions, and cases with more than 3 photographs. Within the PAT cohort, older age was associated with blurry photographs. Conclusion: There are various demographic, clinical, and image quality characteristic differences between PAT and CLIN photographs of skin concerns. The demographic characteristic differences present important considerations for improving digital literacy or access, whereas the image quality differences point to the need for improved patient education and better image capture workflows, particularly among elderly patients.
    Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the United States and Japan
    Atilla Kiraly
    Corbin Cunningham
    Ryan Najafi
    Jie Yang
    Chuck Lau
    Diego Ardila
    Scott Mayer McKinney
    Rory Pilgrim
    Mozziyar Etemadi
    Sunny Jansen
    Lily Peng
    Shravya Shetty
    Neeral Beladia
    Krish Eswaran
    Radiology: Artificial Intelligence (2024)
    Lung cancer is the leading cause of cancer death worldwide, with 1.8 million deaths in 2020. Studies have concluded that low-dose computed tomography lung cancer screening can reduce mortality by up to 61%, and updated 2021 US guidelines expanded eligibility. As screening efforts rise, AI can play an important role, but must be unobtrusively integrated into existing clinical workflows. In this work, we introduce a state-of-the-art, cloud-based AI system providing lung cancer risk assessments without requiring any user input. We demonstrate its efficacy in assisting lung cancer screening under both US and Japanese screening settings using different patient populations and screening protocols. Technical improvements over a previously described system include a focus on earlier cancer detection for improved accuracy, introduction of an effective assistive user interface, and a system designed to integrate into typical clinical workflows. The stand-alone AI system was evaluated on 3085 individuals achieving area under the curve (AUC) scores of 91.7% (95%CI [89.6, 95.2]), 93.3% (95%CI [90.2, 95.7]), and 89.1% (95%CI [77.7, 97.3]) on three datasets (two from US and one from Japan), respectively. To evaluate the system’s assistive ability, we conducted two retrospective multi-reader multi-case studies on 627 cases read by experienced board-certified radiologists (average 20 years of experience [7,40]) using local PACS systems in the respective US and Japanese screening settings. The studies measured the reader’s level of suspicion (LoS) and categorical responses for scores and management recommendations under country-specific screening protocols. The radiologists’ AUC for LoS increased with AI assistance by 2.3% (95%CI [0.1-4.5], p=0.022) for the US study and by 2.3% (95%CI [-3.5-8.1], p=0.179) for the Japan study.
Specificity for recalls increased by 5.5% (95%CI [2.7-8.5], p<0.0001) for the US and 6.7% (95%CI [4.7-8.7], p<0.0001) for the Japan study. No significant reduction in other metrics occurred. This work advances the state of the art in lung cancer detection, introduces generalizable interface concepts that can be applicable to similar AI applications, and demonstrates its potential impact on diagnostic AI in global lung cancer screening with results suggesting a substantial drop in unnecessary follow-up procedures without impacting sensitivity.
    Background: Skin conditions are extremely common worldwide, and are an important cause of both anxiety and morbidity. Since the advent of the internet, individuals have used text-based search (eg, “red rash on arm”) to learn more about concerns on their skin, but this process is often hindered by the inability to accurately describe the lesion’s morphology. In this study, we surveyed respondents’ experiences with an image-based search, compared to the traditional text-based search experience. Methods: An internet-based survey was conducted to evaluate the experience of text-based vs image-based search for skin conditions. We recruited respondents from an existing cohort of volunteers in a commercial survey panel; survey respondents who met inclusion/exclusion criteria, including willingness to take photos of a visible concern on their body, were enrolled. Respondents were asked to use the Google mobile app to conduct both regular text-based search (Google Search) and image-based search (Google Lens) for their concern, with the order of text vs. image search randomized. Satisfaction for each search experience along six different dimensions was recorded and compared, and respondents’ preferences for the different search types along these same six dimensions were recorded. Results: 372 respondents were enrolled in the study, with 44% self-identifying as women, 86% as White and 41% over age 45. The rates of respondents who were at least moderately familiar with searching for skin conditions using text-based search and image-based search were 81.5% and 63.5%, respectively. After using both search modalities, respondents were highly satisfied with both image-based and text-based search, with >90% at least somewhat satisfied in each dimension and no significant differences seen between text-based and image-based search when examining the responses on an absolute scale per search modality.
When asked to directly rate their preferences in a comparative way, survey respondents preferred image-based search over text-based search in 5 out of 6 dimensions, with an absolute 9.9% more preferring image-based search over text-based search overall (p=0.004). 82.5% (95% CI 78.2 - 86.3) reported a preference to leverage image-based search (alone or in combination with text-based search) in future searches. Of those who would prefer to use a combination of both, 64% indicated they would like to start with image-based search, indicating that image-based search may be the preferred entry point for skin-related searches. Conclusion: Despite being less familiar with image-based search upon study inception, survey respondents generally preferred image-based search to text-based search and overwhelmingly wanted to include this in future searches. These results suggest the potential for image-based search to play a key role in people searching for information regarding skin concerns.
    Towards a Personal Health Large Language Model
    Anastasiya Belyaeva
    Nick Furlotte
    Zhun Yang
    Chace Lee
    Erik Schenck
    Yojan Patel
    Jian Cui
    Logan Schneider
    Robby Bryant
    Ryan Gomes
    Allen Jiang
    Roy Lee
    Javier Perez
    Jamie Rogers
    Cathy Speed
    Shyam Tailor
    Megan Walker
    Jeffrey Yu
    Tim Althoff
    Conor Heneghan
    Mark Malhotra
    Shwetak Patel
    Shravya Shetty
    Jiening Zhan
    Yeswanth Subramanian
    Daniel McDuff
    arXiv (2024)
    Large language models (LLMs) can retrieve, reason over, and make inferences about a wide range of information. In health, most LLM efforts to date have focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into clinical tasks, provide a rich, continuous, and longitudinal source of data relevant for personal health monitoring. Here we present a new model, Personal Health Large Language Model (PH-LLM), a version of Gemini fine-tuned for text understanding and reasoning over numerical time-series personal health data for applications in sleep and fitness. To systematically evaluate PH-LLM, we created and curated three novel benchmark datasets that test 1) production of personalized insights and recommendations from measured sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep quality outcomes. For the insights and recommendations tasks we created 857 case studies in sleep and fitness. These case studies, designed in collaboration with domain experts, represent real-world scenarios and highlight the model’s capabilities in understanding and coaching. Through comprehensive human and automatic evaluation of domain-specific rubrics, we observed that both Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness and, while experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. To further assess expert domain knowledge, we evaluated PH-LLM performance on multiple choice question examinations in sleep medicine and fitness. PH-LLM achieved 79% on sleep (N=629 questions) and 88% on fitness (N=99 questions), both of which exceed average scores from a sample of human experts as well as benchmarks for receiving continuing credit in those domains.
To enable PH-LLM to predict self-reported assessments of sleep quality, we trained the model to predict self-reported sleep disruption and sleep impairment outcomes from textual and multimodal encoding representations of wearable sensor data. We demonstrate that multimodal encoding is both necessary and sufficient to match performance of a suite of discriminative models to predict these outcomes. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge base and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications as done with PH-LLM.
    Health AI Developer Foundations
    Atilla Kiraly
    Sebastien Baur
    Kenneth Philbrick
    Fereshteh Mahvar
    Liron Yatziv
    Tiffany Chen
    Bram Sterling
    Nick George
    Fayaz Jamil
    Jing Tang
    Kai Bailey
    Akshay Goel
    Abbi Ward
    Lin Yang
    Shravya Shetty
    Daniel Golden
    Tim Thelin
    Rory Pilgrim
    Can "John" Kirmizi
    arXiv (2024)
    Robust medical Machine Learning (ML) models have the potential to revolutionize healthcare by accelerating clinical research, improving workflows and outcomes, and producing novel insights or capabilities. Developing such ML models from scratch is cost-prohibitive and requires substantial compute, data, and time (e.g., expert labeling). To address these challenges, we introduce Health AI Developer Foundations (HAI-DEF), a suite of pre-trained, domain-specific foundation models, tools, and recipes to accelerate building ML for health applications. The models cover various modalities and domains, including radiology (X-rays and computed tomography), histopathology, dermatological imaging, and audio. These models provide domain-specific embeddings that facilitate AI development with less labeled data, shorter training times, and reduced computational costs compared to traditional approaches. In addition, we utilize a common interface and style across these models, and prioritize usability to enable developers to integrate HAI-DEF efficiently. We present model evaluations across various tasks and conclude with a discussion of their application and evaluation, covering the importance of ensuring efficacy, fairness, and equity. Finally, while HAI-DEF and specifically the foundation models lower the barrier to entry for ML in healthcare, we emphasize the importance of validation with problem- and population-specific data for each desired usage setting. This technical report will be updated over time as more modalities and features are added.