
Justin Cosentino
Research Areas
Authored Publications
Sort By
Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction
Babak Behsaz
Zachary Ryan Mccaw
Davin Hill
Robert Luben
Dongbing Lai
John Bates
Howard Yang
Tae-Hwi Schwantes-An
Yuchen Zhou
Anthony Khawaja
Andrew Carroll
Brian Hobbs
Michael Cho
Nature Genetics (2024)
Preview abstract
Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very few labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD—spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.
View details
Towards a Personal Health Large Language Model
Anastasiya Belyaeva
Nick Furlotte
Zhun Yang
Chace Lee
Erik Schenck
Yojan Patel
Jian Cui
Logan Schneider
Robby Bryant
Ryan Gomes
Allen Jiang
Roy Lee
Javier Perez
Jamie Rogers
Cathy Speed
Shyam Tailor
Megan Walker
Jeffrey Yu
Tim Althoff
Conor Heneghan
Mark Malhotra
Shwetak Patel
Shravya Shetty
Jiening Zhan
Yeswanth Subramanian
Daniel McDuff
arXiv (2024)
Preview abstract
Large language models (LLMs) can retrieve, reason over, and make inferences about a wide range of information. In health, most LLM efforts to date have focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into clinical tasks, provide a rich, continuous, and longitudinal source of data relevant for personal health monitoring. Here we present a new model, Personal Health Large Language Model (PH-LLM), a version of Gemini fine-tuned for text understanding and reasoning over numerical time-series personal health data for applications in sleep and fitness. To systematically evaluate PH-LLM, we created and curated three novel benchmark datasets that test 1) production of personalized insights and recommendations from measured sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep quality outcomes. For the insights and recommendations tasks we created 857 case studies in sleep and fitness. These case studies, designed in collaboration with domain experts, represent real-world scenarios and highlight the model’s capabilities in understanding and coaching. Through comprehensive human and automatic evaluation of domain-specific rubrics, we observed that both Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness and, while experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. To further assess expert domain knowledge, we evaluated PH-LLM performance on multiple choice question examinations in sleep medicine and fitness. PH-LLM achieved 79% on sleep (N=629 questions) and 88% on fitness (N=99 questions), both of which exceed average scores from a sample of human experts as well as benchmarks for receiving continuing credit in those domains. To enable PH-LLM to predict self-reported assessments of sleep quality, we trained the model to predict self-reported sleep disruption and sleep impairment outcomes from textual and multimodal encoding representations of wearable sensor data. We demonstrate that multimodal encoding is both necessary and sufficient to match performance of a suite of discriminative models to predict these outcomes. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge base and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications as done with PH-LLM.
View details
Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models
Babak Behsaz
Babak Alipanahi
Zachary Ryan Mccaw
Davin Hill
Tae-Hwi Schwantes-An
Dongbing Lai
Andrew Carroll
Brian Hobbs
Michael Cho
Nature Genetics (2023)
Preview abstract
Chronic obstructive pulmonary disease (COPD), the third leading cause of death worldwide, is highly heritable. While COPD is clinically defined by applying thresholds to summary measures of lung function, a quantitative liability score has more power to identify genetic signals. Here we train a deep convolutional neural network on noisy self-reported and International Classification of Diseases labels to predict COPD case-control status from high-dimensional raw spirograms and use the model's predictions as a liability score. The machine-learning-based (ML-based) liability score accurately discriminates COPD cases and controls, and predicts COPD-related hospitalization without any domain-specific knowledge. Moreover, the ML-based liability score is associated with overall survival and exacerbation events. A genome-wide association study on the ML-based liability score replicates existing COPD and lung function loci and also identifies 67 new loci. Lastly, our method provides a general framework to use ML methods and medical-record-based labels that does not require domain knowledge or expert curation to improve disease prediction and genomic discovery for drug design.
View details
Multimodal LLMs for health grounded in individual-specific data
Anastasiya Belyaeva
Krish Eswaran
Shravya Shetty
Andrew Carroll
Nick Furlotte
ICML Workshop on Machine Learning for Multimodal Healthcare Data (2023)
Preview abstract
Large language models (LLMs) have shown an impressive ability to solve tasks in a wide range of fields including health. Within the health domain, there are many data modalities that are relevant to an individual’s health status. To effectively solve tasks related to individual health, LLMs will need the ability to use a diverse set of features as context. However, the best way to encode and inject complex high-dimensional features into the input stream of an LLM remains an active area of research. Here, we explore the ability of a foundation LLM to estimate disease risk given health-related input features. First, we evaluate serialization of structured individual-level health data into text along with in context learning and prompt tuning approaches. We find that the LLM performs better than random in the zero-shot and few-shot cases, and has comparable and often equivalent performance to baseline after prompt tuning. Next, we propose a way to encode complex non-text data modalities into the token embedding space and then use this encoding to construct multimodal sentences. We show that this multimodal LLM achieves better or equivalent performance compared to baseline models. Overall, our results show the potential for using multi-modal LLMs grounded in individual health data to solve complex tasks such as risk prediction.
View details
Preview abstract
Genome-wide association studies (GWAS) are used to identify genetic variants significantly correlated with a target disease or phenotype as a first step to detect potentially causal genes. The availability of high-dimensional biomedical data in population-scale biobanks has enabled novel machine-learning-based phenotyping approaches in which machine learning (ML) algorithms rapidly and accurately phenotype large cohorts with both genomic and clinical data, increasing the statistical power to detect variants associated with a given phenotype. While recent work has demonstrated that these methods can be extended to diseases for which only low quality medical-record-based labels are available, it is not possible to quantify changes in statistical power since the underlying ground-truth liability scores for the complex, polygenic diseases represented by these medical-record-based phenotypes is unknown. In this work, we aim to empirically study the robustness of ML-based phenotyping procedures to label noise by applying varying levels of random noise to vertical cup-to-disc ratio (VCDR), a quantitative feature of the optic nerve that is predictable from color fundus imagery and strongly influences glaucoma referral risk. We show that the ML-based phenotyping procedure recovers the underlying liability score across noise levels, significantly improving genetic discovery and PRS predictive power relative to noisy equivalents. Furthermore, initial denoising experiments show promising preliminary results, suggesting that improving such methods will yield additional gains.
View details
Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology
Babak Alipanahi
Babak Behsaz
Zachary Ryan Mccaw
Emanuel Schorsch
D. Sculley
Lizzie Dorfman
Sonia Phene
Andrew Walker Carroll
Anthony Khawaja
American Journal of Human Genetics (2021)
Preview abstract
Genome-wide association studies (GWAS) require accurate cohort phenotyping, but expert labeling can be costly, time-intensive, and variable. Here we develop a machine learning (ML) model to predict glaucomatous features from color fundus photographs. We used the model to predict vertical cup-to-disc ratio (VCDR), a diagnostic parameter and cardinal endophenotype for glaucoma, in 65,680 Europeans in the UK Biobank (UKB). A GWAS of ML-based VCDR identified 299 independent genome-wide significant (GWS; P≤5×10-8) hits in 156 loci. The ML-based GWAS replicated 62 of 65 GWS loci from a recent VCDR GWAS in the UKB for which two ophthalmologists manually labeled images for 67,040 Europeans. The ML-based GWAS also identified 93 novel loci, significantly expanding our understanding of the genetic etiologies of glaucoma and VCDR. Pathway analyses support the biological significance of the novel hits to VCDR, with select loci near genes involved in neuronal and synaptic biology or known to cause severe Mendelian ophthalmic disease. Finally, the ML-based GWAS results significantly improve polygenic prediction of VCDR in independent datasets.
View details