Cory McLean
Cory is a senior staff software engineer in Google Research who leads the Genomics research team. His research interests broadly include applying machine learning to the analysis and interpretation of genomic data and publishing tools and methods as open-source software. Prior to Google, Cory was at 23andMe where he developed algorithms and tools to improve identity-by-descent detection, haplotype phasing, and genotype imputation, and the application of genetic association study results to drug development. Cory received a PhD in computer science from Stanford, where he developed computational methods to understand vertebrate gene regulation, and a BS in computer science from MIT.
Research Areas
Authored Publications
Sort By
Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction
Babak Behsaz
Zachary Ryan Mccaw
Davin Hill
Robert Luben
Dongbing Lai
John Bates
Howard Yang
Tae-Hwi Schwantes-An
Yuchen Zhou
Anthony Khawaja
Andrew Carroll
Brian Hobbs
Michael Cho
Nature Genetics (2024)
Preview abstract
Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very few labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD—spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.
View details
Predicting Cardiovascular Disease Risk using Photoplethysmography and Deep Learning
Sebastien Baur
Christina Chen
Mariam Jabara
Babak Behsaz
Shravya Shetty
Goodarz Danaei
Diego Ardila
PLOS Glob Public Health, 4(6) (2024), e0003204
Preview abstract
Cardiovascular diseases (CVDs) are responsible for a large proportion of premature deaths in low- and middle-income countries. Early CVD detection and intervention is critical in these populations, yet many existing CVD risk scores require a physical examination or lab measurements, which can be challenging in such health systems due to limited accessibility. We investigated the potential to use photoplethysmography (PPG), a sensing technology available on most smartphones that can potentially enable large-scale screening at low cost, for CVD risk prediction. We developed a deep learning PPG-based CVD risk score (DLS) to predict the probability of having major adverse cardiovascular events (MACE: non-fatal myocardial infarction, stroke, and cardiovascular death) within ten years, given only age, sex, smoking status and PPG as predictors. We compare the DLS with the office-based refit-WHO score, which adopts the shared predictors from WHO and Globorisk scores (age, sex, smoking status, height, weight and systolic blood pressure) but refitted on the UK Biobank (UKB) cohort. All models were trained on a development dataset (141,509 participants) and evaluated on a geographically separate test (54,856 participants) dataset, both from UKB. DLS’s C-statistic (71.1%, 95% CI 69.9–72.4) is non-inferior to office-based refit-WHO score (70.9%, 95% CI 69.7–72.2; non-inferiority margin of 2.5%, p<0.01) in the test dataset. The calibration of the DLS is satisfactory, with a 1.8% mean absolute calibration error. Adding DLS features to the office-based score increases the C-statistic by 1.0% (95% CI 0.6–1.4). DLS predicts ten-year MACE risk comparable with the office-based refit-WHO score. Interpretability analyses suggest that the DLS-extracted features are related to PPG waveform morphology and are independent of heart rate. Our study provides a proof-of-concept and suggests the potential of a PPG-based approach strategies for community-based primary prevention in resource-limited regions.
View details
Longitudinal fundus imaging and its genome-wide association analysis provides evidence for a human retinal aging clock
Sara Ahadi
Kenneth A Wilson Jr,
Orion Pritchard
Ajay Kumar
Enrique M Carrera
Ricardo Lamy
Jay M Stewart
Avinash Varadarajan
Pankaj Kapahi
Ali Bashir
eLife (2023)
Preview abstract
Background
Biological age, distinct from an individual’s chronological age, has been studied extensively through predictive aging clocks. However, these clocks have limited accuracy in short time-scales. Deep learning approaches on imaging datasets of the eye have proven powerful for a variety of quantitative phenotype inference and provide an opportunity to explore organismal aging and tissue health.
Methods
Here we trained deep learning models on fundus images from the EyePacs dataset to predict individuals’ chronological age. These predictions lead to the concept of a retinal aging clock which we then employed for a series of downstream longitudinal analyses. The retinal aging clock was used to assess the predictive power of aging inference, termed eyeAge, on short time-scales using longitudinal fundus imaging data from a subset of patients. Additionally, the model was applied to a separate cohort from the UK Biobank to validate the model and perform a GWAS. The top candidate gene was then tested in a fly model of eye aging.
Findings
EyeAge was able to predict the age with a mean absolute error of 3.26 years, which is much less than other aging clocks. Additionally, eyeAge was highly independent of blood marker-based measures of biological age (e.g. “phenotypic age”), maintaining a hazard ratio of 1.026 even in the presence of phenotypic age. Longitudinal studies showed that the resulting models were able to predict individuals’ aging, in time-scales less than a year with 71% accuracy. Notably, we observed a significant individual-specific component to the prediction. This observation was confirmed with the identification of multiple GWAS hits in the independent UK Biobank cohort. The knockdown of the top hit, ALKAL2, which was previously shown to extend lifespan in flies, also slowed age-related decline in vision in flies.
Interpretation
In conclusion, predicted age from retinal images can be used as a biomarker of biological aging in a given individual independently from phenotypic age. This study demonstrates the utility of retinal aging clock for studying aging and age-related diseases and quantitatively measuring aging on very short time-scales, potentially opening avenues for quick and actionable evaluation of gero-protective therapeutics.
View details
Multimodal LLMs for health grounded in individual-specific data
Anastasiya Belyaeva
Krish Eswaran
Shravya Shetty
Andrew Carroll
Nick Furlotte
ICML Workshop on Machine Learning for Multimodal Healthcare Data (2023)
Preview abstract
Large language models (LLMs) have shown an impressive ability to solve tasks in a wide range of fields including health. Within the health domain, there are many data modalities that are relevant to an individual’s health status. To effectively solve tasks related to individual health, LLMs will need the ability to use a diverse set of features as context. However, the best way to encode and inject complex high-dimensional features into the input stream of an LLM remains an active area of research. Here, we explore the ability of a foundation LLM to estimate disease risk given health-related input features. First, we evaluate serialization of structured individual-level health data into text along with in context learning and prompt tuning approaches. We find that the LLM performs better than random in the zero-shot and few-shot cases, and has comparable and often equivalent performance to baseline after prompt tuning. Next, we propose a way to encode complex non-text data modalities into the token embedding space and then use this encoding to construct multimodal sentences. We show that this multimodal LLM achieves better or equivalent performance compared to baseline models. Overall, our results show the potential for using multi-modal LLMs grounded in individual health data to solve complex tasks such as risk prediction.
View details
Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models
Babak Behsaz
Babak Alipanahi
Zachary Ryan Mccaw
Davin Hill
Tae-Hwi Schwantes-An
Dongbing Lai
Andrew Carroll
Brian Hobbs
Michael Cho
Nature Genetics (2023)
Preview abstract
Chronic obstructive pulmonary disease (COPD), the third leading cause of death worldwide, is highly heritable. While COPD is clinically defined by applying thresholds to summary measures of lung function, a quantitative liability score has more power to identify genetic signals. Here we train a deep convolutional neural network on noisy self-reported and International Classification of Diseases labels to predict COPD case-control status from high-dimensional raw spirograms and use the model's predictions as a liability score. The machine-learning-based (ML-based) liability score accurately discriminates COPD cases and controls, and predicts COPD-related hospitalization without any domain-specific knowledge. Moreover, the ML-based liability score is associated with overall survival and exacerbation events. A genome-wide association study on the ML-based liability score replicates existing COPD and lung function loci and also identifies 67 new loci. Lastly, our method provides a general framework to use ML methods and medical-record-based labels that does not require domain knowledge or expert curation to improve disease prediction and genomic discovery for drug design.
View details
Accurate human genome analysis with Element Avidity sequencing
Andrew Carroll
Bryan Lajoie
Daniel Cook
Kelly N. Blease
Kishwar Shafin
Lucas Brambrink
Maria Nattestad
Semyon Kruglyak
bioRxiv (2023)
Preview abstract
We investigate the new sequencing technology Avidity from Element Biosciences. We show that Avidity whole genome sequencing matches mapping and variant calling accuracy with Illumina at high coverages (30x-50x) and is noticeably more accurate at lower coverages (20x-30x). We quantify base error rates of Element reads, finding lower error rates, especially in homopolymer and tandem repeat regions. We use Element’s ability to generate paired end sequencing with longer insert sizes than typical short–read sequencing. We show that longer insert sizes result in even higher accuracy, with long insert Element sequencing giving noticeably more accurate genome analyses at all coverages.
View details
DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer
Aaron Wenger
Andrew Walker Carroll
Armin Töpfer
Ashish Teku Vaswani
Daniel Cook
Felipe Llinares
Gunjan Baid
Howard Cheng-Hao Yang
Jean-Philippe Vert
Kishwar Shafin
Maria Nattestad
Waleed Ammar
William J. Rowell
Nature Biotechnology (2022)
Preview abstract
Genomic analysis requires accurate sequencing in sufficient coverage and over difficult genome regions. Through repeated sampling of a circular template, Pacific Biosciences developed long (10-25kb) reads with high overall accuracy, but lower homopolymer accuracy. Here, we introduce DeepConsensus, a transformer-based approach which leverages a unique alignment loss to correct sequencing errors. DeepConsensus reduces errors in PacBio HiFi reads by 42%, compared to the current approach. We show this increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27%, and at Q40 by 90%. With two SMRT cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9Mb to 17.2Mb), increase gene completeness (94% to 97%), reduce false gene duplication rate (1.1% to 0.5%), and improve assembly base accuracy (QV43 to QV45), and also reduce variant calling errors by 24%.
View details
How DeepConsensus Works
Aaron Wenger
Anastasiya Belyaeva
Andrew Carroll
Armin Töpfer
Ashish Teku Vaswani
Daniel Cook
Felipe Llinares
Gunjan Baid
Howard Yang
Jean-Philippe Vert
Kishwar Shafin
Maria Nattestad
Waleed Ammar
William J. Rowell
(2022)
Preview abstract
N/A
These are slides for a public video about DeepConsensus
View details
DeepNull models non-linear covariate effects to improve phenotypic prediction and association power
Andrew Carroll
Babak Alipanahi
Zachary Ryan Mccaw
Nick Furlotte
Nature Communications (2022)
Preview abstract
Genome-wide association studies (GWAS) examine the association between genotype and phenotype while adjusting for a set of covariates. Although the covariates may have non-linear or interactive effects, due to the challenge of specifying the model, GWAS often neglect such terms. Here we introduce DeepNull, a method that identifies and adjusts for non-linear and interactive covariate effects using a deep neural network. In analyses of simulated and real data, we demonstrate that DeepNull maintains tight control of the type I error while increasing statistical power by up to 20% in the presence of non-linear and interactive effects. Moreover, in the absence of such effects, DeepNull incurs no loss of power. When applied to 10 phenotypes from the UK Biobank (n=370K), DeepNull discovered more hits (+6%) and loci (+7%), on average, than conventional association analyses, many of which are biologically plausible or have previously been reported. Finally, DeepNull improves upon linear modeling for phenotypic prediction (+23% on average).
View details
Knowledge distillation for fast and accurate DNA sequence correction
Anastasiya Belyaeva
Joel Shor
Daniel Cook
Kishwar Shafin
Daniel Liu
Armin Töpfer
Aaron Wenger
William J. Rowell
Howard Yang
Andrew Carroll
Maria Nattestad
Learning Meaningful Representations of Life (LMRL) Workshop NeurIPS 2022
Preview abstract
Accurate genome sequencing can improve our understanding of biology and the genetic basis of disease. The standard approach for generating DNA sequences from PacBio instruments relies on HMM-based models. Here, we introduce Distilled DeepConsensus - a distilled transformer–encoder model for sequence correction, which improves upon the HMM-based methods with runtime constraints in mind. Distilled DeepConsensus is 1.3x faster and 1.5x smaller than its larger counterpart while improving the yield of high quality reads (Q30) over the HMM-based method by 1.69x (vs. 1.73x for larger model). With improved accuracy of genomic sequences, Distilled DeepConsensus improves downstream applications of genomic sequence analysis such as reducing variant calling errors by 39% (34% for larger model) and improving genome assembly quality by 3.8% (4.2% for larger model). We show that the representations learned by Distilled DeepConsensus are similar between faster and slower models.
View details