Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

Taedong Yun; Justin Cosentino; Babak Behsaz; Zachary Ryan McCaw; Davin Hill; Robert Luben; Dongbing Lai; John Bates; Howard Yang; Tae-Hwi Schwantes-An; Yuchen Zhou; Anthony Khawaja; Andrew Carroll; Brian Hobbs; Michael Cho; Cory Y. McLean; Farhad Hormozdiari

Nature Genetics (2024)

Abstract

Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very little labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD: spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE embeddings are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE embeddings contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.
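
The second stage described in the abstract, running association tests with each embedding coordinate as a phenotype, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `gwas_on_embeddings`, the simulated genotypes, and the random stand-in embeddings (used here in place of the output of a trained variational autoencoder) are all assumptions made for illustration. The test is a plain marginal linear regression, whereas a production GWAS would typically also adjust for covariates such as age, sex, and ancestry principal components.

```python
import numpy as np


def gwas_on_embeddings(genotypes, embeddings):
    """Marginal linear regression of each embedding coordinate on each variant.

    genotypes:  (n_samples, n_variants) allele dosages in {0, 1, 2}
    embeddings: (n_samples, n_dims) latent coordinates, one row per individual
    Returns (beta, t_stat), each of shape (n_variants, n_dims).
    Assumes no variant is monomorphic (that would divide by zero).
    """
    n = genotypes.shape[0]
    g = genotypes - genotypes.mean(axis=0)    # centre each variant
    y = embeddings - embeddings.mean(axis=0)  # centre each coordinate
    g_ss = (g ** 2).sum(axis=0)               # per-variant sum of squares
    y_ss = (y ** 2).sum(axis=0)               # per-coordinate sum of squares
    beta = (g.T @ y) / g_ss[:, None]          # OLS slope per (variant, coordinate)
    # Residual sum of squares of a simple regression with intercept.
    rss = y_ss[None, :] - beta ** 2 * g_ss[:, None]
    se = np.sqrt(rss / (n - 2) / g_ss[:, None])
    return beta, beta / se


# Simulated data (hypothetical): 2000 individuals, 50 variants, 5 latent dims.
rng = np.random.default_rng(0)
n_samples, n_variants, n_dims = 2000, 50, 5
genotypes = rng.binomial(2, 0.3, size=(n_samples, n_variants)).astype(float)

# Stand-in for encoder output; a real run would use the trained VAE's embeddings.
embeddings = rng.normal(size=(n_samples, n_dims))
embeddings[:, 1] += 0.5 * genotypes[:, 0]     # plant one true association

beta, t_stat = gwas_on_embeddings(genotypes, embeddings)
top_variant, top_dim = np.unravel_index(np.abs(t_stat).argmax(), t_stat.shape)
print("strongest association: variant", top_variant, "coordinate", top_dim)
```

Each column of `t_stat` plays the role of one GWAS, so an embedding with several coordinates yields several sets of summary statistics. How those per-coordinate results are combined into loci or polygenic risk scores is not specified in the abstract and is left out of this sketch.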

Research Areas

Machine intelligence
