Taedong Yun

Authored Publications
Google Publications
Other Publications
    High-dimensional clinical data have become invaluable resources for genetic studies, due to their accessibility in biobank-scale datasets and the development of high-performance modeling techniques, especially deep learning. Recent work has shown that low-dimensional embeddings of these clinical data, learned by variational autoencoders (VAEs), can be used for genome-wide association studies and polygenic risk prediction. In this work, we consider multiple unsupervised learning methods for learning disentangled representations, namely autoencoders, VAEs, beta-VAEs, and FactorVAEs, in the context of genetic studies. Using spirograms from UK Biobank as a running example, we observed improvements in genome-wide significant loci, heritability, and polygenic risk scores for asthma and chronic obstructive pulmonary disease compared to VAEs or (non-variational) autoencoders. We observed that FactorVAEs are consistently effective for genomic discovery and risk prediction across multiple settings of the regularization hyperparameter, while beta-VAEs are much more sensitive to the hyperparameter values.
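As a rough illustration of the regularization hyperparameter discussed above, here is a minimal numpy sketch of the beta-VAE objective. The actual models in this work are convolutional networks trained by gradient descent; the squared-error reconstruction term and the toy inputs below are simplifying assumptions, not the paper's implementation:

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Per-example beta-VAE objective: reconstruction error plus a
    beta-weighted KL divergence from N(mu, diag(exp(log_var))) to N(0, I).
    beta = 1 recovers the standard VAE; beta > 1 pressures the encoder
    toward more factorized, disentangled latents."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
    return recon + beta * kl

# A latent at the prior (mu = 0, log_var = 0) contributes zero KL, so the
# loss is pure reconstruction error regardless of beta.
x = np.array([[1.0, 0.0]])
x_recon = np.array([[0.9, 0.1]])
at_prior = beta_vae_loss(x, x_recon, np.zeros((1, 2)), np.zeros((1, 2)), beta=4.0)
off_prior = beta_vae_loss(x, x_recon, np.ones((1, 2)), np.zeros((1, 2)), beta=4.0)
```

FactorVAE differs in that it penalizes the total correlation of the aggregate latent distribution (estimated with an auxiliary discriminator) rather than scaling the whole KL term, which is one candidate explanation for its lower sensitivity to the hyperparameter.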
    Large-scale population variant data are often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. In this study, we develop population-aware DeepVariant models with a new channel encoding allele frequencies from the 1000 Genomes Project. This model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic ClinVar calls cohort-wide. We assess the use of population-specific or diverse reference panels, finding the greatest accuracy with diverse panels; this suggests that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data, even when that ancestry is also excluded from the reference panel.
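The "channel encoding allele frequencies" idea can be sketched as filling one plane of the pileup image with a scaled frequency value for the candidate allele. The log scaling and the `min_af` floor below are assumptions of this sketch, not DeepVariant's exact encoding; it only illustrates the constant-plane-per-candidate idea:

```python
import numpy as np

def allele_frequency_channel(height, width, af, min_af=1e-4):
    """One extra pileup-image channel filled with a log-scaled population
    allele frequency, floored at min_af so absent or ultra-rare alleles
    map to ~0 while af = 1 maps to 1."""
    af = max(af, min_af)
    scale = -np.log10(min_af)
    value = (np.log10(af) + scale) / scale
    return np.full((height, width), value, dtype=np.float32)

# A common variant (AF = 0.25) fills the channel with a mid-to-high value.
ch = allele_frequency_channel(100, 221, af=0.25)
```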
    Unsupervised representation learning improves genomic discovery for lung function and respiratory disease prediction
    Babak Behsaz
    Zachary Ryan Mccaw
    Davin Hill
    Robert Luben
    Dongbing Lai
    John Bates
    Howard Yang
    Tae-Hwi Schwantes-An
    Yuchen Zhou
    Anthony Khawaja
    Andrew Carroll
    Brian Hobbs
    Michael Cho
    medRxiv (2023)
    High-dimensional clinical data are becoming more accessible in biobank-scale datasets. However, effectively utilizing high-dimensional clinical data for genetic discovery remains challenging. Here we introduce a general deep learning-based framework, REpresentation learning for Genetic discovery on Low-dimensional Embeddings (REGLE), for discovering associations between genetic variants and high-dimensional clinical data. REGLE uses convolutional variational autoencoders to compute a non-linear, low-dimensional, disentangled embedding of the data with highly heritable individual components. REGLE can incorporate expert-defined or clinical features and provides a framework to create accurate disease-specific polygenic risk scores (PRS) in datasets that have minimal expert phenotyping. We apply REGLE to both the respiratory and circulatory systems: spirograms, which measure lung function, and photoplethysmograms (PPG), which measure blood volume changes. Genome-wide association studies on REGLE embeddings identify more genome-wide significant loci than existing methods and replicate known loci for both spirograms and PPG, demonstrating the generality of the framework. Furthermore, these embeddings are associated with overall survival. Finally, we construct a set of PRSs that improve predictive performance of asthma, chronic obstructive pulmonary disease, hypertension, and systolic blood pressure in multiple biobanks. Thus, REGLE embeddings can quantify clinically relevant features that are not currently captured in a standardized or automated way.
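The final step of building a disease-specific PRS from per-coordinate scores can be sketched as learning combination weights on a small labeled set. Ordinary least squares is used below purely for illustration; the combiner actually used in REGLE may differ, and the toy data are an assumption:

```python
import numpy as np

def combine_coordinate_prs(prs_matrix, y):
    """Combine one PRS per embedding coordinate (columns of prs_matrix)
    into a single disease-specific score, with weights fit by ordinary
    least squares on a labeled set."""
    X = np.column_stack([np.ones(len(y)), prs_matrix])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ w

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
prs = rng.normal(size=(500, 2))                       # two per-coordinate PRSs
y = 2.0 * prs[:, 0] - prs[:, 1] + rng.normal(0, 0.5, 500)
combined = combine_coordinate_prs(prs, y)
```

Because the fitted values of a least-squares regression maximize in-sample correlation with the target among linear combinations of the inputs, the combined score is at least as predictive as any single coordinate's PRS on the training set.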
    Genome-wide association studies (GWAS) examine the association between genotype and phenotype while adjusting for a set of covariates. Although the covariates may have non-linear or interactive effects, GWAS often neglect such terms due to the challenge of specifying the model. Here we introduce DeepNull, a method that identifies and adjusts for non-linear and interactive covariate effects using a deep neural network. In analyses of simulated and real data, we demonstrate that DeepNull maintains tight control of the type I error while increasing statistical power by up to 20% in the presence of non-linear and interactive effects. Moreover, in the absence of such effects, DeepNull incurs no loss of power. When applied to 10 phenotypes from the UK Biobank (n = 370K), DeepNull discovered more hits (+6%) and loci (+7%), on average, than conventional association analyses, many of which are biologically plausible or have previously been reported. Finally, DeepNull improves upon linear modeling for phenotypic prediction (+23% on average).
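The two-stage idea behind DeepNull can be sketched as: fit a flexible model of phenotype from covariates, then carry its predictions forward as an extra covariate in the usual linear association model. The paper uses a deep neural network for the first stage; the quadratic least-squares stand-in and toy data below are assumptions of this sketch:

```python
import numpy as np

def deepnull_adjustment(y, covariates, fit):
    """Stage one of a DeepNull-style analysis: return predictions of the
    phenotype from covariates, to be included alongside the original
    covariates in the downstream linear GWAS model."""
    return fit(covariates, y)

def quadratic_fit(x, y):
    # Stand-in for the DNN: least squares on [1, x, x^2] captures a
    # non-linear (quadratic) covariate effect that a linear term misses.
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

rng = np.random.default_rng(0)
age = rng.uniform(40, 70, 1000)
pheno = 0.02 * (age - 55) ** 2 + rng.normal(0, 1, 1000)  # non-linear age effect
f_hat = deepnull_adjustment(pheno, age, quadratic_fit)
residual_var = float(np.var(pheno - f_hat))
```

Absorbing the non-linear covariate effect shrinks the residual phenotypic variance toward the noise floor, which is the mechanism by which the adjustment increases power to detect genetic effects.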
    How DeepConsensus Works
    Aaron Wenger
    Anastasiya Belyaeva
    Andrew Carroll
    Armin Töpfer
    Ashish Teku Vaswani
    Daniel Cook
    Felipe Llinares
    Gunjan Baid
    Howard Yang
    Jean-Philippe Vert
    Kishwar Shafin
    Maria Nattestad
    Waleed Ammar
    William J. Rowell
    (2022)
    These are slides for a public video about DeepConsensus.
    DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer
    Aaron Wenger
    Andrew Walker Carroll
    Armin Töpfer
    Ashish Teku Vaswani
    Daniel Cook
    Felipe Llinares
    Gunjan Baid
    Howard Cheng-Hao Yang
    Jean-Philippe Vert
    Kishwar Shafin
    Maria Nattestad
    Waleed Ammar
    William J. Rowell
    Nature Biotechnology (2022)
    Genomic analysis requires accurate sequencing at sufficient coverage and over difficult genome regions. Through repeated sampling of a circular template, Pacific Biosciences developed long (10-25 kb) reads with high overall accuracy, but lower homopolymer accuracy. Here, we introduce DeepConsensus, a transformer-based approach that leverages a unique alignment loss to correct sequencing errors. DeepConsensus reduces errors in PacBio HiFi reads by 42% compared to the current approach. We show this increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27%, and at Q40 by 90%. With two SMRT cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 Mb to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (QV43 to QV45), and reduce variant calling errors by 24%.
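The Q20/Q30/Q40 yields above are Phred-scaled read qualities: Q = -10 * log10(error rate), so Q20, Q30, and Q40 correspond to 1%, 0.1%, and 0.01% error. A small sketch of computing yield at a quality threshold (the zero-error cap and the toy reads are assumptions, not values from the paper):

```python
import math

def read_phred_q(num_errors, read_length, cap=60.0):
    """Phred-scaled read quality: Q = -10 * log10(error rate). A read
    with no observed errors is assigned an arbitrary cap."""
    if num_errors == 0:
        return cap
    return -10.0 * math.log10(num_errors / read_length)

def yield_at_q(reads, q_threshold):
    """Fraction of reads at or above a quality threshold: the sense in
    which a consensus method increases 'yield at Q20/Q30/Q40'."""
    qs = [read_phred_q(e, n) for e, n in reads]
    return sum(q >= q_threshold for q in qs) / len(reads)

# (errors, length) for five hypothetical 20 kb reads:
reads = [(0, 20000), (1, 20000), (10, 20000), (100, 20000), (1000, 20000)]
```

Because the scale is logarithmic, correcting even a handful of residual errors per read moves it across a whole quality band, which is why modest per-read error reductions translate into large yield gains at the higher thresholds.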
    A population-specific reference panel for improved genotype imputation in African Americans
    Jared O’Connell
    Meghan Moreno
    Helen Li
    Nadia Litterman
    Elizabeth Noblin
    Anjali Shastri
    Elizabeth H. Dorfman
    Suyash Shringarpure
    23andMe Research Team
    Adam Auton
    Andrew Carroll
    Communications Biology (2021)
    There is currently a dearth of accessible whole genome sequencing (WGS) data for individuals residing in the Americas with Sub-Saharan African ancestry. We generated whole genome sequencing data at intermediate (15×) coverage for 2,294 individuals with large amounts of Sub-Saharan African ancestry, predominantly Atlantic African admixed with varying amounts of European and American ancestry. We performed extensive comparisons of variant callers, phasing algorithms, and variant filtration on these data to construct a high-quality imputation panel containing data from 2,269 unrelated individuals. With the exception of the TOPMed imputation server (which notably cannot be downloaded), our panel substantially outperformed other available panels when imputing African American individuals. The raw sequencing data, variant calls, and imputation panel for this cohort are all freely available via dbGaP and should prove an invaluable resource for further study of admixed African genetics.
    Every human inherits one copy of the genome from their mother and another from their father. Parental inheritance helps us understand the transmission of traits and genetic diseases, which often involve de novo variants and rare recessive alleles. Here we present DeepTrio, which learns to analyze child-mother-father trios from the joint sequence information, without explicit encoding of inheritance priors. DeepTrio learns how to weigh sequencing error, mapping error, de novo rates, and genome context directly from the sequence data. DeepTrio has higher accuracy on both Illumina and PacBio HiFi data when compared to DeepVariant. Improvements are especially pronounced at lower coverages (with 20x DeepTrio roughly equivalent to 30x DeepVariant). As DeepTrio learns directly from data, we also demonstrate extensions to exome calling and calling with duos (child and one parent) solely by changing the training data. DeepTrio includes pre-trained models for Illumina WGS, Illumina exome, and PacBio HiFi data.
    Motivation: Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging. Results: We introduce an open-source cohort-calling method that uses the highly accurate caller DeepVariant and the scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimized the method across a range of cohort sizes, sequencing methods, and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices, with reduced cost. We further evaluate our pipeline on the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently generated GATK Best Practices pipeline. Availability and Implementation: We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open source, and the optimized GLnexus setup discovered in this study is integrated into GLnexus public releases v1.2.2 and later. Supplementary data are available at Bioinformatics online.
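One of the callset quality metrics above, Mendelian consistency in father-mother-child trios, is easy to state concretely: a child's diploid genotype must be formable by taking one allele from each parent. A minimal sketch (genotype representation is an assumption of this illustration):

```python
from itertools import product

def mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed by taking one allele
    from each parent. Genotypes are unordered allele pairs, e.g. (0, 1)
    for a heterozygous biallelic site."""
    child = tuple(sorted(child))
    return any(tuple(sorted((m, f))) == child
               for m, f in product(mother, father))

def mendelian_violation_rate(trio_calls):
    """Fraction of (child, mother, father) genotype calls inconsistent
    with inheritance; lower is better for a cohort callset (ignoring the
    small true de novo rate)."""
    bad = sum(not mendelian_consistent(c, m, f) for c, m, f in trio_calls)
    return bad / len(trio_calls)
```

Since true de novo variants are rare, a high violation rate across a trio is dominated by genotyping error, which is what makes this usable as a reference-free accuracy proxy.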
    Logistic regression remains one of the most widely used tools in applied statistics, machine learning, and data science. Practical datasets often have a substantial number of features $d$ relative to the sample size $n$. In these cases, the logistic regression maximum likelihood estimator (MLE) is biased, and its standard large-sample approximation is poor. In this paper, we develop an improved method for debiasing predictions and estimating frequentist uncertainty for such datasets. We build on recent work characterizing the asymptotic statistical behavior of the MLE in the regime where the aspect ratio $d / n$, instead of the number of features $d$, remains fixed as $n$ grows. In principle, this approximation facilitates bias and uncertainty corrections, but in practice, these corrections require an estimate of the signal strength of the predictors. Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude. The bias correction that this facilitates also reduces the variance of the predictions, yielding narrower confidence intervals with higher (valid) coverage of the true underlying probabilities and parameters.
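The bias that SLOE corrects can be seen directly in simulation: with the aspect ratio $d / n$ held fixed, the unpenalized logistic MLE systematically inflates coefficient magnitudes. The sketch below demonstrates the phenomenon only; it does not implement SLOE itself, and the design, signal strength, and Newton solver are assumptions of this illustration:

```python
import numpy as np

def logistic_mle(X, y, n_iter=25):
    """Unpenalized logistic regression fit by Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
n, d, k = 2000, 200, 40              # aspect ratio d/n = 0.1 stays fixed
beta_true = np.zeros(d)
beta_true[:k] = np.sqrt(5.0 / k)     # total signal strength ||beta||^2 = 5
X = rng.normal(size=(n, d))
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

# Average multiplicative inflation of the MLE on the signal coordinates;
# classical theory predicts ~1, fixed-aspect-ratio theory predicts > 1.
inflation = float(np.mean(logistic_mle(X, y)[:k]) / beta_true[0])
```

Correcting this inflation requires knowing the signal strength, which is exactly the quantity SLOE estimates cheaply.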
    Exome and genome sequencing typically use a reference genome to map reads and call variants against. Many (if not a majority) of clinical and research workflows use the prior version of the human reference genome (GRCh37), although an updated and more complete version (GRCh38) was produced in 2013. We present a method that identifies potential artifacts when using one reference relative to a different reference. We simulate error-free reads from GRCh37 and GRCh38, and map and call variants from one read set against the opposite reference. When simulated reads are analyzed relative to their own reference, there are no variants called on GRCh37 and 14 on GRCh38. However, when GRCh38 reads are analyzed on GRCh37, there are 69,720 heterozygous variants called with GATK4-HC. Since the reference is monoploid, a heterozygous call is likely an artifact. Inspection suggests these represent segmental duplications captured in GRCh38, but excluded or collapsed in GRCh37. Some overlap with common resources: 32,688 are present in dbSNP, 28,830 are present in gnomAD (with 25,062 listed as filtered for HWE violation), and 19 HET and 199 HOM variants overlap ClinVar. In v3.3.2 of Genome in a Bottle, 1,123 of these variants overlap the confident regions for HG002, and they are inconsistently labelled as variants or reference. DeepVariant, which is trained on the truth set, seems to have learned about this variability, allowing some measurement of segmental duplication to be made from its output. The reverse comparison, using GRCh37 reads on GRCh38, finds only 30% as many HET variants. This suggests that migrating workflows to GRCh38 eliminates a number of recurrent artifacts, and could provide an additional filtration resource for GRCh37 variant files and annotation resources.
    ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
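Underspecification can be made concrete with a toy example: when two features are perfectly redundant in the training domain, many predictors achieve identical training loss yet diverge once the redundancy breaks at deployment. This construction is an illustration of the definition above, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X_train = np.column_stack([x1, x1])   # x2 duplicates x1 in the training domain
y = x1                                 # target depends only on x1

# Two predictors the pipeline could equally well return: both achieve
# zero training error, so they look interchangeable on held-out data
# drawn from the same domain.
w_a = np.array([1.0, 0.0])             # relies on the causal feature
w_b = np.array([0.0, 1.0])             # relies on the spurious duplicate
train_err_a = float(np.mean((X_train @ w_a - y) ** 2))
train_err_b = float(np.mean((X_train @ w_b - y) ** 2))

# Deployment domain: the correlation between x1 and x2 breaks.
x2_new = rng.normal(size=200)
X_deploy = np.column_stack([x1, x2_new])
deploy_err_a = float(np.mean((X_deploy @ w_a - y) ** 2))
deploy_err_b = float(np.mean((X_deploy @ w_b - y) ** 2))
```

Training performance alone cannot distinguish `w_a` from `w_b`; only a constraint or stress test tied to the deployment domain can.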
    In this work, we investigate variant calling across a pedigree of mosquito (Anopheles gambiae) genomes. Using rates of Mendelian violation, we assess pipelines developed to call variation in humans when applied to mosquito samples. We demonstrate the ability to rapidly retrain DeepVariant without the need for a gold standard set by using sites that are consistent versus inconsistent with Mendelian inheritance. We show that this substantially improves calling accuracy by tuning DeepVariant for the new genome context. Finally, we generate a model for accurate variant calling on low-coverage mosquito genomes and a corresponding variant callset.