Ryan Poplin

Authored Publications
    Preview abstract Physicians are increasingly using clinical sequencing tests to establish diagnoses of patients who might have genetic disorders, which means that accuracy of sequencing and interpretation are important elements in ensuring the benefits of genetic testing. In the past, clinical sequencing tests were designed to detect specific prespecified or unknown variants that were in limited regions of an individual’s genome. The raw data for each detected variant was then manually reviewed for errors in sequencing and for its potential clinical importance. Newer technology allows for assessment of exomes or entire genomes and can identify millions of genetic variants in each sequenced individual. The shift from limited targeted sequencing to genome sequencing requires automated algorithms to parse through raw data to help distinguish true variants from those caused by systematic errors. Errors can result from incorrectly read bases in particular DNA molecule regions that are difficult to sequence and from mapping short sequences incorrectly to the human reference genome. New developments in sequencing and analysis, as well as standard quality measures, are critical to ensure the accuracy of sequencing results intended for medical use. View details
    Preview abstract Summary Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome. Availability and implementation GenomeWarp is written in Java. All source code and the user manual are freely available at https://github.com/verilylifesciences/genomewarp. View details
    Preview abstract Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challenging application is bacteria identification based on genomic sequences, which holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on OOD genomic sequences from new bacteria that were not present in the training data. We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. We investigate deep generative model based approaches for OOD detection and observe that the likelihood score is heavily affected by population level background statistics. We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. We demonstrate the generality of the proposed method by showing that it significantly improves OOD detection when applied to deep generative models of images. View details
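The likelihood ratio idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: simple per-nucleotide frequency models stand in for the deep generative models, the function names (`fit_unigram`, `log_likelihood`, `likelihood_ratio_score`) are hypothetical, and the way the background model is fit here (on a broader pool of sequences) is an assumption, since the abstract does not say how it is obtained.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def fit_unigram(sequences, pseudocount=1.0):
    """Fit an i.i.d. nucleotide frequency model with additive smoothing.

    A toy stand-in for a deep generative model of sequences.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return {b: (counts[b] + pseudocount) / total for b in ALPHABET}


def log_likelihood(model, seq):
    """Log-probability of seq under an i.i.d. nucleotide model."""
    return sum(math.log(model[b]) for b in seq)


def likelihood_ratio_score(foreground, background, seq):
    """log p_foreground(seq) - log p_background(seq).

    Subtracting the background term removes the part of the likelihood
    explained by population-level statistics; higher scores suggest the
    sequence looks more in-distribution.
    """
    return log_likelihood(foreground, seq) - log_likelihood(background, seq)


# Hypothetical toy data: GC-rich "training" bacteria vs. a broader pool.
foreground = fit_unigram(["GGGCGGGC", "GCGGGCGG"])
background = fit_unigram(["ACGTACGT", "TTGACCAG", "GGGCGGGC"])

in_dist_score = likelihood_ratio_score(foreground, background, "GGGC")
ood_score = likelihood_ratio_score(foreground, background, "ATAT")
```

With these toy inputs the GC-rich query scores higher than the AT-rich one, so thresholding the score would flag the latter as out-of-distribution.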
    Predicting Cardiovascular Risk Factors in Retinal Fundus Photographs using Deep Learning
    Avinash Vaidyanathan Varadarajan
    Katy Blumer
    Mike McConnell
    Lily Peng
    Nature Biomedical Engineering (2018)
    Preview abstract Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction. View details
    Preview abstract Objective: Refractive error, one of the leading causes of visual impairment, can be corrected by simple interventions like prescribing eyeglasses, which often starts with autorefraction to estimate the refractive error. In this study, using deep learning, we trained a network to estimate refractive error from fundus photos only. Design: Retrospective analysis. Subjects, Participants, and/or Controls: Retinal fundus images from participants in the UK Biobank cohort (45-degree field-of-view images) and the AREDS clinical trial (30-degree field-of-view images). Methods, Intervention, or Testing: Refractive error was measured by autorefraction in the UK Biobank dataset and by subjective refraction in the AREDS dataset. We trained a deep learning algorithm to predict refractive error from the fundus photographs and compared the algorithm's predictions with the documented refractive error measurements. Our model used attention to identify features that are predictive of refractive error. Main Outcome Measures: Mean absolute error (MAE) of the algorithm's prediction compared to the refractive error obtained in the AREDS and UK Biobank. Results: The resulting algorithm had a mean absolute error (MAE) of 0.56 diopters (95% CI: 0.55-0.56) for estimating spherical equivalent on the UK Biobank dataset and 0.91 diopters (95% CI: 0.89-0.92) on the AREDS dataset. The baseline expected MAE (obtained by simply predicting the mean of this population) is 1.81 diopters (95% CI: 1.79-1.84) for UK Biobank and 1.63 diopters (95% CI: 1.60-1.67) for AREDS. Attention maps suggest that the foveal region is one of the most important areas used by the algorithm to make this prediction, though other regions also contribute. Conclusions: The ability to estimate refractive error with high accuracy from retinal fundus photos was not previously known and demonstrates that deep learning can be applied to make novel predictions from medical images. In addition, given that several groups have recently shown that it is feasible to obtain retinal fundus photos using mobile phones and inexpensive attachments, this work may be particularly relevant in regions of the world where autorefractors may not be readily available. View details
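The comparison between the model's MAE and the "predict the population mean" baseline reported in this abstract can be illustrated as follows. This is a sketch only: the function names are hypothetical and the diopter values are invented for illustration, not taken from the UK Biobank or AREDS data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |measured - predicted| over all samples."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def baseline_mae(y_true):
    """MAE obtained by predicting the population mean for every sample."""
    mean = sum(y_true) / len(y_true)
    return mean_absolute_error(y_true, [mean] * len(y_true))


# Hypothetical spherical-equivalent values in diopters (illustration only).
measured = [-2.0, 0.0, 1.0, -3.0]
predicted = [-1.5, 0.5, 0.5, -2.0]

model_error = mean_absolute_error(measured, predicted)
baseline_error = baseline_mae(measured)
```

A model is only informative if its MAE falls clearly below the baseline, which is why the abstract reports both figures side by side.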
    Preview abstract Sequence-to-sequence alignment is a widely used analysis method in bioinformatics. One common use of sequence alignment is to infer information about an unknown query sequence from the annotations of similar sequences in a database, such as predicting the function of a novel protein sequence by aligning to a database of protein families or predicting the presence/absence of species in a metagenomics sample by aligning reads to a database of reference genomes. In this work we describe a deep learning approach to solve such problems in a single step by training a deep neural network (DNN) to predict the database-derived labels directly from the query sequence. We demonstrate the value of this DNN approach on a hard problem of practical importance: determining the species of origin of next-generation sequencing reads from 16S ribosomal DNA. In particular, we show that when trained on 16S sequences from more than 13,000 distinct species, our DNN can predict the species of origin of individual reads more accurately than existing machine learning baselines and alignment-based methods like BWA or BLAST, achieving absolute performance within 2.0% of perfect memorization of the training inputs. Moreover, the DNN remains accurate and outperforms read alignment approaches when the query sequences are especially noisy or ambiguous. Finally, these DNN models can be used to assess metagenomic community composition on a variety of experimental 16S read datasets. Our results are a first step towards our long-term goal of developing a general-purpose deep learning model that can learn to predict any type of label from short biological sequences. View details
    In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images
    Eric Christiansen
    Mike Ando
    Ashkan Javaherian
    Gaia Skibinski
    Scott Lipnick
    Elliot Mount
    Alison O'Neil
    Kevan Shah
    Alicia K. Lee
    Piyush Goyal
    Liam Fedus
    Andre Esteva
    Lee Rubin
    Steven Finkbeiner
    Cell (2018)
    Preview abstract Imaging is a central method in life sciences, and the drive to extract information from microscopy approaches has led to methods to fluorescently label specific cellular constituents. However, the specificity of fluorescent labels varies, labeling can confound biological measurements, and spectral overlap limits the number of labels to a few that can be resolved simultaneously. Here, we developed a deep learning computational approach called “in silico labeling (ISL)” that reliably infers information from unlabeled biological samples that would normally require invasive labeling. ISL predicts different labels in multiple cell types from independent laboratories. It makes cell type predictions by integrating in silico labels, and is not limited by spectral overlap. The network learned generalized features, enabling it to solve new problems with small training datasets. Thus, ISL provides biological insights from images of unlabeled samples for negligible additional cost that would be undesirable or impossible to measure directly. View details
    A universal SNP and small-indel variant caller using deep neural networks
    Scott Schwartz
    Dan Newburger
    Jojo Dijamco
    Nam Nguyen
    Pegah T. Afshar
    Sam S. Gross
    Lizzie Dorfman
    Mark A. DePristo
    Nature Biotechnology (2018)
    Preview abstract Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling. View details
    Learning to count mosquitoes for the Sterile Insect Technique
    Yaniv Ovadia
    Yoni Halpern
    Dilip Krishnan
    Daniel Newburger
    Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017)
    Preview abstract Mosquito-borne illnesses such as dengue, chikungunya, and Zika are major global health problems, which are not yet addressable with vaccines and must be countered by reducing mosquito populations. The Sterile Insect Technique (SIT) is a promising alternative to pesticides; however, effective SIT relies on minimal releases of female insects. This paper describes a multi-objective convolutional neural net to significantly streamline the process of counting male and female mosquitoes released from a SIT factory and provides a statistical basis for verifying strict contamination rate limits from these counts despite measurement noise. These results are a promising indication that such methods may dramatically reduce the cost of effective SIT methods in practice. View details
    CrowdVariant: a crowdsourcing approach to classify copy number variants
    Peyton Greenside
    Justin Zook
    Marc Salit
    Madeleine Cule
    Mark DePristo
    BioRxiv (2016)
    Preview abstract Copy number variants (CNVs) are an important type of genetic variation and play a causal role in many diseases. However, they are also notoriously difficult to identify accurately from next-generation sequencing (NGS) data. For larger CNVs, genotyping arrays provide reasonable benchmark data, but NGS allows us to assay a far larger number of small (< 10kbp) CNVs that are poorly captured by array-based methods. The lack of high quality benchmark callsets of small-scale CNVs has limited our ability to assess and improve CNV calling algorithms for NGS data. To address this issue we developed a crowdsourcing framework, called CrowdVariant, that leverages Google's high-throughput crowdsourcing platform to create a high confidence set of copy number variants for NA24385 (NIST HG002/RM 8391), an Ashkenazim reference sample developed in partnership with the Genome In A Bottle Consortium. In a pilot study we show that crowdsourced classifications, even from non-experts, can be used to accurately assign copy number status to putative CNV calls and thereby identify a high-quality subset of these calls. We then scale our framework genome-wide to identify 1,781 high confidence CNVs, which multiple lines of evidence suggest are a substantial improvement over existing CNV callsets, and are likely to prove useful in benchmarking and improving CNV calling algorithms. Our crowdsourcing methodology may be a useful guide for other genomics applications. View details