Ryan Poplin
Authored Publications
Physicians are increasingly using clinical sequencing tests to establish diagnoses of patients who might have genetic disorders, which means that accuracy of sequencing and interpretation are important elements in ensuring the benefits of genetic testing. In the past, clinical sequencing tests were designed to detect specific prespecified or unknown variants that were in limited regions of an individual’s genome. The raw data for each detected variant was then manually reviewed for errors in sequencing and for its potential clinical importance. Newer technology allows for assessment of exomes or entire genomes and can identify millions of genetic variants in each sequenced individual. The shift from limited targeted sequencing to genome sequencing requires automated algorithms to parse through raw data to help distinguish true variants from those caused by systematic errors. Errors can result from incorrectly read bases in particular DNA molecule regions that are difficult to sequence and from mapping short sequences incorrectly to the human reference genome. New developments in sequencing and analysis, as well as standard quality measures, are critical to ensure the accuracy of sequencing results intended for medical use.
Summary
Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome.
Availability and implementation
GenomeWarp is written in Java. All source code and the user manual are freely available at https://github.com/verilylifesciences/genomewarp.
Likelihood Ratios for Out-of-Distribution Detection
Peter J. Liu
Mark DePristo
Josh Dillon
arXiv preprint arXiv:1906.02845 (2019)
Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challenging application is bacteria identification based on genomic sequences, which holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on OOD genomic sequences from new bacteria that were not present in the training data. We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. We investigate deep generative model based approaches for OOD detection and observe that the likelihood score is heavily affected by population level background statistics. We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. We demonstrate the generality of the proposed method by showing that it significantly improves OOD detection when applied to deep generative models of images.
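The likelihood ratio idea in this abstract can be illustrated with a minimal sketch. It scores a sequence by the log-likelihood under a model of the in-distribution data minus the log-likelihood under a background model, so that statistics shared by both models cancel. The position-independent base-frequency models below are a stand-in chosen for brevity, not the deep generative models used in the paper, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def fit_unigram(seqs, pseudocount=1.0):
    """Fit a position-independent base-frequency model (a toy stand-in
    for a deep generative model) with additive smoothing."""
    counts = Counter()
    for s in seqs:
        counts.update(s)
    total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return {b: (counts[b] + pseudocount) / total for b in ALPHABET}


def log_likelihood(model, seq):
    """Log-probability of seq under an independent-per-base model."""
    return sum(math.log(model[b]) for b in seq)


def likelihood_ratio(full_model, background_model, seq):
    """Log-likelihood ratio: higher values suggest in-distribution input."""
    return log_likelihood(full_model, seq) - log_likelihood(background_model, seq)


# Toy data (made up for illustration): GC-rich "in-distribution" reads
# and a uniform-composition background sample.
full = fit_unigram(["GGCCGCGC", "GCGGCC"])
background = fit_unigram(["ACGTACGT"])

in_dist_score = likelihood_ratio(full, background, "GCGC")
ood_score = likelihood_ratio(full, background, "ATAT")
```

With these toy inputs the GC-rich query scores above zero and the AT-rich query scores below zero. In practice a threshold on the score would be chosen on held-out data.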
Predicting Cardiovascular Risk Factors in Retinal Fundus Photographs using Deep Learning
Avinash Vaidyanathan Varadarajan
Katy Blumer
Mike McConnell
Lily Peng
Nature Biomedical Engineering (2018)
Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction.
Deep learning for predicting refractive error from retinal fundus images
Avinash Vaidyanathan Varadarajan
Katy Blumer
Reena Chopra
Pearse Keane
Lily Peng
IOVS (2018)
Objective: Refractive error, one of the leading causes of visual impairment, can be corrected by simple interventions like prescribing eyeglasses, which often starts with autorefraction to estimate the refractive error. In this study, using deep learning, we trained a network to estimate refractive error from fundus photos only.
Design: Retrospective analysis.
Subjects, Participants, and/or Controls: Retinal fundus images from participants in the UK Biobank cohort (45-degree field-of-view images) and the AREDS clinical trial (30-degree field-of-view images).
Methods, Intervention, or Testing: Refractive error was measured by autorefraction in the UK Biobank dataset and by subjective refraction in the AREDS dataset. We trained a deep learning algorithm to predict refractive error from the fundus photographs and compared its predictions with the documented refractive error measurements. Our model used attention to identify features that are predictive of refractive error.
Main Outcome Measures: Mean absolute error (MAE) of the algorithm’s prediction compared to the refractive error obtained in the AREDS and UK Biobank.
Results: The resulting algorithm had a mean absolute error (MAE) of 0.56 diopters (95% CI: 0.55-0.56) for estimating spherical equivalent on the UK Biobank dataset and 0.91 diopters (95% CI: 0.89-0.92) for the AREDS dataset. The baseline expected MAE (obtained by simply predicting the mean of this population) is 1.81 diopters (95% CI: 1.79-1.84) for UK Biobank and 1.63 diopters (95% CI: 1.60-1.67) for AREDS. Attention maps suggest that the foveal region is one of the most important areas used by the algorithm to make this prediction, though other regions also contribute to the prediction.
Conclusions: The ability to estimate refractive error with high accuracy from retinal fundus photos has not been previously known and demonstrates that deep learning can be applied to make novel predictions from medical images. In addition, given that several groups have recently shown that it is feasible to obtain retinal fundus photos using mobile phones and inexpensive attachments, this work may be particularly relevant in regions of the world where autorefractors may not be readily available.
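The comparison reported in the Results, a model's MAE against the MAE obtained by always predicting the population mean, can be sketched as follows. The refraction values are invented for illustration and are not taken from UK Biobank or AREDS; the function names are likewise hypothetical.

```python
from statistics import fmean


def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over paired measurements."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def baseline_mae(y_true):
    """MAE of a trivial predictor that always outputs the population mean."""
    mu = fmean(y_true)
    return fmean([abs(t - mu) for t in y_true])


# Made-up spherical-equivalent values in diopters (illustration only).
measured = [-2.0, 0.0, 1.0, 3.0]
predicted = [-1.5, 0.5, 0.5, 2.0]

model_error = mean_absolute_error(measured, predicted)
trivial_error = baseline_mae(measured)
```

A model is informative on this measure only to the extent that its MAE falls below the mean-predictor baseline computed on the same population.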
A deep learning approach to pattern recognition for short DNA sequences
Akosua Busia
Clara Fannjiang
Lizzie Dorfman
Mark DePristo
bioRxiv (2018)
Sequence-to-sequence alignment is a widely used analysis method in bioinformatics. One common use of sequence alignment is to infer information about an unknown query sequence from the annotations of similar sequences in a database, such as predicting the function of a novel protein sequence by aligning to a database of protein families or predicting the presence/absence of species in a metagenomics sample by aligning reads to a database of reference genomes. In this work we describe a deep learning approach to solve such problems in a single step by training a deep neural network (DNN) to predict the database-derived labels directly from the query sequence. We demonstrate the value of this DNN approach on a hard problem of practical importance: determining the species of origin of next-generation sequencing reads from 16S ribosomal DNA. In particular, we show that when trained on 16S sequences from more than 13,000 distinct species, our DNN can predict the species of origin of individual reads more accurately than existing machine learning baselines and alignment-based methods like BWA or BLAST, achieving absolute performance within 2.0% of perfect memorization of the training inputs. Moreover, the DNN remains accurate and outperforms read alignment approaches when the query sequences are especially noisy or ambiguous. Finally, these DNN models can be used to assess metagenomic community composition on a variety of experimental 16S read datasets. Our results are a first step towards our long-term goal of developing a general-purpose deep learning model that can learn to predict any type of label from short biological sequences.
In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images
Eric Christiansen
Mike Ando
Ashkan Javaherian
Gaia Skibinski
Scott Lipnick
Elliot Mount
Alison O'Neil
Kevan Shah
Alicia K. Lee
Piyush Goyal
Liam Fedus
Andre Esteva
Lee Rubin
Steven Finkbeiner
Cell (2018)
Imaging is a central method in life sciences, and the drive to extract information from microscopy approaches has led to methods to fluorescently label specific cellular constituents. However, the specificity of fluorescent labels varies, labeling can confound biological measurements, and spectral overlap limits the number of labels to a few that can be resolved simultaneously. Here, we developed a deep learning computational approach called “in silico labeling (ISL)” that reliably infers information from unlabeled biological samples that would normally require invasive labeling. ISL predicts different labels in multiple cell types from independent laboratories. It makes cell type predictions by integrating in silico labels, and is not limited by spectral overlap. The network learned generalized features, enabling it to solve new problems with small training datasets. Thus, ISL provides biological insights from images of unlabeled samples for negligible additional cost that would be undesirable or impossible to measure directly.
A universal SNP and small-indel variant caller using deep neural networks
Scott Schwartz
Dan Newburger
Jojo Dijamco
Nam Nguyen
Pegah T. Afshar
Sam S. Gross
Lizzie Dorfman
Mark A. DePristo
Nature Biotechnology (2018)
Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.
Learning to count mosquitoes for the Sterile Insect Technique
Yaniv Ovadia
Yoni Halpern
Dilip Krishnan
Daniel Newburger
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017)
Mosquito-borne illnesses such as dengue, chikungunya, and Zika are major global health problems, which are not yet addressable with vaccines and must be countered by reducing mosquito populations. The Sterile Insect Technique (SIT) is a promising alternative to pesticides; however, effective SIT relies on minimal releases of female insects. This paper describes a multi-objective convolutional neural net to significantly streamline the process of counting male and female mosquitoes released from a SIT factory and provides a statistical basis for verifying strict contamination rate limits from these counts despite measurement noise. These results are a promising indication that such methods may dramatically reduce the cost of effective SIT methods in practice.
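One simple way to check a contamination rate limit from counts is an exact one-sided binomial upper confidence bound on the female fraction. The sketch below is a generic textbook construction offered for illustration, not the paper's method. It assumes the counts are exact, so it does not model the measurement noise the abstract mentions, and the function names and example numbers are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); intended for small k."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def contamination_upper_bound(females, total, alpha=0.05):
    """Exact one-sided upper bound on the female rate: the rate p at which
    observing `females` or fewer in `total` has probability alpha.
    Found by bisection, since the CDF decreases as p grows."""
    if females >= total:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_cdf(females, total, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical sample: 0 females counted among 100 insects.
bound = contamination_upper_bound(0, 100)
```

For zero observed females the bound has the closed form 1 - alpha^(1/n), which is about 3% for a sample of 100 at alpha = 0.05. A release would pass a stated limit only if the bound falls below that limit.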
Copy number variants (CNVs) are an important type of genetic variation and play a causal role in many diseases. However, they are also notoriously difficult to identify accurately from next-generation sequencing (NGS) data. For larger CNVs, genotyping arrays provide reasonable benchmark data, but NGS allows us to assay a far larger number of small (<10 kbp) CNVs that are poorly captured by array-based methods. The lack of high-quality benchmark callsets of small-scale CNVs has limited our ability to assess and improve CNV calling algorithms for NGS data. To address this issue we developed a crowdsourcing framework, called CrowdVariant, that leverages Google's high-throughput crowdsourcing platform to create a high-confidence set of copy number variants for NA24385 (NIST HG002/RM 8391), an Ashkenazim reference sample developed in partnership with the Genome in a Bottle Consortium. In a pilot study we show that crowdsourced classifications, even from non-experts, can be used to accurately assign copy number status to putative CNV calls and thereby identify a high-quality subset of these calls. We then scale our framework genome-wide to identify 1,781 high-confidence CNVs, which multiple lines of evidence suggest are a substantial improvement over existing CNV callsets, and are likely to prove useful in benchmarking and improving CNV calling algorithms. Our crowdsourcing methodology may be a useful guide for other genomics applications.