Maxwell Bileschi
Authored Publications
Critiquing Protein Family Classification Models Using Sufficient Input Subsets
Brandon Michael Carter
Jamie Alexander Smith
Theo Sanderson
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2019) (to appear)
In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset introduced, e.g., as a function of when and how the data were collected. In response, we propose a set of methods for critiquing deep learning models, and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the recently introduced sufficient input subsets technique (SIS), which we use to identify the subset of locations (SIS) in each protein sequence that is sufficient for classification. Our suite of tools analyzes these SIS to shed light on the decision-making criteria employed by models trained on this task. These tools expose that while these deep models may perform classification for biologically relevant reasons, their behavior varies considerably across choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential. We encourage further application of our techniques for interrogating machine learning models trained on other scientifically relevant tasks.
Deep Learning Classifies the Protein Universe
Theo Sanderson
Brandon Carter
Mark DePristo
Nature Biotechnology (2019)
Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate $\sim1/3$ of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17929 families of the Pfam database. Using the Pfam seed sequences we establish a rigorous benchmark assessment and find that a dilated convolutional model reduces the error of state-of-the-art BLASTp and pHMM models by a factor of nine. With 80\% of the full Pfam database we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space far from existing families, allowing sequences from novel families to be classified. We anticipate that deep learning models will be a core component of future general-purpose protein function prediction tools.
Sequential regulatory activity prediction across chromosomes with convolutional neural networks
David Kelley
Yakir Reshef
Genome Research (2018)
Functional genomics approaches to better model genotype-phenotype relationships have important applications toward understanding genomic function and improving human health. In particular, thousands of noncoding loci associated with diseases and physical traits lack mechanistic explanation. Here, we develop the first machine-learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. Using convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable GWAS loci fine mapping.