Maxwell Bileschi

Authored Publications
    Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate roughly one third of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17929 families of the Pfam database. Using the Pfam seed sequences, we establish a rigorous benchmark assessment and find that a dilated convolutional model reduces the error of state-of-the-art BLASTp and pHMM models by a factor of nine. With 80% of the full Pfam database, we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space far from existing families, allowing sequences from novel families to be classified. We anticipate that deep learning models will be a core component of future general-purpose protein function prediction tools.
    Critiquing Protein Family Classification Models Using Sufficient Input Subsets
    Brandon Michael Carter
    Jamie Alexander Smith
    Theo Sanderson
    ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2019) (to appear)
    In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset introduced, e.g., as a function of when and how the data were collected. In response, we propose a set of methods for critiquing deep learning models, and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the recently introduced sufficient input subsets technique (SIS), which we use to identify the subset of locations (SIS) in each protein sequence that is sufficient for classification. Our suite of tools analyzes these SIS to shed light on the decision-making criteria employed by models trained on this task. These tools expose that while these deep models may perform classification for biologically relevant reasons, their behavior varies considerably across choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential. We encourage further application of our techniques for interrogating machine learning models trained on other scientifically relevant tasks.
    Functional genomics approaches to better model genotype-phenotype relationships have important applications toward understanding genomic function and improving human health. In particular, thousands of noncoding loci associated with diseases and physical traits lack mechanistic explanation. Here, we develop the first machine-learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. Using convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable GWAS loci fine mapping.