Deep Learning Classifies the Protein Universe

Max Bileschi; David Belanger; Drew Bryant; Theo Sanderson; Brandon Carter; D. Sculley; Mark DePristo; Lucy Colwell

Nature Biotechnology (2019)

Abstract

Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate roughly one third of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17,929 families of the Pfam database. Using the Pfam seed sequences, we establish a rigorous benchmark assessment and find that a dilated convolutional model reduces the error of state-of-the-art BLASTp and pHMM models by a factor of nine. With 80% of the full Pfam database, we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space far from existing families, allowing sequences from novel families to be classified. We anticipate that deep learning models will be a core component of future general-purpose protein function prediction tools.
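
The abstract refers to a dilated convolutional model applied to unaligned amino acid sequences. As a rough illustration of that general idea, the following is a minimal sketch in plain Python. It is not the authors' architecture or code: the one-hot encoding, the single-filter convolution, and the helper names (`one_hot`, `dilated_conv1d`, `receptive_field`) are all assumptions introduced here for illustration.

```python
# Illustrative sketch only: not the authors' model or code.
# All names below are hypothetical helpers introduced for this example.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Encode a sequence as a list of 20-dimensional one-hot vectors."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        vector = [0.0] * len(AMINO_ACIDS)
        if residue in index:  # unknown residues stay all-zero
            vector[index[residue]] = 1.0
        encoded.append(vector)
    return encoded


def dilated_conv1d(inputs, kernel, dilation):
    """Apply one zero-padded dilated 1D convolution filter.

    inputs: list of L vectors, each with C channels.
    kernel: list of K weight vectors, each with C channels (K odd).
    Returns a list of L floats (a single output channel).
    """
    length = len(inputs)
    half = len(kernel) // 2
    outputs = []
    for position in range(length):
        total = 0.0
        for k, weights in enumerate(kernel):
            # Taps are spaced `dilation` positions apart around the center.
            source = position + (k - half) * dilation
            if 0 <= source < length:  # zero padding outside the sequence
                total += sum(w * x for w, x in zip(weights, inputs[source]))
        outputs.append(total)
    return outputs


def receptive_field(kernel_size, dilations):
    """Input positions visible to one output after stacking the layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

Under these assumptions, stacking layers with growing dilation widens the span of residues that influences each output position without adding weights per layer: `receptive_field(3, [1, 2, 4, 8])` covers 31 positions, whereas four undilated layers of the same kernel size cover 9.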
