A deep learning approach to pattern recognition for short DNA sequences

Akosua Busia; George Dahl; Clara Fannjiang; David Alexander; Lizzie Dorfman; Ryan Poplin; Cory McLean; Pi-Chuan Chang; Mark DePristo

A deep learning approach to pattern recognition for short DNA sequences

Akosua Busia

George Dahl

Clara Fannjiang

David Alexander

Lizzie Dorfman

Ryan Poplin

Cory McLean

Pi-Chuan Chang

Mark DePristo

bioArxiv (2018)

Download Google Scholar

Abstract

Sequence-to-sequence alignment is a widely-used analysis method in bioinformatics. One common use of sequence alignment is to infer information about an unknown query sequence from the annotations of similar sequences in a database, such as predicting the function of a novel protein sequence by aligning to a database of protein families or predicting the presence/absence of species in a metagenomics sample by aligning reads to a database of reference genomes. In this work we describe a deep learning approach to solve such problems in a single step by training a deep neural network (DNN) to predict the database-derived labels directly from the query sequence. We demonstrate the value of this DNN approach on a hard problem of practical importance: determining the species of origin of next-generation sequencing reads from 16s ribosomal DNA. In particular, we show that when trained on 16s sequences from more than 13,000 distinct species, our DNN can predict the species of origin of individual reads more accurately than existing machine learning baselines and alignment-based methods like BWA or BLAST, achieving absolute performance within 2.0% of perfect memorization of the training inputs. Moreover, the DNN remains accurate and outperforms read alignment approaches when the query sequences are especially noisy or ambiguous. Finally, these DNN models can be used to assess metagenomic community composition on a variety of experimental 16s read datasets. Our results are a first step towards our long-term goal of developing a general-purpose deep learning model that can learn to predict any type of label from short biological sequences.

Research Areas

Machine intelligence

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

A deep learning approach to pattern recognition for short DNA sequences

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs