Google Research

Continuous embeddings of DNA sequencing reads, and application to metagenomics

Journal of Computational Biology, vol. 26(6) (2019), pp. 509-518

Abstract

We propose a new model for fast classification of DNA sequences output by next-generation sequencing machines. The model, which we call fastDNA, embeds DNA sequences in a vector space by learning continuous low-dimensional representations of the k-mers they contain. We show on metagenomics benchmarks that it outperforms state-of-the-art methods in terms of accuracy and scalability.
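
To make the embedding idea concrete, here is a minimal sketch of how a read could be mapped to a vector through its k-mers. It is not the fastDNA implementation. The abstract does not say how k-mer vectors are combined, so the sketch assumes a simple mean, and the names `kmers`, `init_embeddings`, and `embed_read`, along with the values of `K` and `DIM`, are hypothetical. The table is randomly initialised here; in a trained model these vectors would be learned jointly with a classifier, which the sketch omits.

```python
import random
from itertools import product

K = 3    # k-mer length (hypothetical, kept small for illustration)
DIM = 8  # embedding dimension (hypothetical)


def kmers(read, k=K):
    """Return the overlapping k-mers of a read, skipping any with non-ACGT characters."""
    read = read.upper()
    out = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            out.append(kmer)
    return out


def init_embeddings(k=K, dim=DIM, seed=0):
    """Build a lookup table with one small random vector per possible k-mer."""
    rng = random.Random(seed)
    return {
        "".join(p): [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        for p in product("ACGT", repeat=k)
    }


def embed_read(read, table, k=K, dim=DIM):
    """Embed a read as the mean of its k-mer vectors (assumed combination rule)."""
    vecs = [table[km] for km in kmers(read, k)]
    if not vecs:
        # Read too short or entirely ambiguous: fall back to the zero vector.
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


table = init_embeddings()
vector = embed_read("ACGTACGTTAGC", table)
```

The resulting fixed-length `vector` could then be fed to any standard classifier, for example a linear layer followed by a softmax over taxonomic labels.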
