Continuous embeddings of DNA sequencing reads, and application to metagenomics

Romain Menegaux
Jean-Philippe Vert
Journal of Computational Biology, 26(6) (2019), pp. 509-518

Abstract

We propose a new model for fast classification of DNA sequences output by next-generation sequencing machines. The model, which we call fastDNA, embeds DNA sequences in a vector space by learning continuous low-dimensional representations of the k-mers they contain. We show on metagenomics benchmarks that it outperforms state-of-the-art methods in terms of accuracy and scalability.
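
To make the idea of embedding a read through its k-mers concrete, the following is a minimal sketch and not the fastDNA implementation. It assumes that a read's vector is the average of the vectors of its overlapping k-mers and that a linear layer scores each class; the abstract specifies neither choice. The names `kmer_index`, `kmers`, `embed_read`, and `classify` are hypothetical, and the embedding table and class weights are random placeholders standing in for learned parameters.

```python
import random

ALPHABET = "ACGT"


def kmer_index(kmer):
    """Map a k-mer over ACGT to an integer in [0, 4**k) (base-4 encoding)."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + ALPHABET.index(base)
    return idx


def kmers(read, k):
    """Yield the overlapping k-mers of a read, skipping any with non-ACGT symbols."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if all(base in ALPHABET for base in kmer):
            yield kmer


def embed_read(read, k, table):
    """Embed a read as the mean of its k-mer vectors (assumed pooling rule)."""
    dim = len(table[0])
    total = [0.0] * dim
    count = 0
    for kmer in kmers(read, k):
        vec = table[kmer_index(kmer)]
        for j in range(dim):
            total[j] += vec[j]
        count += 1
    if count == 0:
        return total  # read shorter than k, or no valid k-mer: zero vector
    return [t / count for t in total]


def classify(embedding, weights):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy setup: random placeholders for what would be learned parameters.
k, dim, n_classes = 3, 4, 2
rng = random.Random(0)
table = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4 ** k)]
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_classes)]

read = "ACGTACGTTG"
embedding = embed_read(read, k, table)
label = classify(embedding, weights)
```

In this sketch the embedding table has 4**k rows of `dim` values each, so the cost of embedding a read grows linearly with its length and does not depend on the number of reference sequences.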