- Romain Menegaux
- Jean-Philippe Vert
Fast mapping of sequencing reads to taxonomic clades is a crucial step in metagenomics, which however raises computational challenges as the numbers of reads and of taxonomic clades increases. Besides alignment-based methods, which are accurate but computational costly, faster compositional approaches have recently been proposed to predict the taxonomic clade of a read based on the set of $k$-mers it contains. Machine learning-based compositional approaches, in particular, have recently reached accuracies similar to alignment-based models, while being considerably faster. It has been observed that the accuracy of these models increases with the length $k$ of the $k$-mers they use, however existing methods are limited to handle $k$-mers of lengths up to $k=12$ or $13$ because of their large memory footprint needed to store the model coefficients for each possible $k$-mer. In order to explore the performance of machine learning-based compositional approaches for longer $k$-mers than currently possible, we propose to reduce the memory footprint of these methods by binning together $k$-mers that appear together in the sequencing reads used to train the models. We achieve this binning by learning a vector embedding for the vertices of a compacted de Bruijn graph, allowing us to embed any DNA sequence in a low-dimensional vector space where a machine learning system can be trained. The resulting method, which we call Brume, allows us to train compositional machine learning-based models with $k$-mers of length up to $k=31$. We show on two metagenomics benchmark that Brume reaches better performance than previously achieved, thanks to the use of longer $k$-mers.