DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer

Gunjan Baid
Daniel Cook
Kishwar Shafin
Felipe Llinares
Anastasiya Belyaeva
Armin Töpfer
Aaron Wenger
William J. Rowell
Howard Cheng-Hao Yang
Waleed Ammar
Jean-Philippe Vert
Ashish Teku Vaswani
Maria Nattestad
Andrew Walker Carroll
Nature Biotechnology (2022)

Abstract

Genomic analysis requires accurate sequencing at sufficient coverage and across difficult genome regions. Through repeated sampling of a circular template, Pacific Biosciences developed long (10-25 kb) reads with high overall accuracy but lower homopolymer accuracy. Here, we introduce DeepConsensus, a transformer-based approach that leverages a unique alignment loss to correct sequencing errors. DeepConsensus reduces errors in PacBio HiFi reads by 42% compared to the current approach. We show that this increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27%, and at Q40 by 90%. With two SMRT cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 Mb to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (QV43 to QV45), and reduce variant calling errors by 24%.
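
The Q20, Q30, and Q40 thresholds above are on the Phred scale, where a quality Q corresponds to an error probability of 10^(-Q/10), so Q20 is about 1 error in 100 bases and Q40 about 1 in 10,000. The following is a minimal sketch of that conversion and of counting read yield at a threshold. The function names and the example read qualities are hypothetical, chosen for illustration only; they are not taken from DeepConsensus or from the paper's data.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to a Phred score, Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def yield_at_threshold(read_qualities, threshold):
    """Count reads whose predicted quality meets or exceeds the threshold."""
    return sum(1 for q in read_qualities if q >= threshold)


# Hypothetical per-read predicted qualities (illustration only, not real data).
example_read_qualities = [18.0, 22.5, 31.0, 35.2, 41.7, 44.0]

if __name__ == "__main__":
    for threshold in (20, 30, 40):
        n = yield_at_threshold(example_read_qualities, threshold)
        p = phred_to_error_prob(threshold)
        print(f"Q{threshold}: error probability {p:g}, reads passing: {n}")
```

Because the scale is logarithmic, each 10-point step is a tenfold drop in error rate, which is why a yield gain at Q40 is a much stricter result than the same gain at Q20.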