Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome
Abstract
The major DNA sequencing technologies in use today produce either highly accurate short reads or noisy long reads. We develop a protocol based on single-molecule, circular consensus sequencing (CCS) to generate highly accurate long reads, and apply it to sequence the well-characterized human sample HG002/NA24385 to 28-fold coverage with 13.5 kb CCS reads that average 99.5% accuracy. We apply existing tools to comprehensively detect variants, and achieve precision and recall above 99.9% for SNVs, 95.9% for indels, and 95.2% for structural variants. Nearly all (99.6%) variants are phased into haplotypes, which further improves variant detection. De novo assembly produces a highly contiguous and accurate genome with contig N50 above 15 Mb and concordance over Q45 (99.997%). From manual curation of discordances, we estimate that 1,283 mistakes in the high-quality Genome in a Bottle benchmark are correctable with CCS reads. With only CCS reads, we match or exceed the performance of variant detection with accurate short reads and of assembly with noisy long reads.
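
Two of the metrics quoted above follow standard definitions that a short sketch can make concrete: the Phred scale, which relates a quality value Q to an accuracy of 1 - 10^(-Q/10), and contig N50, the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function names and the toy contig lengths below are illustrative assumptions, not part of the study's pipeline.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred-scaled quality value to an accuracy in [0, 1]."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert an accuracy in [0, 1) to a Phred-scaled quality value."""
    return -10.0 * math.log10(1.0 - accuracy)


def n50(lengths):
    """Return the contig N50 for a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


# Q45 concordance expressed as a percentage, as quoted in the abstract.
print(round(phred_to_accuracy(45) * 100, 3))  # 99.997

# A 99.5% average read accuracy expressed on the Phred scale.
print(round(accuracy_to_phred(0.995), 1))  # 23.0

# Hypothetical toy assembly (lengths in Mb), for illustration only.
print(n50([2, 3, 4, 5, 6]))  # 5
```

On this scale, the 99.5% average read accuracy corresponds to roughly Q23, while the Q45 assembly concordance corresponds to about one discordant base per 31,600 bases.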