Using long and linked reads to generate a new Genome in a Bottle small variant benchmark

Aaron Wenger
Alexander Dilthey
Andrew Carroll
Arkarachai Fungtammasan
Chen-Shan Chin
Chunlin Xiao
Erik Garrison
Ian Fiddes
Jennifer McDaniel
Justin Wagner
Justin Zook
Lindsay Harris
Marc Salit
Mikko Rautiainen
Nate Olson
Qian Zeng
Shilpa Garg
Tobias Marschall
William J. Rowell
Genome Informatics (2019)

Abstract

The Genome in a Bottle (GIAB) consortium performs authoritative characterization of broadly consented human genomes to develop benchmarks for genome sequencing and bioinformatics methods. Here, we describe work towards generating a new GIAB small variant benchmark that incorporates long and linked read sequencing data. The GIAB benchmarks are created by integrating variant calls from multiple sequencing technologies and analysis methods. This integration systematically evaluates and arbitrates amongst the technologies and methods, taking advantage of their strengths and weaknesses to identify consensus calls, and regions containing those calls, that can be relied upon as benchmarks. The short read-based benchmark variants and genomic regions cover 87.8% of assembled bases in chromosomes 1-22 of GRCh37 for one genome (HG002). Because many clinically relevant variants, such as those in CYP21A2, lie outside the current GIAB benchmark regions, expanding these regions is important for medical applications. Short read variant callers perform poorly in segmental duplications and low-complexity, repeat-rich regions, as well as in other regions with high homology. We use PacBio CCS and 10x Genomics reads to expand the GIAB benchmark regions and reduce errors in current regions. Preliminary analyses suggest that long and linked reads could add approximately 256,000 benchmark small variants and expand the coverage of GRCh37 by approximately 120 million base pairs. The draft benchmark covers substantially more challenging regions, such that the false negative rate for short read-based methods increases by a factor of about 7.6 to 17.2 relative to the current benchmark.
Additionally, the draft benchmark corrects a few thousand potential errors in the current benchmark in difficult-to-map regions, such as LINEs, where the current short read-based methods are inaccurate, and it excludes other regions included in the current benchmark, such as copy number variants, where no current methods work well. We generated draft benchmark variant calls, worked with GIAB consortium members to evaluate them, and are currently developing more robust calls in segmental duplications. We are also developing a similar benchmark set for GRCh38, which represents segmental duplications better than GRCh37. To confirm the accuracy of the draft benchmark, we performed long-range PCR and Sanger sequencing to test variants, including those in CYP21A2, where we confirmed 12 variants.
In addition to using long and linked reads with our existing integration approach, we worked with a team to use PacBio CCS, 10x Genomics, and Oxford Nanopore Technologies reads to produce a fully phased diploid assembly of the highly variable MHC region of HG002. The team developed a method that partitions reads by haplotype using information across technologies and then performs de novo assembly on each set of separated reads. The work resulted in a contig spanning the MHC region for each haplotype, and these contigs show high concordance with clinical HLA typing results for HG002. We are currently exploring approaches to incorporate the small variants detected from this assembly into the GIAB benchmark. This work highlights how multiple technologies can be combined to mitigate biases when generating a community resource, and it will enable benchmarking in challenging genomic regions.
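
The two summary statistics quoted in the abstract, the fraction of assembled bases covered by benchmark regions and the fold change in false negative rate between benchmarks, can be illustrated with a minimal Python sketch. This is not the GIAB integration pipeline: the function names, the toy intervals, and the toy variant counts below are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Iterable, Tuple


def covered_bases(intervals: Iterable[Tuple[int, int]]) -> int:
    """Count bases covered by half-open (start, end) intervals, merging overlaps."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval before starting a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def false_negative_rate(true_positives: int, false_negatives: int) -> float:
    """FN / (TP + FN): the share of benchmark variants that a caller missed."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("benchmark contains no variants")
    return false_negatives / total


# Hypothetical benchmark regions on a toy 1,000-base contig (not real data).
toy_regions = [(0, 100), (50, 150), (200, 300)]
coverage_fraction = covered_bases(toy_regions) / 1000

# Hypothetical counts for one caller scored against two benchmarks (not real data).
current_fnr = false_negative_rate(true_positives=15, false_negatives=1)
draft_fnr = false_negative_rate(true_positives=8, false_negatives=8)
fold_increase = draft_fnr / current_fnr

print(f"coverage fraction: {coverage_fraction:.2f}")
print(f"false negative rate fold increase: {fold_increase:.1f}")
```

A larger fold increase under the draft benchmark does not mean the caller became worse; it means the benchmark now includes regions where the caller was already missing variants that the earlier benchmark did not assess.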
