Uncaptured segmental duplication creates artifacts in workflows using GRCh37

Andrew Carroll


Exome and genome sequencing typically rely on a reference genome against which reads are mapped and variants are called. Many clinical and research workflows, if not a majority, still use the previous version of the human reference genome (GRCh37), even though an updated and more complete version (GRCh38) was released in 2013. We present a method that identifies potential artifacts that arise when one reference is used rather than another. We simulate error-free reads from GRCh37 and GRCh38, then map each read set to the opposite reference and call variants. When simulated reads are analyzed against their own reference, no variants are called on GRCh37 and 14 are called on GRCh38. However, when GRCh38 reads are analyzed on GRCh37, GATK4-HC calls 69,720 heterozygous variants. Because the reference is monoploid, a heterozygous call is likely an artifact. Inspection suggests that these calls represent segmental duplications that are captured in GRCh38 but excluded or collapsed in GRCh37. Some overlap with common resources: 32,688 are present in dbSNP, 28,830 are present in gnomAD (with 25,062 listed as filtered for HWE violation), and 19 HET and 199 HOM variants overlap ClinVar. In Genome in a Bottle v3.3.2, 1,123 of these variants overlap the confident regions for HG002, and they are inconsistently labelled as variants or reference. DeepVariant, which is trained on the truth set, seems to have learned about this variability, allowing some measurement of segmental duplication to be made from its output. The reverse comparison, using GRCh37 reads on GRCh38, finds only 30% as many HET variants. This suggests that migrating workflows to GRCh38 eliminates a number of recurrent artifacts, and that this analysis could provide an additional filtration resource for GRCh37 variant files and annotation resources.
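
The abstract does not say how the heterozygous calls were tallied. As a rough illustration only, the counting step could look like the sketch below. It assumes a single-sample, uncompressed VCF with diploid GT fields; the function name `count_genotypes` and the toy records are hypothetical and are not taken from the study.

```python
def count_genotypes(vcf_lines):
    """Tally heterozygous and homozygous-alternate calls for the first sample.

    Assumption: each record carries a diploid GT field for one sample.
    On reads simulated from a monoploid reference, every "het" call is
    a candidate artifact.
    """
    counts = {"het": 0, "hom_alt": 0, "other": 0}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample column present
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        genotype = fields[9].split(":")[format_keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            counts["other"] += 1  # missing or non-diploid call
        elif alleles[0] != alleles[1]:
            counts["het"] += 1
        elif alleles[0] != "0":
            counts["hom_alt"] += 1
        else:
            counts["other"] += 1  # homozygous reference
    return counts


# Toy records for illustration only; these are not data from the study.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:28",
    "chr2\t300\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0|1:31",
    "chr2\t400\t.\tT\tC\t50\tPASS\t.\tGT:DP\t./.:0",
]

if __name__ == "__main__":
    print(count_genotypes(EXAMPLE_VCF))
```

A real pipeline would more likely read compressed VCFs through a dedicated parser, but the classification logic would stay the same.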
