Blindfolding DeepVariant: Surprising Insights from Hiding Information

Lucas Brambrink
Daniel Cook
Mo Samman
Andrew Carroll
(2024)

Abstract

DeepVariant is a deep-learning-based variant caller. It uses a convolutional neural network (CNN) to classify variants based on pileups of sequenced DNA fragments aligned to the candidate sites. Metadata from these reads, such as base quality or mapping quality, is encoded into separate channels, much like the RGB channels of color images.
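To make the channel analogy concrete, here is a minimal sketch of how per-read metadata could be laid out as an image with one value per channel at each pixel. It is an illustration only, not DeepVariant's actual encoder: the `Read` class, the `encode_pileup` function, and the integer encoding of bases are hypothetical, and real pileup images also handle alignment offsets, padding, and value scaling.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical, simplified read covering the whole window."""
    bases: str        # aligned bases, one per column
    base_quals: list  # per-base Phred quality scores
    mapq: int         # mapping quality of the whole read


# One encoder per channel: (read, column) -> pixel value.
# The numeric encodings are illustrative assumptions.
ENCODERS = {
    "read_base": lambda r, i: "ACGT".index(r.bases[i]) + 1,
    "base_quality": lambda r, i: r.base_quals[i],
    "mapping_quality": lambda r, i: r.mapq,
}


def encode_pileup(reads, channels):
    """Return image[row][col] as a list of values, one per requested channel.

    Rows are reads, columns are positions, and the innermost list plays
    the role that (R, G, B) plays in a color image.
    """
    return [
        [[ENCODERS[name](r, i) for name in channels] for i in range(len(r.bases))]
        for r in reads
    ]


reads = [Read("ACG", [30, 20, 10], 60), Read("ACT", [35, 35, 12], 20)]
image = encode_pileup(reads, ["read_base", "base_quality", "mapping_quality"])
```

Each pixel `image[row][col]` then holds three numbers, in the same way an RGB pixel does.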

Among the latest improvements to DeepVariant is the ability to fully customize the set of channels that are passed to the model. We ran a series of ablation experiments in which we 1) removed one of the six base channels and 2) removed all but one channel. These models were effectively blind, to varying degrees, to information normally available to DeepVariant. We therefore expected some degradation in accuracy, but to our surprise the loss in accuracy was not uniform: we observed specific patterns of classification errors.
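The two ablation designs can be sketched as channel lists, one per trained model. This is a hedged illustration rather than the authors' experiment code: the helper names are hypothetical, and the six channel names are an assumption based on DeepVariant's publicly documented base channels (only read_supports_variant is named in this abstract).

```python
# Assumed names for the six base channels; only "read_supports_variant"
# appears in the abstract itself.
BASE_CHANNELS = [
    "read_base",
    "base_quality",
    "mapping_quality",
    "strand",
    "read_supports_variant",
    "base_differs_from_ref",
]


def leave_one_out(channels):
    """Design 1: one configuration per channel, with that channel removed."""
    return {f"without_{c}": [x for x in channels if x != c] for c in channels}


def single_channel(channels):
    """Design 2: one configuration per channel, keeping only that channel."""
    return {f"only_{c}": [c] for c in channels}


loo_configs = leave_one_out(BASE_CHANNELS)
single_configs = single_channel(BASE_CHANNELS)
```

Under these assumptions each design yields six configurations, so the full sweep would train twelve ablated models alongside the unablated baseline.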

From these experiments we uncovered two key findings:
- The read_supports_variant channel is critical for classifying multiallelic variants. Without it, the model cannot differentiate between homozygous alternate (1/1) and heterozygous alternate (1/2) variants.
- In the absence of better information, DeepVariant will learn to use more subtle cues such as read length distribution to differentiate between genotypes.
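One plausible intuition for the first finding, offered as a toy illustration and not as the authors' analysis: at a site where every read carries a non-reference allele, a signal that only records "differs from the reference" looks the same for 1/1 and 1/2, whereas a signal tied to the specific candidate allele does not. The function and the simplified allele strings below are hypothetical, and the sketch deliberately ignores the other channels.

```python
def channel_summaries(read_alleles, ref, candidate_alt):
    """Fraction of reads that differ from ref vs. support the candidate alt."""
    n = len(read_alleles)
    differs = sum(a != ref for a in read_alleles) / n
    supports = sum(a == candidate_alt for a in read_alleles) / n
    return {"differs_from_ref": differs, "supports_variant": supports}


# Homozygous alternate (1/1): every read carries the same alt allele.
hom_alt_reads = ["T"] * 10
# Heterozygous alternate (1/2): reads are split between two alt alleles.
het_alt_reads = ["T"] * 5 + ["G"] * 5

hom_summary = channel_summaries(hom_alt_reads, ref="A", candidate_alt="T")
het_summary = channel_summaries(het_alt_reads, ref="A", candidate_alt="T")
```

In this toy setup both genotypes have all reads differing from the reference, but only half of the reads support the candidate allele in the 1/2 case, which is the kind of distinction a support-specific channel can carry.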