Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines
Abstract
BACKGROUND: Next-generation sequencing (NGS) is the foundation of many studies, providing insights into
questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can
introduce strong biases. To systematically investigate the magnitude of systematic errors in single
nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects
each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the
respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which
allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing
platform’s impact.
RESULTS: The number of detected variants and variant classes per individual depended strongly on the experimental
setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup,
indicating potential systematic biases. Insertion/deletion polymorphisms (indels) showed lower
concordance than single nucleotide polymorphisms (SNPs). Discrepancies in the absolute numbers of indels
were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content.
Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably
improved concordance between the respective setups.
CONCLUSION: We provide empirical evidence of systematic heterogeneity in variant calls between alternative
experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic
data with harmonized pipelines when integrating data from different studies.
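
To make the notions of concordance and setup-specific variants concrete, the following is a minimal sketch and not the study's actual method. It assumes that each setup's calls for one individual are available as a set of (chromosome, position, reference allele, alternative allele) tuples, and it uses the Jaccard index as the concordance measure; the abstract does not state which metric was used. The function names, the setup labels, and the toy variants are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy call sets for one individual. Each variant is a
# (chromosome, position, reference allele, alternative allele) tuple.
# These values are invented for illustration, not taken from the study.
TOY_CALLS = {
    "hiseq_x": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")},
    "hiseq": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "cg": {("chr1", 100, "A", "G"), ("chr2", 75, "T", "C")},
}


def jaccard(a, b):
    """Shared variants divided by all variants seen in either set."""
    union = a | b
    # Two empty call sets are treated as fully concordant (an assumption).
    return len(a & b) / len(union) if union else 1.0


def pairwise_concordance(calls):
    """Jaccard index for every pair of setups, keyed by sorted name pair."""
    return {
        (x, y): jaccard(calls[x], calls[y])
        for x, y in combinations(sorted(calls), 2)
    }


def unique_to_setup(calls):
    """Variants called by exactly one setup and by none of the others."""
    unique = {}
    for name, variants in calls.items():
        others = set().union(*(v for n, v in calls.items() if n != name))
        unique[name] = variants - others
    return unique


if __name__ == "__main__":
    for pair, score in pairwise_concordance(TOY_CALLS).items():
        print(pair, round(score, 3))
    for name, variants in unique_to_setup(TOY_CALLS).items():
        print(name, sorted(variants))
```

In practice the variant sets would be parsed from VCF files after normalizing indel representation, since differently left-aligned indels would otherwise be counted as discordant even when they describe the same event.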