Please use this identifier to cite or link to this item: http://doi.org/10.25358/openscience-9990
Authors: Kallenborn, Felix
Advisor: Schmidt, Bertil
Hildebrandt, Andreas
Title: High-performance processing of next-generation sequencing data on CUDA-enabled GPUs
Online publication date: 14-Feb-2024
Year of first publication: 2024
Language: english
Abstract: With the technological advances in the field of genomics and sequencing, the processing of vast amounts of generated data becomes more and more challenging. Nowadays, software for processing large-scale datasets of sequencing reads may take hours to days to complete, even on high-end workstations. This explains the need for new approaches to achieve faster, high-performance applications. In contrast to traditional CPU-based software, algorithms utilizing the massively-parallel many-core architecture and fast memory of GPUs are potentially able to deliver the desired performance in many fields. In this thesis, we introduce two novel GPU-accelerated applications, CARE and CAREx, for common steps in sequence processing pipelines, error correction and read extension of Next Generation Sequencing (NGS) Illumina data, to improve the results of down-stream data analysis. To the best of our knowledge, CARE and CAREx are the first modern GPU-accelerated solutions for the respective problems. A key component of our algorithm is the identification of similar DNA sequences within a dataset. For this purpose, we developed a minhashing-based index data structure for large-scale read datasets. In conjunction with our fast bit-parallel shifted hamming distance computations, this allows for the efficient identification of similar reads. The resulting set of similar sequences is subsequently arranged into a gap-free multiple-sequence alignment to solve the problem at hand. Sequencing machines introduce both systematic errors and random errors. CARE, Context-Aware Read Error corrector, accurately removes errors introduced by NGS sequencing machines during the initial sequencing of a biological sample. With the help of a pre-trained Random Forest, CARE generates two orders-of-magnitude fewer false positives than its competitors. At the same time, it shows similar numbers of true positives. Read extension describes the process of elongating DNA sequences. The presence of longer sequences improves the resolution of more, larger structures within a genome. CAREx, Context-Aware Read Extender, produces longer sequences, so called pseudo-long reads, by connecting the two reads of read pairs which were sequenced in close proximity. Evaluation shows that CAREx produces significantly more highly accurate pseudo-long reads than the state-of-the-art. With algorithms tailored towards high-performance GPU computations, both CARE and CAREx run significantly faster than the CPU-based competitors, while, at the same time, produce more accurate results. The processing of a large Human dataset with 30x coverage with CARE requires less than 30 minutes using a single A100 GPU. This time can be further reduced down to 10 minutes on multi-GPU systems. In contrast, CPU-based tools like Musket or BFC take 3 hours and 1.5 hours, respectively. Read extension of a Human dataset with CAREx takes 3.3 hours to complete on a single GPU, whereas Konnector2 requires over a day to complete. This shows that large-scale sequence processing can greatly benefit from the usage of GPUs, and that multiple-sequence alignment-based algorithms should be considered despite their increased complexity because they provide great accuracy. While our general building blocks have been tailored towards our needs for error correction and read extension, they could also prove useful in other GPU-accelerated applications that process sequence data.
DDC: 004 Informatik
004 Data processing
Institution: Johannes Gutenberg-Universität Mainz
Department: FB 08 Physik, Mathematik u. Informatik
Place: Mainz
ROR: https://ror.org/023b0x485
DOI: http://doi.org/10.25358/openscience-9990
URN: urn:nbn:de:hebis:77-openscience-cb1c480e-ae28-4aca-9c82-46fa0514f5f94
Version: Original work
Publication type: Dissertation
License: CC BY-SA
Information on rights of use: https://creativecommons.org/licenses/by-sa/4.0/
Extent: ix, 140 Seiten ; Illustrationen, Diagramme
Appears in collections:JGU-Publikationen

Files in This Item:
  File Description SizeFormat
Thumbnail
highperformance_processing_of-20240129114611062.pdf2.57 MBAdobe PDFView/Open