Novel bioinformatic tools and methods to study Next Generation Sequencing data with a focus on DNA repair and genome stability

Dissertation (Open Access)

Abstract

Next Generation Sequencing is a widely used technology that enables precise identification and quantification of nucleic acids. Advanced sequencing-based experimental protocols have further enabled the investigation of nucleic acid modifications, organization, interactions, and regulation, among other properties. This thesis introduces three novel methodologies, implemented as software packages, that facilitate the comprehensive analysis, visualization and interpretation of *omics* sequencing data.

In *Chapter 1* we describe the problem of PCR clonal artifacts in RNA-seq and enrichment-based assays, such as ChIP-seq. We present the tool *dupRadar*, a novel method for distinguishing these PCR artifacts from the normal read duplication that arises from natural over-sequencing of highly expressed genes or enriched loci. We apply our method to detect over-sequenced libraries of limited complexity, resulting from low amounts of input material, in a synthetic dataset and in public bulk RNA-seq and single-cell RNA-seq datasets. We found that datasets generated from lower input material exhibit limited library complexity, leading to increased duplication rates even among lowly expressed genes. Finally, we ran a differential expression analysis to demonstrate that even low levels of PCR artifacts can influence downstream analysis and data interpretation.

*Chapter 2* introduces *rrvgo*, a novel tool for interpreting large lists of Gene Ontology (GO) terms. The package gives access to several semantic similarity methods; here, I apply the *Relevance* method to GO terms significantly enriched in the publicly available gene expression data from the breast cancer study published by Schmidt et al. in 2008, comparing grade III to grade I breast cancer patients. This approach identifies clusters of potentially redundant terms with high correlation of information content within the set of GO terms. We further demonstrate the utility of rrvgo's visualizations, which facilitate the detection and refinement of a non-redundant set of GO terms for more focused biological interpretation.

*Chapter 3* introduces *BreakTag*, an innovative approach for genome-wide identification and quantification of DNA double-strand breaks (DSBs) and their structural characteristics at single-nucleotide resolution using high-throughput sequencing. Additionally, we developed *breakinspectoR*, a bioinformatics pipeline designed to detect, quantify and study the end structure of Cas9-induced DSBs in BreakTag data. Using BreakTag, we analyzed cleavage patterns by SpCas9 across three genome-wide CRISPR libraries, comprising 3,500 distinct single-guide RNAs, and identified over 150,000 on- and off-target cleavage sites. Analysis of DSB ends revealed that approximately 35% of the identified breaks exhibit staggered ends. A machine learning model trained on target site sequence composition and DSB end structure data revealed that the protospacer sequence significantly influences Cas9 incision patterns. Furthermore, by examining matched datasets of Cas9 cleavage sites and subsequent repair outcomes, we found a link between staggered breaks and single-nucleotide insertions. In conclusion, these findings demonstrate that the structure of Cas9 DSB ends is sequence-dependent, suggesting that guide RNAs can be strategically designed to produce precise, predictable repair outcomes. This approach may provide new opportunities for correcting diseases caused by single-nucleotide deletions.

Overall, during my PhD and in collaboration with wet-lab researchers, I have developed novel tools and methods for a broad range of applications of *omics* sequencing data, with a special focus on the study of DNA repair and genome stability.
