Enhancing analysis and interpretation workflows for transcriptome data with an interactive R/Bioconductor toolkit

Item | Dissertation | Open Access

Abstract

In recent years, bulk RNA-sequencing (RNA-seq) has become the gold standard for transcriptome analysis, leading to a significant increase in the volume of data and results being generated. This growth has also made interpreting these results increasingly challenging, particularly for functional enrichment analyses. Functional enrichment analysis constitutes a fundamental step in the analysis of various omics datasets, aiming to identify differentially regulated pathways between experimental conditions and to draw insights into the underlying molecular mechanisms of diseases and specific phenotypes. Due to this widespread use, numerous tools and implementations are available to calculate these results. Despite their utility, existing methods often yield impractical outputs comprising extensive lists of gene sets, and the inherent redundancy among the reported pathways impedes hypothesis generation and synthesis. Additionally, prevalent approaches for processing enrichment results do not consider network-based information, a key factor that could enhance contextualisation by incorporating interactions among gene set members. To address these issues and facilitate the analysis and interpretation of bulk RNA-seq data, we previously published a standardised workflow for bulk RNA-seq data analysis, promoting interactive and reproducible processes using the R packages developed in our group [Ludt et al., 2022]. This workflow provides step-by-step documentation of a typical bulk RNA-seq data analysis, guiding users through the individual steps, showcasing best practices and promoting reproducibility. However, during our work, we recognised a significant gap in the workflow: the need for a tool specifically designed to enhance, simplify and standardise the interpretation of functional enrichment results.
While our previously developed package, GeneTonic, provides basic functionality for exploring enrichment results, it is not tailored to this task. Consequently, many of our collaboration partners still resorted to the classical way of interpreting functional enrichment results: manually inspecting the extensive list of results in search of patterns and interesting gene sets, which they then explored further using databases such as the Gene Ontology (GO) or the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, this process is prone to bias, as familiar and expected gene sets are easily recognised, while novel or unexpected findings may be overlooked, sometimes simply because of the sheer number of available results. To evaluate whether this need was specific to our research group and collaboration partners or reflected a larger issue within the scientific community, I conducted a literature review. The findings confirmed that inadequate documentation and reporting of functional enrichment methods are widespread, with over 75% of reviewed studies failing to properly detail their analysis, making it difficult or impossible for peers to verify and reproduce the results. Additionally, the review revealed that published studies frequently highlight gene sets simply because they appear at the top of the list of results due to their statistical significance. This suggests that the large lists of functional enrichment results are not fully studied, which could lead to important results and insights being lost. As part of this thesis, I developed a tool that streamlines and simplifies the interpretation of functional enrichment results. These efforts culminated in GeDi, an R/Bioconductor package that aggregates gene sets into meaningful clusters based on various measures of (dis)similarity, thereby reducing redundancy and improving the clarity of the results.
GeDi achieves this by implementing a suite of Gene set Distance metrics and clustering algorithms. Additionally, GeDi integrates protein-protein interaction information into the analysis to provide a more comprehensive view of the biological processes at play. GeDi supports interactive exploration and detailed drill-down analyses through an integrated Shiny application, while its stand-alone functions allow seamless integration into existing workflows. By offering multiple entry points and accommodating a wide range of use cases, GeDi caters to a diverse audience, which is particularly relevant given the increasing volume of enrichment analyses being conducted. With interactive visualisations and result aggregations, GeDi not only reduces the time researchers need to analyse and interpret results but also helps minimise the bias introduced by manual inspection, which is the current standard practice in many analyses. In doing so, GeDi facilitates a more efficient and objective interpretation of the data, as showcased in this thesis on publicly available bulk RNA-seq data. With its functionality for interactive data exploration, flexible stand-alone features, and seamless integration into our standardised bulk RNA-seq workflow, GeDi aims to improve the reporting standards of functional enrichment analyses in published research. Additionally, GeDi promotes reproducibility through an automated report generation feature. By making data interpretation more efficient, accessible and reproducible, GeDi has the potential to drive new research efforts and simplify the generation of novel hypotheses, ultimately advancing the field of omics analysis.
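
To make the idea of distance-based gene set aggregation concrete, the following is a minimal sketch in Python, not GeDi's actual R API or implementation. It assumes a Jaccard distance on gene membership and a simple rule that merges gene sets whose distance falls at or below a threshold; the function names, the threshold of 0.5 and the example gene sets are all hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of gene identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty gene sets are treated as identical (assumption).
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_gene_sets(gene_sets, threshold=0.5):
    """Group gene sets whose pairwise distance is <= threshold.

    Uses a union-find structure, so clusters are the connected components
    of the graph linking sufficiently similar gene sets. This is an
    illustrative stand-in, not one of GeDi's clustering algorithms.
    """
    names = list(gene_sets)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for x, y in combinations(names, 2):
        if jaccard_distance(gene_sets[x], gene_sets[y]) <= threshold:
            parent[find(x)] = find(y)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical enrichment results: gene set name -> member genes.
example = {
    "T cell activation": {"CD3E", "CD4", "LCK", "ZAP70"},
    "T cell receptor signalling": {"CD3E", "LCK", "ZAP70", "LAT"},
    "Glycolysis": {"HK1", "PFKM", "PKM"},
}

if __name__ == "__main__":
    for cluster in cluster_gene_sets(example):
        print(cluster)
```

In this toy input, the two T cell gene sets share three of five distinct genes and are therefore grouped, while the glycolysis set remains on its own. This is the kind of redundancy reduction the abstract describes, shown here without the protein-protein interaction information that GeDi additionally incorporates.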
