Enhancing analysis and interpretation workflows for transcriptome data with an interactive R/Bioconductor toolkit
Abstract
In recent years, bulk RNA sequencing (RNA-seq) has become the gold standard
for transcriptome analysis, leading to a significant increase in the volume of data
and results being generated. Consequently, this growth has also made the task of interpreting these results increasingly challenging, particularly for functional enrichment
analyses. Functional enrichment analysis constitutes a fundamental step in the analysis of
various omics datasets, aiming to identify differentially regulated pathways between experimental
conditions and to draw insights into the underlying molecular mechanisms of
diseases and specific phenotypes. Due to this widespread use, there are numerous tools
and implementations available to calculate these results. Despite their utility, existing
methods often yield impractical outputs comprising extensive lists of gene sets, and the
inherent redundancy among the reported pathways impedes hypothesis generation and synthesis.
Additionally, prevalent approaches for processing enrichment results rarely consider
network-based information, a key factor that could enhance contextualisation
by incorporating interactions among gene set members.
To address these issues and facilitate the analysis and interpretation of bulk
RNA-seq data, we previously published a standardised workflow for bulk RNA-seq data
analysis, promoting interactive and reproducible processes using the R packages developed
in our group [Ludt et al., 2022]. This workflow provides step-by-step documentation
of a typical bulk RNA-seq data analysis, guiding users through the individual steps,
showcasing best practices and promoting reproducibility. However, during our work, we
recognised a significant gap in the workflow: the need for a tool specifically designed to
enhance, simplify and standardise the interpretation of functional enrichment results.
While our previously developed package, GeneTonic, provides basic functionality for
enrichment result exploration, it is not tailored to this task. Consequently, many
of our collaboration partners still resorted to the classical way of functional enrichment
result interpretation: a manual inspection of the extensive list of results searching for
patterns and interesting gene sets, which they further explored using databases such as
the Gene Ontology (GO) or the Kyoto Encyclopedia of Genes and Genomes (KEGG).
However, this process can be prone to bias, as familiar and expected gene sets may be
easily recognised, while novel or unexpected findings might be overlooked, sometimes
simply due to the sheer amount of available results.
To evaluate whether this need was specific to our research group
and collaboration partners or reflected a broader issue within the scientific community, I
conducted a literature review. The findings confirmed that inadequate documentation
and reporting of functional enrichment methods are widespread across the scientific
community, with over 75% of reviewed studies failing to properly detail their analysis, thus
making it difficult, if not impossible, for peers to verify and reproduce the results. Additionally,
the review revealed that published studies frequently highlight gene sets simply because
they appear at the top of the list of results due to their statistical significance. This
suggests that the large lists of functional enrichment results are often not examined in full,
which could lead to important results and insights being lost.
As part of this thesis, I developed a tool which streamlines and simplifies the interpretation
of functional enrichment results. These efforts are bundled in GeDi, an
R/Bioconductor package that aggregates gene sets into meaningful clusters based on
various measures of (dis)similarity, thereby reducing redundancy and improving the clarity
of the results. GeDi achieves this by implementing a suite of Gene set Distance
metrics and clustering algorithms. Additionally, GeDi integrates protein-protein interaction
information into the analysis to provide a more comprehensive view of the biological
processes at play.
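To illustrate the idea of grouping redundant gene sets by a distance measure, the following is a minimal sketch. It is written in Python rather than R and does not use or reproduce GeDi's actual functions. It assumes the Jaccard distance as one possible (dis)similarity measure and a simple threshold-based single-linkage grouping; the function names, the threshold of 0.5 and the toy gene sets are all hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two gene sets (0 = identical, 1 = disjoint)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets are treated as identical (an assumption of this sketch).
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_gene_sets(gene_sets, threshold=0.5):
    """Group gene sets whose pairwise Jaccard distance is <= threshold.

    Single linkage: two sets land in the same cluster if a chain of
    sufficiently similar sets connects them (implemented with union-find).
    """
    names = list(gene_sets)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right in combinations(names, 2):
        if jaccard_distance(gene_sets[left], gene_sets[right]) <= threshold:
            parent[find(left)] = find(right)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy input: two overlapping gene sets and one unrelated set.
toy_sets = {
    "immune response": {"IL6", "TNF", "IL1B", "CXCL8"},
    "inflammatory response": {"IL6", "TNF", "IL1B", "NFKB1"},
    "cell cycle": {"CDK1", "CCNB1", "MKI67"},
}
clusters = cluster_gene_sets(toy_sets, threshold=0.5)
```

In this toy input, the two overlapping sets share three of five distinct genes and are therefore merged into one cluster, while the unrelated set stays on its own. A different distance measure or clustering algorithm would only require swapping the corresponding function.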
GeDi supports interactive exploration and detailed drill-down analyses within its
framework through an integrated Shiny application, while also allowing for seamless
integration into existing workflows via its stand-alone functionality. By offering multiple
entry points and accommodating a wide range of use cases, GeDi caters to a diverse
audience, which is particularly relevant given the increasing volume of enrichment analyses being
conducted. With interactive visualisations and result aggregations, GeDi not only
reduces the time required for researchers to analyse and interpret results but also helps
minimise bias that would be introduced by manual inspection, which is the current standard
practice in many analyses. In doing so, GeDi facilitates a more efficient and objective
interpretation of the data, as showcased in this thesis on publicly available bulk RNA-seq
data.
With its functionality for interactive data exploration, flexible stand-alone features,
and seamless integration into our standardised bulk RNA-seq workflow, GeDi aims to
improve the reporting standards of functional enrichment analyses in published research.
Additionally, GeDi promotes reproducibility through an automated report generation feature.
By making data interpretation more efficient, accessible and reproducible, GeDi has
the potential to drive new research efforts and simplify the generation of novel hypotheses,
ultimately advancing the field of omics analysis.