Machine learning-assisted identification of factors affecting variability in multi-omics data
Abstract
Recent advances in high-throughput technologies, together with computational innovations, have enabled the study of biological systems at multiple levels, giving rise to integrative omics approaches. Multi-omics research refers to efforts that combine multiple omics datasets, including those on genes, transcripts, and proteins, obtained from the same samples to improve our understanding of biological processes. Over the past decades, omics technologies have led to new insights into the complex molecular mechanisms underlying abnormal phenotypes and diseases, thus revolutionizing biomedical and biological research. This has resulted in the generation of a large volume of biological data, including data available from open-access sources. Nonetheless, comprehensive analysis of such data is not trivial and is particularly hampered by the high dimensionality and noisiness of the data, as well as by the lack of standardized data analysis methods and pipelines. Therefore, it is necessary to focus on the integration of omics data in the context of phenotypes and conditions of interest, which motivates the present research. This thesis investigates factors affecting biological and technical variability in the context of transcriptomics studies by applying Machine Learning (ML) and Integrative Data Analysis (IDA). In particular, the thesis proposes the design and implementation of: (I) a bioinformatics pipeline (FAVSeq) for identifying key effectors of variation in multimodal RNA Sequencing (RNA-Seq) profiles from matched bulk and single-cell experiments and (II) an analysis tool (regulAS) for ML- and IDA-based study of the alternative splicing regulome, drawing on large-scale RNA-Seq data from cancer patients and healthy individuals in public omics data sources.
The findings and tools presented in this thesis provide a basis for further experimental investigation of the identified factors, as well as for subsequent improvements in RNA-Seq data preparation and downstream analysis that can facilitate fundamental research and biomedical applications based on RNA sequencing technologies.