From DNA sequences to cell types by detecting regulatory genomic regions in sequencing data
Abstract
A central question in biology today is which genetic and epigenetic factors
regulate gene expression, and in which cases their deregulation contributes
to the development of abnormal phenotypes or diseases.
Innovations in genome sequencing techniques and
corresponding data processing algorithms have enabled unbiased
interrogation of the different genomic and epigenomic components of
transcription at nucleotide resolution. Therefore, it is now possible to use and
integrate different types of data for both bulk and single-cell samples, and to
understand the molecular components of gene expression regulation using
ad-hoc reproducible computational analysis.
As an interdisciplinary field, bioinformatics takes advantage of different
quantitative disciplines, such as statistics and machine learning. This allows
the implementation of detailed analyses to support and elucidate specific
fundamental discoveries, and also to test unexpected predictions coming from
exploratory data analysis. In particular, the use of bioinformatics is a necessity
in the study of the genomic basis of gene regulation given the complexity of
the data produced. Thus, applying existing bioinformatics methods and
developing novel ones improves the interpretation of new data by
integrating several data types from multiple sources.
In this thesis I applied and developed bioinformatics methods to
investigate basic biological questions in the genomic study of epigenetic gene
regulation: i) I created a pipeline for whole-genome bisulfite sequencing data
analysis to better understand how genes and DNA sequences are demethylated
by GADD45 proteins and how this might be linked to a key stage of
development in mouse embryonic stem cells (mESCs); ii) I developed
a metric based on the Gini index to evaluate unsupervised clustering results
from several computational methods that were tested for identifying
types of peripheral blood mononuclear cells (PBMCs) in single-cell
ATAC-seq samples for which cell labels were not provided; and iii) I
developed an algorithm to extract variable regions in ChIP-seq data that can
improve the identification of target-specific binding sites of different proteins
in several cell lines of the ENCODE project. Together, these three studies
contribute significantly to the interpretation of genomic data in the
bioinformatic study of epigenetic gene regulation.
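
As an illustration of the quantity underlying the second study, the following is a minimal sketch of the standard Gini index for a list of non-negative values. It is not the metric developed in the thesis, whose exact definition is not given in this abstract. The function name `gini_index`, the cluster labels, and the example values are hypothetical; the sketch assumes the index is applied to a per-cluster summary (for example, the mean accessibility of a marker gene in each cluster), so that a value near 0 means the signal is spread evenly across clusters and a value near 1 means it is concentrated in few clusters.

```python
def gini_index(values):
    """Gini index of non-negative values: 0 = perfectly even, (n-1)/n = maximal."""
    x = sorted(float(v) for v in values)
    n = len(x)
    if n == 0:
        raise ValueError("need at least one value")
    if x[0] < 0:
        raise ValueError("values must be non-negative")
    total = sum(x)
    if total == 0:
        # All values are zero: treat the distribution as perfectly even.
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * v for rank, v in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical per-cluster mean accessibility of one marker gene
# (made-up numbers for illustration only, not data from the thesis).
marker_signal_by_cluster = {"cluster_1": 0.02, "cluster_2": 0.03, "cluster_3": 0.90}
score = gini_index(marker_signal_by_cluster.values())
print(f"Gini index across clusters: {score:.3f}")
```

Under this assumption, a clustering that isolates a cell type would concentrate that type's marker signal in one cluster and yield a higher index than a clustering that mixes cell types, which is what makes the index usable when no cell labels are available.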