RNA-Seq and CoverageAnalyzer reveal sequence dependent reverse transcription signature of N-1-methyladenosine
Abstract
The discovery of pseudouridine (Psi) as the fifth residue of RNA 60 years ago marked the beginning of a steady expansion of the known alphabet of ribonucleic acids, which currently comprises around 150 different nucleotide derivatives. Mapping these modifications and associating them with functions are the central aims of one of the most topical and dynamic areas of modern life sciences, the exploration of the epitranscriptome. Beyond the advanced state of knowledge concerning the densely and systematically modified tRNAs and rRNAs, major breakthroughs have been achieved for coding transcripts in recent years. The basis for detection is a modification-specific behavior of reverse transcriptase (RT) during the transcription of RNA into cDNA, termed an RT signature. The combination of Next Generation Sequencing (NGS) with specific labeling or immunoprecipitation has revealed individual modification landscapes in mRNA, e.g. for Psi, m5C and m6A, in part with evidence for regulatory relevance.
This PhD thesis addresses the development of bioinformatic methods for the description and identification of nucleotide modifications based on Deep Sequencing data. The concept was demonstrated by characterizing the RT signature of N-1-methyladenosine (m1A). This adenosine residue, methylated at the Watson-Crick edge, occurs in tRNAs of bacteria, archaea and eukarya, and attracted attention through its recent discovery in numerous mammalian mRNAs. Although the software developed in this project also allows comparison of RT effects after differential chemical treatment, the analysis of m1A relied on native signatures only, i.e. without specific labeling or antibody-mediated enrichment. Artificially induced m1A instances are of interest in structural probing of RNA, in which the local methylation efficiency is interpreted as the accessibility of nucleotides to the solvent, i.e. as the degree of structuring of RNA strands. Detection is based on the tendency of the modification to block RT, which is reflected by an accumulation of abortive products at the respective position in gel electrophoresis or in sequencing profiles of primer extension assays. In addition, according to previous studies, read-through products exhibit a preferred composition of misincorporated cDNA residues at m1A sites.
The resulting dual RT signature of m1A, consisting of arrest and misincorporation rates, was characterized and refined in the present work on the basis of natural instances in tRNA and rRNA, with the aim of improved resolution and enhanced recognition potential. Arrest and read-through products were captured by a specialized protocol for the preparation of sequencing-ready cDNA libraries. The digital analysis was carried out by comparing sequencing data to reference sequences. The core of the workflow is the standalone software CoverageAnalyzer, which was developed within the scope of this work as a universal platform for processing, visualization and screening of sequencing profiles for signature features. In this way, m1A signatures were extracted and then analyzed by descriptive and inferential statistics, also with regard to how well they can be discriminated from unmodified or otherwise modified adenosines with noticeable RT features. Supervised machine learning with Random Forest models, applied to the recognition of m1A in adenosine pools graded by difficulty of distinction, shed light on the usefulness of eight formulated features, including a context-sensitive descriptor of RT stops. Furthermore, it showed the benefit of using mismatch- and arrest-related information simultaneously and highlighted the special nature of m1A among native RT signatures of adenosine derivatives, which allows the sensitive and specific detection of m1A.
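The two components of the RT signature described above can be illustrated with a minimal Python sketch. It is not taken from CoverageAnalyzer; the function names and the exact definitions are assumptions made for illustration. In particular, the arrest rate is assumed here to be the relative drop in read coverage from the 3'-adjacent position to the candidate site, and the mismatch rate is the fraction of reads at the site that do not show the reference base.

```python
# Hypothetical helpers, not part of CoverageAnalyzer.

def mismatch_rate(base_counts, ref_base):
    """Fraction of reads covering a site whose base differs from the reference."""
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    mismatches = total - base_counts.get(ref_base, 0)
    return mismatches / total


def mismatch_composition(base_counts, ref_base):
    """Relative share of each non-reference base among the mismatches."""
    mism = {b: c for b, c in base_counts.items() if b != ref_base}
    total = sum(mism.values())
    if total == 0:
        return {}
    return {b: c / total for b, c in mism.items()}


def arrest_rate(coverage_site, coverage_3prime):
    """Assumed definition: relative coverage drop from the 3'-adjacent
    position to the site itself, clamped to 0 if coverage does not drop."""
    if coverage_3prime == 0:
        return 0.0
    return max(0.0, 1.0 - coverage_site / coverage_3prime)


# Toy numbers chosen only to show the arithmetic; they are not measured data.
counts = {"A": 60, "T": 30, "G": 8, "C": 2}
print(mismatch_rate(counts, "A"))
print(mismatch_composition(counts, "A"))
print(arrest_rate(coverage_site=100, coverage_3prime=250))
```

Features of this kind, computed per adenosine position, are the sort of input a classifier such as a Random Forest could be trained on; the eight features actually used in the thesis are not specified in this abstract.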
Previously unreported m1A sites in human, mouse and T. brucei were discovered by signature comparison and sequence homology. With the help of synthetic oligoribonucleotides, the picture was refined to include the effects of incomplete modification levels. Artificial instances moreover confirmed a central result of this study: the composition of mismatches in m1A's RT signature depends on the sequence context, namely the identity of the 3'-adjacent nucleotide.
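One way to examine such a context dependence is to pool mismatch counts over all sites that share the same 3'-adjacent nucleotide and compare the resulting compositions. The following Python sketch shows this grouping step under stated assumptions: the function name and the input layout (reference sequence written 5' to 3', zero-based site index, mismatch counts per base) are hypothetical and do not reflect the thesis software.

```python
from collections import Counter, defaultdict


def pool_mismatches_by_3prime_neighbor(sites):
    """Pool mismatch counts by the nucleotide 3' of each site.

    `sites` is an iterable of (sequence, index, mismatch_counts), where
    `sequence` is the reference written 5'->3', `index` is the zero-based
    position of the candidate site, and `mismatch_counts` maps each
    non-reference base to its read count at that site.
    Returns {neighbor: {base: fraction of pooled mismatches}}.
    """
    pooled = defaultdict(Counter)
    for sequence, index, mismatch_counts in sites:
        if index + 1 >= len(sequence):
            continue  # no 3' neighbor at the end of the reference
        neighbor = sequence[index + 1]
        pooled[neighbor].update(mismatch_counts)

    composition = {}
    for neighbor, counts in pooled.items():
        total = sum(counts.values())
        if total > 0:
            composition[neighbor] = {b: c / total for b, c in counts.items()}
    return composition


# Toy input with arbitrary counts, used only to demonstrate the grouping.
example_sites = [
    ("GGAUC", 2, {"T": 30, "G": 10}),
    ("CCAUG", 2, {"T": 10, "G": 30}),
    ("UUAGC", 2, {"T": 5, "C": 15}),
    ("GGCCA", 4, {"T": 99}),  # skipped: site is the last position
]
print(pool_mismatches_by_3prime_neighbor(example_sites))
```

Pooling raw counts weights each site by its read depth; averaging per-site fractions instead would weight all sites equally, and which of the two is appropriate depends on the question being asked.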
The analytical methodology and specialized software developed here, together with the findings on m1A's RT signature and their implications for other modifications, prepare the ground for a revision of existing predictions and for the advancement of mapping strategies for the epitranscriptome.