Systematic interaction interface and variant characterization using protein interaction profiling

Dissertation
submitted for the degree of Doctor of Natural Sciences (Doktor der Naturwissenschaften)
at the Department of Biology
of Johannes Gutenberg-Universität Mainz

Dalmira Hubrich
born on 28.07.1993 in Qostanay, Kazakhstan

Mainz, October 2024

Dean: Prof. Dr. Eckhard Thines
1st reviewer: Prof. Dr. Brian Luke
2nd reviewer: Dr. Anton Khmelinskii
Date of the oral examination: 21.10.2024

"The important thing is to never stop questioning." – Albert Einstein

Acknowledgements

First of all, I would like to express my deepest gratitude to my thesis supervisor, Katja Luck, for her contagious enthusiasm, inspiration, and guidance throughout my journey into the fascinating world of protein interactions. Her constant support, patience, and willingness to engage in discussions on any subject at any time have been invaluable. Besides keeping me motivated, she also taught me what it means to be a true scientist: how to embrace doubt, maintain a healthy sense of uncertainty, and always stay critical and specific in my work. I am especially thankful for her significant contributions to my publications, her immense effort in mentoring and training, and her unwavering dedication to the development and completion of this thesis. Moreover, I am deeply appreciative of the opportunities she provided to discuss and collaborate with other scientists, which allowed me to feel part of a larger scientific community. This sense of belonging and engagement has been instrumental in my development as a researcher.

I am immensely thankful to the Luck research group for their constant support, both in the lab and with their computational efforts. Their professionalism, collaboration, and willingness to contribute to this project have been invaluable. I am particularly grateful to my colleagues for helping establish protocols, sharing knowledge, and always being there to lend a hand.
I am also grateful to our lab manager, Mareen Welzel, for her unwavering support and for keeping our spirits high with a steady supply of chocolate. Her sweet contributions not only made our workdays brighter but also helped us power through many challenging experiments!

Special thanks to Dr. Chop Yan Lee, my partner in crime, for our successful teamwork and collaboration. Beyond the lab, he has also become a dear friend, and I am grateful for his presence and support throughout my PhD journey.

I would also like to extend my sincere thanks to my Thesis Advisory Committee (TAC), Prof. Dr. Brian Luke and Dr. Sandra Schick, for their invaluable contributions to my work. Their advice and shared experiences helped me grow as a scientist, accelerated my learning, and contributed greatly to my progress throughout this journey.

I would also like to extend my sincere thanks to Dr. Julian Konig and his research group, particularly Stefanie Ebersberger and Dr. Miriam Murloz, as well as Prof. Dr. Michael Sattler and his group, especially Dr. Klara Hipp, Dr. Hyun-Seo Kang, and Dr. Santiago Martinez-Lumbreras. Being part of such a fruitful collaboration was a valuable experience, and I am deeply grateful for the opportunity to engage in meaningful discussions, share ideas, and learn from each of them.

I would also like to extend my gratitude to the Protein Production and Microscopy facilities and the Media Lab at the IMB Institute for their exceptional support. Their assistance with producing efficient reagents, their expert consulting, and the provision of cutting-edge equipment were crucial in addressing the scientific questions in my study.

I would like to thank the Emmy Noether funding, which provided me with the opportunity to pursue my PhD. This support has been instrumental in addressing significant scientific questions, contributing to new knowledge, and applying current insights for the benefit of human society.
I would like to express my gratitude to the PhD program and the IMB community for the invaluable experience of pursuing my PhD. The chance to meet and collaborate with esteemed scientists, exchange knowledge, and learn from recognized experts has been a profound learning experience. This opportunity has not only deepened my understanding of science but also helped me appreciate what it means to be a scientist.

Finally, my heartfelt thanks go to my family, especially my dearest husband and best friend, Yannik Hubrich. His unwavering support, patience, and belief in me throughout my PhD journey have been invaluable. I cannot imagine reaching this point without his constant encouragement and understanding. His sacrifices and steadfast presence have been a cornerstone of my success. Yannik has been a true partner in every sense, sharing in the highs and lows, and his love and support have been a source of strength and inspiration. I am deeply grateful for his belief in me and for being my rock throughout this demanding journey.

I am also grateful to my dog, Sushi, who has been a calm and patient companion during the demanding times when I had to fully immerse myself in science. His quiet presence and unspoken understanding have been a source of comfort and joy.

I am also grateful for the support and faith in me shown by my parents and my sister. Even though we are far apart, they constantly support my desire to learn, explore, and build a career. Their unconditional love always warms my heart and motivates me to become better for them. Thank you for instilling in me a love of knowledge!

Contents

1 Introduction
  1.1 The Complexity of Human Genetic Variation
    1.1.1 Factors contributing to the complexity of variant interpretation
    1.1.2 Comparative PPI profiling as the strategy to interpret variant effect
  1.2 Modular architecture of proteins
    1.2.1 Folded domains
    1.2.2 Intrinsically disordered regions
    1.2.3 Short linear motifs
  1.3 Domain-motif interfaces
  1.4 Predicting the known occurrence of DMIs in protein interactions using sequence-based approaches
  1.5 Systematic experimental validation of putative DMIs
  1.6 Aims of the thesis
2 The development of the medium-throughput cloning and the BRET assay pipeline for the experimental validation of predicted DMIs
  2.1 Preparation of the wild-type human ORFeome collection
  2.2 The assessment of the sensitivity of BRET assay
  2.3 Article I: FUBP1 is a general splicing factor facilitating 3' splice site recognition and splicing of long introns
    2.3.1 Supplementary material
  2.4 Article II: Systematic discovery of protein interaction interfaces using AlphaFold and experimental validation
    2.4.1 Supplementary material
3 Systematic domain-motif interaction interface and variant characterization using protein interaction profiling
  3.1 Development of domain-motif interface predictor tool
    3.1.1 The workflow of the DMI predictor
    3.1.2 The application of the tool on HuRI PPI dataset
  3.2 Integrating ClinVar mutation data with putative DMIs mapped on HuRI
  3.3 The data-driven approach to select disease-associated proteins and PPIs suitable for the experimental validation of DMIs
    3.3.1 Retesting of PPIs using BRET assay
    3.3.2 Testing the localization of the wild-type proteins and mutants using Bioluminescence Imaging
    3.3.3 Validation of DMI predictions
  3.4 The application of the strategy of the variant effect on PPIs
4 Conclusion and future perspectives
  4.1 Deciphering protein interaction interfaces using DMI predictor tool
  4.2 The application of DDI predictor and AlphaFold to map the PPI data with interaction interfaces
  4.3 Enhancing Predictive Accuracy of Variant Effects and Mutation Design through Positioning on Predicted AF-MM Interface Structures
  4.4 Improvement of the BRET assay to validate the predicted interfaces
  4.5 General outlook
Appendix
5 Appendix
  5.1 Protocols
    5.1.1 The medium-throughput cloning protocol
    5.1.2 The medium-throughput site-directed mutagenesis
    5.1.3 Figures
Bibliography

Chapter 1
Introduction

1.1 The Complexity of Human Genetic Variation

Genetic variation is a primary factor in evolution, driving the appearance of new phenotypes with various degrees of adaptability to environmental factors. Human genetic variation is defined as the diversity in DNA sequences and genetic characteristics among individuals within populations (Alberts et al. 2002). These variations arise from replication errors or spontaneous nucleotide alterations that occur in DNA replication during cell division.
In addition to these endogenous factors, exogenous influences such as radiation or chemicals can also cause changes in the genome.

Genetic variation occurs at different scales, from structural variants to single-point mutations. Structural variants are large changes of up to about 1 Mb that occur at the chromosome level (e.g. fragile sites) (Chaisson et al. 2019). In contrast, small variants range from short duplications, deletions, insertions and inversions to single nucleotide polymorphisms or SNPs (Nesta et al. 2021) (Figure 1.1).

Figure 1.1: Small and structural variants. On the left are the small variants like single nucleotide variants (SNVs), insertion and deletion (indel). On the right, examples of up to 1 Mb changes like inversion, deletion, insertion, duplication and translocation that constitute structural variants are shown. In each example, the top chromosome is the reference and the variation is highlighted and displayed below.

Small variants are far more abundant than structural variants: about 85 million SNPs have been found in the human genome, compared to 69,000 structural variants (Consortium et al. 2015). These mutations can affect coding and non-coding regions or splice sites of the genome. They can be inherited or occur de novo in the germline. Many of these mutations are linked to various diseases, as they can alter protein structure or function, disrupt cellular processes, and contribute to disease phenotypes (Visscher et al. 2012).

Over the last decade, the genomics field has rapidly advanced. The development and application of large-scale next-generation sequencing (NGS), such as whole genome sequencing (WGS) and whole exome sequencing (WES), has significantly expanded the capacity for comprehensive analysis of genetic variation. WGS sequences the entire genome of an organism, including coding and non-coding regions, and captures all genetic variations.
WES, in contrast, focuses solely on the exome, i.e. the coding regions of the genome. Both methods have advantages and limitations that should be taken into account. WES is more cost-effective than WGS and useful for identifying mutations that affect protein function. On the other hand, it misses variation in non-coding regions that can also impair gene regulation and cause disease. Additionally, it is less effective at finding structural variants than WGS.

To further expand our understanding of genetic variation, large-scale initiatives like the Exome Aggregation Consortium (ExAC) have emerged. ExAC contains the exomes of 60,706 individuals, providing an extensive catalog of genetic variants across whole exomes (Lek et al. 2016). With the integration of genome data, ExAC evolved into the Genome Aggregation Database, or gnomAD for short. gnomAD is the largest public open-access human genome allele frequency reference database. It contains exome sequencing data from 730,947 individuals and genome sequencing data from 76,215 individuals. Each individual's data is annotated with population information. gnomAD includes data from various populations such as African, Latino, East Asian, South Asian, European, and others. The collected sequencing data is computationally processed to identify variants like SNPs, insertions, deletions, and other types of genetic variation. To date, it houses 786,500,648 single nucleotide variants, 122,583,462 indels and over 1.2 million genome-level structural variants from a total of 807,162 individuals (gnomAD 2024). For each variant, the allele frequency is calculated as the number of times the variant appears divided by the total number of alleles observed at that position in the population. Researchers use this database to find threshold levels of variant frequency within and across different populations.
Knowing these levels helps in understanding whether a variant is common or rare globally or only within specific populations (Karczewski et al. 2020).

While gnomAD provides valuable information on the frequency of genetic variants, it is not sufficient to determine which variants might be disease-causing, because population data alone lacks clinical and phenotypic information. For this purpose, patient data is essential. Sequenced patient data allows researchers to prioritize genetic variants with potential associations with specific diseases. To prioritize variants, patient data is compared with population (control) data such as that provided by gnomAD. This comparison involves a frequency analysis that examines whether a variant is more frequent in patients with the disease than in healthy controls. Variants that are rare or absent in the general population but found in affected individuals may be prioritized for further study. Next, statistical analysis is applied to assess whether the frequency of a variant is significantly higher in patients with the disease, which identifies variants that are statistically associated with the disease. This is only possible when a sufficient amount of patient data is available. Patient information is also used to study inheritance patterns within families to see whether a variant co-segregates with the disease, which helps confirm its potential causative role. Finally, the variants found through this comparative analysis might undergo downstream functional studies.

The need for comprehensive clinical data to better understand genetic variants led to the development of patient databases. They are archives of variants reported in patients and submitted by clinical testing laboratories, research laboratories, locus-specific databases, expert panels, and other groups. The largest patient variant database currently known is ClinVar (Landrum et al. 2016).
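The allele-frequency definition and the patient-versus-control comparison described above can be illustrated with a small calculation. The following is only a minimal sketch: all counts are hypothetical, the function names are invented for illustration, and the one-sided Fisher's exact test is one common choice of test assumed here, not necessarily the analysis used by gnomAD, ClinVar or any cited study.

```python
from math import comb


def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Times the variant allele was observed / total alleles observed at the site."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    return allele_count / allele_number


def fisher_exact_greater(case_alt: int, case_ref: int,
                         ctrl_alt: int, ctrl_ref: int) -> float:
    """One-sided Fisher's exact test on a 2x2 table of allele counts.

    Returns the probability of seeing at least `case_alt` variant alleles
    among the case alleles if the variant were unrelated to disease status
    (hypergeometric upper tail).
    """
    n_case = case_alt + case_ref            # alleles sampled from patients
    n_alt = case_alt + ctrl_alt             # variant alleles overall
    total = n_case + ctrl_alt + ctrl_ref    # all alleles in the table
    upper = min(n_case, n_alt)
    tail = sum(comb(n_alt, k) * comb(total - n_alt, n_case - k)
               for k in range(case_alt, upper + 1))
    return tail / comb(total, n_case)


# Hypothetical counts: variant seen on 12 of 2,000 patient alleles
# and on 3 of 20,000 control alleles.
print(allele_frequency(12, 2_000))                  # frequency in patients
print(allele_frequency(3, 20_000))                  # frequency in controls
print(fisher_exact_greater(12, 1_988, 3, 19_997))   # small p-value

# Hypothetical ultra-rare case: a single carrier allele among patients.
print(fisher_exact_greater(1, 999, 0, 1_000))       # 0.5, no evidence either way
```

The last call shows why a sufficient amount of patient data is needed: with one observed variant allele and equally sized groups, the test cannot distinguish an association from chance.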
ClinVar is maintained at the National Center for Biotechnology Information (NCBI) within the National Library of Medicine and National Institutes of Health. Submissions to ClinVar must include a description of the variant(s), the interpreted condition, the clinical significance, an optional mode of inheritance, and supporting evidence.

Variants in ClinVar are classified based on the available evidence, including genetic studies, population frequency data, computational predictions, functional assays, and clinical observations. Pathogenic variants show a statistical association with disease in large studies, evidence of segregation with disease in families, deleterious effects on gene or protein function in functional studies, and consistent clinical observations in affected individuals. Benign variants are those that have been found frequently in healthy individuals. Variants that fall in between, owing to factors such as low frequency, insufficient population data, lack of functional studies or conflicting evidence, are classified as variants of uncertain significance or VUS. Currently, ClinVar stores over 4 million submitted records and 2,966,675 genetic variants (Landrum et al. 2016; ClinVar Miner 2024). Of these, 1,527,893 variants are VUS (Figure 1.2).

Figure 1.2: The variant distribution in ClinVar database. The bar plot on the left displays the total number of all variants (black) and the number of VUS (Variants of Uncertain Significance) (gray). On the right, the bar plot illustrates the distribution of various types of variants present in ClinVar for both the 2022 and 2023 versions.

Despite the remarkable advancements in sequencing technologies and increased data availability for research, the main challenge persists: the vast majority of variants remain poorly characterized. The number of such variants continues to grow exponentially each year, posing significant challenges for research and clinical practice.
The accumulation of uncharacterized variants without corresponding progress in variant interpretation limits our understanding of disease mechanisms. Consequently, clinicians may struggle to interpret genetic test results, leading to potential delays in diagnosis and in the development of precision medicine, where the treatment is tailored to each patient. For patients, this uncertainty can result in anxiety, unnecessary treatments and missed opportunities for early intervention. Why, then, is it so challenging to characterize the variant effect? To address this question, it is important to understand the underlying reasons behind the complexity of variant interpretation. I will discuss them in the next subsection.

1.1.1 Factors contributing to the complexity of variant interpretation

The complexity of variant interpretation can be attributed to three primary reasons. First, the architecture of most diseases is highly complex, involving multiple genes. This means that a single variant may not have a straightforward impact on disease risk. Instead, its effect might be modulated by other factors like the interactions between gene products, the presence of other genetic variants or environmental factors. Even in Mendelian disorders, where one gene is the primary cause of the disease, the severity, onset and progression of a disease can still be influenced by additional genetic factors.

Second, many uncharacterized variants occur infrequently in the human population. These rare variants, which are carried by only a small number of individuals, present a significant challenge. Additionally, every healthy individual carries on average about 60 de novo mutations (DNMs) that arise spontaneously and are not inherited from either parent. These so-called ultra-rare variants can be extremely difficult to interpret (Figure 1.3).
The low frequency of rare and ultra-rare variants results in small sample sizes, reducing the statistical power of tests. Statistical power refers to the ability of a test to detect a true effect when it exists. For instance, if a rare variant is present in only 0.1 % of the population, then in a study of 1000 participants only about one person is expected to carry that variant. The statistical power will be too low because such a small sample makes it hard to distinguish a true association from random fluctuations in the data. With too few individuals carrying a rare variant, it becomes nearly impossible to apply standard statistical tests (e.g. Chi-square, Fisher's exact tests) effectively.

Figure 1.3: Schematic representation of the challenge in statistically associating de novo variants with disease. Every healthy individual on average carries about 60 new coding mutations, most of them being ultra-rare in the human population. The low occurrence of these mutations makes it hard to do statistical association and to discriminate pathogenic from benign variants.

Third, the impact of genetic variants on protein function can vary widely, spanning from benign alterations with no distinct effect to mutations that cause severe dysfunction of the protein or disease (Figure 1.4). The traditional view, in which a mutation destabilizes the protein and leads to loss of function (LoF), has evolved. For example, nonsense mutations introduce a premature stop codon that leads to a truncated version of the protein, which is often misfolded or destabilized. Such a mutation might cause a severe phenotype. For instance, a nonsense mutation in the DMD gene results in the dysfunctional protein mutant Glu1157TER, causing muscle degeneration and Duchenne Muscular Dystrophy (DMD) (Bulman et al. 1991).
Likewise, frameshift mutations, which result from deletions or insertions that alter the gene's reading frame, can cause a total LoF due to extensive missense sequences followed by premature termination.

Figure 1.4: Overview of the various effects of mutation on PPIs. Mutations can have various effects: they may have no impact, destabilize and unfold the protein, cause a gain-of-function effect or partially affect protein function by disrupting PPI.

Although these types of mutations are known to be the most detrimental and to cause disease, the reality is more complex. For example, different mutations in the same gene can cause distinct clinical outcomes (Zhong et al. 2009). This complexity arises because genes and gene products do not function in isolation but constantly interact with each other, building interactome networks (Vidal et al. 2011). Goh et al. (2007) proposed that some mutations cause a partial loss of function by perturbing the interactions within this complex network. Studies suggest that about 30-60 % of all pathogenic missense variants are destabilizing, whereas up to 30 % of pathogenic missense mutations disrupt PPIs, affecting some while leaving the others intact (Sahni et al. 2013). Given this information, how can we efficiently use it in variant characterization and learn about the potential mechanism of the disease?

1.1.2 Comparative PPI profiling as the strategy to interpret variant effect

Previous studies highlight the potential of using protein-protein interaction (PPI) data to interpret variant effects (Zhang et al. 2009; Vidal et al. 2011; Sahni et al. 2013). These interactions are represented as graphical networks, where proteins are displayed as "nodes" and the interactions between them as "edges".
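The network view described above, together with the observation that a mutation may disrupt some interactions while leaving others intact, can be made concrete in a few lines. This is only an illustrative sketch under stated assumptions: the protein names P1 to P4 are hypothetical placeholders, both function names are invented here, and the three-way labelling is a simplification that ignores gained interactions and the quantitative readouts of real interaction assays.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected PPI network: protein (node) -> set of partners (edges)."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def classify_variant(wild_type_partners, variant_partners):
    """Label a variant by which wild-type interactions it retains.

    'no change'     : every wild-type interaction is still detected
    'complete loss' : none of the wild-type interactions is detected
    'partial loss'  : some interactions are lost while others stay intact
    """
    wild_type_partners = set(wild_type_partners)
    if not wild_type_partners:
        raise ValueError("the wild-type protein has no interactions to compare")
    retained = wild_type_partners & set(variant_partners)
    if retained == wild_type_partners:
        return "no change"
    if not retained:
        return "complete loss"
    return "partial loss"


# Hypothetical toy network: P1 interacts with P2, P3 and P4.
wild_type = build_network([("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")])
print(sorted(wild_type["P1"]))                           # ['P2', 'P3', 'P4']

# A hypothetical variant of P1 that no longer binds P3.
print(classify_variant(wild_type["P1"], {"P2", "P4"}))   # partial loss
print(classify_variant(wild_type["P1"], set()))          # complete loss
```

Storing partners as sets keeps the comparison between a wild-type protein and its variant down to a set intersection, which is convenient when many variants of the same protein are compared against one reference profile.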
One promising experimental strategy that leverages this PPI network-based data is edgotyping, which aims to systematically characterize variants and reveal molecular mechanisms potentially underlying disease (Sahni et al. 2015; Siwei Chen et al. 2018; Wierbowski et al. 2018; Starita et al. 2018; Fragoza et al. 2019). The idea is to test the effect of benign, pathogenic, and uncharacterized mutations within a protein on the protein's interactions with its binding partners and to compare it with the wild-type interactions. This comparison helps to identify "edgetic" effects, where a mutation disrupts some interactions while leaving the others intact. Thus, by comparing the PPI profiles of benign, pathogenic, and uncharacterized variants, we can predict the pathogenicity of an uncharacterized variant based on whether the interactions it perturbs are similar to those perturbed by the pathogenic variant. Moreover, the perturbation data might be informative about potential mechanistic causes of the disease. This makes the strategy powerful for elucidating the functional consequences of variants on PPIs and for gaining insight into variant contributions to disease that go beyond traditional sequence-based studies.

Available human PPI datasets

To perform edgotyping, access to comprehensive and reliable protein interaction datasets is crucial. Over the past 15 years, high-quality human PPIs have been generated using large-scale approaches such as yeast two-hybrid (Y2H) and affinity purification coupled with mass spectrometry (AP-MS). These approaches have become particularly instrumental in mapping PPIs and generating large-scale reference protein interactome datasets (Rolland et al. 2014; Luck et al. 2020; Huttlin et al. 2021).
Rolland et al. (2014) made a significant contribution to this field by presenting an expanded version of the human interactome, HI-II-14, consisting of about 14000 distinct protein interaction pairs determined and confirmed by three binary PPI assays. This dataset has been instrumental in research focused on variant characterization. The authors further investigated the overall biological relevance of this PPI dataset by assessing mutations associated with human disorders in comparison to common variants, which showed no functional consequences on biophysical interactions. They showed that more than 55 % of the 107 tested PPIs were perturbed by at least one disease-associated variant. For example, the A129T mutation in the AANAT protein is known to be associated with delayed sleep phase syndrome. It specifically disrupted the interaction with BHLHE40, which is involved in the regulation of circadian rhythm. Another study overlapped HI-II-14 with nearly 2000 mutations and identified 298 disruptive variants affecting almost 700 human protein interactions (Fragoza et al. 2019).

In 2020, the Human Reference Interactome (HuRI) project unveiled the largest binary interaction map of human proteins. The project employed the Y2H technique, in which two proteins are co-expressed in yeast cells and a reporter is activated if they physically interact, and performed a total of 3 billion individual tests. This monumental effort generated a dataset of over 50000 high-confidence binary interactions between approximately 17000 human proteins. The depth and scale of this study significantly enhanced our understanding of the human interactome and provided valuable datasets for elucidating the functional impact of patient variants on protein-protein interactions (Luck et al. 2020). Luck et al. (2020) also showcased the application of HuRI in elucidating the mechanistic effect of missense variants on PPIs within specific disease contexts.
Mutations in PNKP have been associated with microcephaly, seizures and developmental delay. They showed that the pathogenic mutation Glu326Lys in PNKP disrupted the interaction with TRIM37, which is predominantly expressed in the brain. Together, these studies demonstrate that systematically generated human interactome maps may significantly help in variant characterization.

In parallel, the BioPlex project generated a reference human interactome using the AP-MS technique. This study involved the systematic purification of proteins together with their bound partners from cells, followed by mass spectrometric analysis of the protein complexes. BioPlex mapped about 120000 direct and indirect protein interactions (Huttlin et al. 2021; Huttlin et al. 2017). It was also employed for variant characterization, identifying how mutations affect not only direct interactions but also complexes relevant to diseases. As a result, it offers a broader view of the functional impact of variants on protein complexes than Y2H. However, the Y2H-detected interactome might be more useful for studying variant effects, as it provides the binary protein interactions essential for comparative PPI profiling. This approach tests how a mutated protein affects each specific interaction, enabling the creation of mutation profiles and their comparison with those of the wild-type protein and its partners (Idrees et al. 2024).

Application of edgotyping strategy

This approach has been applied successfully several times (Sahni et al. 2015; Siwei Chen et al. 2018; Wierbowski et al. 2018; Starita et al. 2018; Fragoza et al. 2019). For example, Sahni et al. (2015) generated interaction profiles for 460 mutant proteins and their 220 wild-type counterparts and found 521 perturbed interactions out of 1,316 PPIs using the yeast two-hybrid (Y2H) interaction assay.
This huge experimental effort led to the identification of 197 mutations, of which 26% caused a complete loss of interaction, 31% were edgetic and 43% caused no change in PPIs. Later, Fragoza et al. (2019) employed the same assay and found that 298 of 1676 tested missense population variants disrupted 669 human PPIs. They also used follow-up experiments to further elucidate the effect of mutations on protein function. Taken together, these studies showcase how shared disruption profiles can be used to prioritize candidate disease-associated mutations.

Current challenges in edgotyping

While this approach holds significant potential for addressing the issue of uncharacterized variants, it is still too expensive and laborious if it is based entirely on experiments, given the number of VUS that need to be characterized. Current tools such as PolyPhen-2 and MutPred2 predict variant pathogenicity primarily from metrics such as conservation scores or sequence-based features related to protein structure and function. They therefore fail to capture the effect of mutations occurring in less conserved yet functional regions, or of rare and ultra-rare variants with low conservation scores (Sunyaev et al. 1999; Adzhubei et al. 2010; Livesey et al. 2022). The recently developed AlphaMissense tool excels in performance but is also less effective for variants in these regions (Cheng et al. 2023). Given these limitations, comparing the edgetic profiles of benign and pathogenic variants with those of VUS might not always be sufficient to identify functional variants potentially contributing to the disease. How can PPI profiling be improved for more effective variant characterization?

To predict the variant effect on PPIs, one ideally needs to know the exact residues that constitute the protein interaction interfaces (see sections 1.2 and 1.3).
Access to this information would be extremely useful, as it helps pinpoint exactly where and how a mutation might disrupt an interaction and elucidate the mechanistic effect and potential impact of a variant on disease development. This assumption is supported by a study that reported a significant enrichment of disease mutations on PPI interfaces (Wang et al. 2012). Although protein interaction datasets are available, they carry only binary information, while information on PPI interfaces is currently missing.

Various experimental approaches such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, cryo-electron microscopy (cryo-EM) and protein fragmentation exist to detect PPI interfaces at different resolutions (Martino et al. 2021). However, experimental methods are labor-intensive and time-consuming. Indeed, only a small fraction of interactions, about 4 % in the HuRI dataset, have solved structures (Luck et al. 2020). Given the limitations of experimental studies, computational methods to predict PPI interfaces have gained traction in recent years. The idea is to increase the predictive power and to annotate PPI data with putative PPI interfaces, which in turn accelerates their experimental validation (see section 1.4). Finally, this information can be used for the PPI profiling described earlier.

1.2 Modular architecture of proteins

The prediction of PPI interfaces requires an understanding of protein architecture to identify functional sites, along with databases of known functional sites to enhance the accuracy of these predictions. Both are discussed in this section.

Proteins are complex molecules that play a crucial role in cellular biological processes. Since the advent of molecular biology, we have learnt that proteins do not function in isolation but in constant interaction with one another or with other molecules (e.g. RNA, DNA), forming complex networks.
These interactions are mediated by PPI interfaces formed by specific regions within protein sequences, widely known as functional modules (Campbell et al. 1991). These modules can broadly be categorized into regions with and without a defined structure. The defined structures, commonly known as globular domains, are regions in protein sequences that often fold independently into a stable tertiary structure (Copley et al. 2002; Björklund et al. 2005). Regions that lack a defined structure are termed intrinsically disordered regions (IDRs), where short linear motifs (SLiMs) are typically found (Tompa et al. 2014; Davey et al. 2011; Davey et al. 2012). These two types of functional modules are explained further in the following sections.

The modularity of proteins is a crucial aspect of protein evolution and functionality (C. Vogel et al. Year; Han et al. 2004). This modularity allows different modules to be combined into proteins with multiple properties and functions, facilitating the emergence of new traits and adaptation to environmental changes (Apic et al. 2001). Around 65-70 % of proteins in eukaryotic proteomes are composed of multiple modules (Han et al. 2004). One prominent example is the well-characterized Nuclear factor NF-kappa-B p105 subunit (NFKB1), a multifunctional hub protein and transcription factor involved in various cellular processes, such as transcriptional regulation, immune response, cell proliferation and survival (Gilmore 2006; Hayden et al. 2008). It has been implicated in a broad range of cancers, neurodegenerative diseases, and inflammatory and autoimmune diseases (Gilmore 2006; Hayden et al. 2008; Taniguchi et al. 2018). The 968 amino acid-long protein starts at the N-terminus with the Rel homology domain (RHD), followed by ankyrin repeats and a death domain (DD) at the C-terminus (Williams et al. 2001; Glover 2004; J. Wang et al. 2023).
The disordered parts of the protein harbor many known motifs such as nuclear export and nuclear localization signals, docking motifs and kinase modification motifs (Koonin 1996; Chen et al. 1996; Rodríguez et al. 2000; Hsu 2007). Thus, the modularity of NFKB1 enables it to interact with many different partners and to function in various cellular processes, and it exemplifies the complexity of the phenotypes that can arise from the interplay of functional modules (Figure 1.5).

Figure 1.5: Modularity of Nuclear factor NF-kappa-B p105 subunit. Domains and motifs as functional modules schematically illustrated in NFKB1. The top panel shows the modularity of NFKB1. The numbers above and below the boxes denote the boundaries of domains. The bottom panel displays the full-length structure of NFKB1 predicted by AlphaFold. Domains and motifs in the structure are colored according to their colors in the top panel. RHD stands for Rel Homology Domain, Ank stands for Ankyrin repeats, and DD stands for death domain.

1.2.1 Folded domains

Biological role of domains

The foundational understanding of protein domains began with the work of structural biologists Linus Pauling and Robert Corey in the 1950s. Their research identified alpha helices and beta sheets as secondary structures within proteins. Wetlaufer and Ristow (1973) introduced the concept of protein domains or functional modules in their review of X-ray crystallography studies of proteins such as lysozyme and immunoglobulins (Blake et al. 1965; Freedman et al. 1966). They associated domains with regions, typically ranging from 50 to 350 amino acids in length, that are capable of folding autonomously. This understanding was facilitated by the development of experimental methods like crystallography and NMR, which accelerated the identification and classification of these protein modules, including domains commonly found in the proteome such as WD40, SH2, SH3, ANK, RING, PH and PDZ (Copley et al. 2002).
Domains fold into three-dimensional (3D) structures to achieve thermodynamic stability, positioning the hydrophobic residues in the protein core while exposing hydrophilic residues on the surface (Dill et al. 2012). This ensures that the native conformation is the most energetically stable for the domain. Domains form the functional units of a protein, enabling it to interact with partners and perform cellular functions. For instance, the protein KLHL41 is involved in the ubiquitin-proteasome system, which regulates protein turnover and degradation and thereby maintains biological processes like muscle development and function (Ramirez-Martinez et al. 2017; Yuen et al. 2020). KLHL41 consists of three main domains: the Broad-complex, Tramtrack, Bric-a-brac (BTB) domain, the BTB and C-terminal Kelch (BACK) domain, and the Kelch repeats (Figure 1.6). The Kelch repeats of KLHL41 form a beta-propeller structure that recognizes substrates such as nebulin (NEB), a giant muscle protein that acts as a molecular ruler for filament length and regulates actin-myosin cross-bridge cycling during skeletal muscle contraction (Yuen et al. 2020). Upon binding, KLHL41 forms a complex with NEB, while the BTB domain of KLHL41 binds directly to cullin 3 (Cul3), a scaffold protein in the Cullin-RING ubiquitin ligase (CRL) complex, and can dimerize with itself to provide more stability to the complex. Additionally, the BACK domain at the C-terminus of KLHL41 supports and stabilizes the complex (Stogios et al. 2004; Dhanoa et al. 2013; Gupta et al. 2014). Once the complex is formed, ubiquitin molecules are transferred to the substrate, marking the substrate protein for degradation by the proteasomal system. This case illustrates how linking different domains together in one polypeptide chain allows KLHL41 to maintain protein homeostasis.

Figure 1.6: Domain architecture of Kelch-like protein 41 (KLHL41).
The top panel shows the schematic domain organization of KLHL41 (not drawn to scale). The numbers above and below the boxes denote the boundaries of domains. The bottom panel shows the putative full-length structure of KLHL41 as predicted by AlphaFold. Domains and motifs in the structure are colored according to their colors in the top panel. BTB stands for the Broad-complex, Tramtrack, Bric-a-brac domain and BACK for the BTB and C-terminal Kelch domain.

While some domains can achieve stability independently or through dimerization, others require assistance from zinc or other metal ions, or from disulfide bridges. For instance, the zinc finger domain maintains its conformation by binding to zinc ions. These zinc ions typically interact with cysteine and histidine residues, acting as an anchor that reduces the flexibility of the protein chain and supports the stable 3D structure (Berg et al. 1997; Klug 2010). This stabilization is important for protein functions such as DNA binding and the regulation of gene expression. A good example is the IKZF1 protein, also known as Ikaros, a zinc finger protein and transcription regulator that plays a crucial role in lymphocyte differentiation and function. It contains four C2H2-type zinc finger domains at the N-terminus that bind to zinc ions. The stabilized structure interacts with DNA sequences in the promoter regions of target genes and regulates their transcription.

Different combinations of protein domains exist widely across proteomes due to natural selection acting on these modular units to create diverse molecular machinery (Doolittle 1995). Gene duplication and shuffling by recombination are likely to be the driving forces of protein evolution and of the complexity of the proteome. While gene duplication leads to the emergence of similar domains in unrelated proteins, recombination enhances versatility and allows proteins to specialize in specific cellular functions tailored to an organism's needs (Bagowski et al.
2010). For example, the PDZ domain is a structurally conserved module of 90-100 residues, found in a vast array of proteins involved in diverse signaling pathways and cell polarity (Harris et al. 2001; Lee 2010). About 270 PDZ domains are distributed over 150 proteins (Wang et al. 2010; Velthuis et al. 2011). Despite their conserved structural fold, these domains exhibit sequence divergence that contributes to functional specificity. For instance, the PDZ domains of the protein PSD-95 recognize C-terminal motifs on its target proteins, whereas the PDZ domain of the cell polarity protein PAR6 was shown to interact with internal ligands, and other PDZ domains can form homodimers (Kornau et al. 1995; Zhang et al. 2009; Fouassier et al. 2000).

Protein domain databases

Historically, sequencing projects made a major contribution to the discovery of protein domains. This effort helped to identify conserved regions across different proteins. With the power of bioinformatics, domains became identifiable using Hidden Markov models (HMMs). An HMM is a statistical model used to classify protein families based on multiple sequence alignments (MSAs) and to detect sequence homology for the identification of conserved regions within proteins (Bystroff et al. 2008). Thus, HMMs became the main approach for collecting the data and generating domain databases.

For example, the Protein families database (Pfam) and the Simple Modular Architecture Research Tool (SMART) compute HMMs to build protein and domain families based on sequence similarity (R. D. Finn et al. 2014; Schultz et al. 1998; Letunic et al. 2021). While the SMART database relies on manually curated sequence alignments, which help to define domain boundaries more precisely, Pfam employs an automated approach and covers a broader range of domains (Paysan-Lafosse et al. 2023).
1.2.2 Intrinsically disordered regions

Biological roles

In the mid-20th century, protein research primarily focused on folded and ordered proteins. Studies on enzymes, in which denatured proteins lose their catalytic activity, demonstrated the relationship between protein structure and function (Northrop 1930). It was assumed that a protein requires a native folded structure to perform biological functions. Thus, the protein structure-function paradigm was established, while the abundance and functional role of disordered regions in eukaryotic proteins went unrecognized. However, unexpected behavior of proteins, such as missing electron density in X-ray crystallography studies, increased sensitivity in in vitro proteolysis experiments and solubility issues during protein purification, led to a reassessment of the structure-function paradigm. Pioneering work by Dunker (2002) and Uversky (2005) revealed that disordered regions are common in eukaryotic proteins. Further studies challenged the long-standing belief that protein functionality is strictly dependent on a well-defined, folded protein structure (Tompa 2002; Dunker et al. 2002; Wright 1999; Iakoucheva et al. 2002; Uversky et al. 2005; Uversky 2014). It was shown that disordered regions of many regulatory and signaling proteins can undergo disorder-to-order transitions upon binding to their targets, which adds a layer of regulatory control and allows for complex interactions (Uversky 2014). As a result of these findings, the scientific community began to recognize the importance of protein disorder, leading to a significant shift in the understanding of protein biology. Consequently, the paradigm shifted to the "disorder-function paradigm".
Intrinsically disordered regions (IDRs) lack a persistent 3D structure under physiological conditions, continuously adopting a wide range of dynamic conformations and forming transient secondary structures (Wright 1999; Tompa 2011; Davey et al. 2019). These regions are abundant in eukaryotic proteins, with predictions indicating that they cover 30-40% of the residues in eukaryotic proteomes (Tompa 2012; Van Roey et al. 2012). IDRs also contribute significantly to the diversity and versatility observed in organism evolution (Davey et al. 2015; Babu et al. 2012; Weatheritt et al. 2012). In addition, they often overlap with post-translational modification (PTM) sites, contributing to functional versatility (Tompa 2012; Tompa et al. 2014). These modifications can alter the conformation, stability and interactions mediated by IDRs (Uversky 2014). Due to their dynamic behavior, IDRs are commonly involved in transient interactions regulating signal transduction processes (Dyson 2005; Davey et al. 2019). A crucial finding was that IDRs are enriched in functional interaction modules, such as short linear motifs (SLiMs) mediating different multivalent interactions, which are discussed in the next section.

As IDRs play a significant role in signaling and cell regulation, they are tightly controlled, and mutations in disordered sites have been associated with human diseases, including cancer, diabetes, and cardiovascular and neurodegenerative disorders (Iakoucheva et al. 2002; Babu et al. 2011). Vacic et al. (2012) investigated disease-causing missense mutations in ordered and disordered regions and compared them to neutral variants observed in healthy individuals, which do not cause disease phenotypes. They found that over 20 % of pathogenic variants reside in intrinsically disordered regions and interfere with their functions. In addition, the study by Peng et al. (2012) emphasizes the importance of understanding the context-dependent behavior of IDRs.
They highlighted that the functional outcome of missense variants in these regions could vary depending on the cellular environment and interaction partners (Peng et al. 2012).

Despite the biological relevance of IDRs, only a small fraction of them has been characterized (M. Gouw et al. 2017; Davey et al. 2019). Experimentally, defining disordered regions remains challenging. Due to the dynamic structures of IDRs, sophisticated methods such as NMR, small-angle X-ray scattering (SAXS), circular dichroism (CD) or Förster resonance energy transfer (FRET) are required (Felli et al. 2015; Holmstrom et al. 2016). Moreover, these regions function in a context-dependent manner that depends on the cellular milieu, including pH, PTMs, and the presence of other proteins (Oldfield et al. 2014; Wright 2015). These challenges necessitate integrative approaches that combine experimental data with computational predictions. As a result, various computational approaches have been developed to predict IDRs in proteins, leading to the generation of several databases containing putative and experimentally verified IDRs.

Databases of disordered proteins and tools to predict disorder

DisProt is a comprehensive, manually curated repository of experimentally verified proteins, or regions within proteins, that lack a stable three-dimensional structure under physiological conditions. The DisProt database annotates disorder and molecular functions curated from experimental studies. More than 2,000 eukaryotic intrinsically disordered proteins (IDPs) and 6,000 IDRs are documented in this database. IDRs possess distinctive characteristics that set them apart from structured regions.
One notable characteristic is the enrichment in polar, hydrophilic residues coupled with the depletion of hydrophobic amino acids. This composition helps IDRs to stay soluble and flexible in the disordered state and leaves them incapable of forming sufficient inter-residue interactions within a protein. To discriminate between ordered and disordered sequences, the Intrinsically Unstructured Protein Predictor tool (IUPred) estimates the likelihood of interaction formation using a statistical interaction potential (Z. Dosztányi 2018). These potentials are then used to estimate an energy for each residue in the protein sequence. Residues estimated to have the most favorable energies are predicted to be ordered, while those with unfavorable energies are predicted to be disordered (Mészáros et al. 2009).

Recently, a powerful new tool, AlphaFold2 (AF2), has emerged, predicting protein structures with an accuracy comparable to that of experimental structures (Jumper et al. 2021). AF2 predicts a full-length protein structure and generates a confidence score termed the predicted Local Distance Difference Test (pLDDT). The pLDDT score is calculated for each residue in the protein structure and ranges from 0 to 100. A high score means greater confidence in the accuracy of the predicted position of a residue. Interestingly, pLDDT was found to correlate with the tendency towards disorder and can therefore be used as a potential feature to predict disorder (Wilson et al. 2022). Another feature, the solvent-accessible surface area (SASA) of each residue, is also correlated with the disorder propensity of residues. One study combined pLDDT and SASA, smoothed over a 20-residue window, and found that this approach outperformed IUPred2A, the latest version of the IUPred predictor, in their evaluation (Akdel et al. 2022). AF has also been used for other applications, which are described in section 1.4.
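The window-based smoothing of per-residue pLDDT scores mentioned above can be sketched as follows. This is a minimal illustration and not the procedure of Akdel et al. (2022): the function names, the plain moving average and the cutoff of 50 are assumptions made for this example, and SASA is left out (it could be smoothed in the same way).

```python
def smooth_scores(scores, window=20):
    """Moving average over per-residue scores.

    Each residue is averaged over a window of `window` residues around it;
    the window is truncated at the sequence ends.
    """
    n = len(scores)
    smoothed = []
    for i in range(n):
        raw_start = i - window // 2          # may be negative near the N-terminus
        end = min(n, raw_start + window)     # exclusive end, clipped to the sequence
        start = max(0, raw_start)
        smoothed.append(sum(scores[start:end]) / (end - start))
    return smoothed


def predict_disorder(plddt, window=20, cutoff=50.0):
    """Flag residues whose smoothed pLDDT falls below `cutoff` (hypothetical value)."""
    return [score < cutoff for score in smooth_scores(plddt, window)]


# Toy input: 30 confidently predicted residues followed by 30 low-confidence ones.
toy_plddt = [90.0] * 30 + [30.0] * 30
flags = predict_disorder(toy_plddt)
print(sum(flags), "of", len(flags), "residues flagged as putatively disordered")
```

Smoothing before thresholding avoids flagging isolated low-confidence residues inside an otherwise well-predicted domain. Both the window size and the cutoff would have to be calibrated against experimentally annotated disorder.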
1.2.3 Short linear motifs

Biological roles

Short linear motifs (SLiMs) represent dynamic functional sequences, ranging from 3 to 23 amino acids in length. On average, four residues are conserved in the motif consensus sequence, while the remaining positions are completely variable (Davey et al. 2012). Motifs typically lie in IDRs or, more rarely, in disordered loops of structured regions and possess regulatory functionality such as directing ligand binding, providing docking sites for enzymes and targeting proteins to specific subcellular locations (Davey et al. 2012; Van Roey et al. 2014).

The concept of SLiMs appeared in the late 20th century. In 1980, Aaron Ciechanover, Avram Hershko and Irwin Rose identified degradation motifs, or degrons, that direct target proteins to the ubiquitin-proteasome system for degradation. Their groundbreaking work earned them a Nobel Prize in 2004 and laid the foundation for discovering new motifs. In 1990, Tim Hunt identified targeting signals such as the KDEL endoplasmic reticulum retention motif, and the positively charged nuclear and targeting sequences, while Pawson et al. (1986) discovered that Src domains recognize motifs within protein partners, interactions that regulate signaling pathways. These studies highlighted the importance of motifs in protein function and regulation, opening avenues for further exploration and discovery in molecular biology and cellular physiology.

The discovery and validation of SLiMs have been performed with various experimental methods, such as traditional low-throughput X-ray crystallography and NMR as well as high-throughput systematic approaches such as peptide microarrays and phage display. Along with experimental discoveries, computational approaches have also significantly advanced motif research. The motif detection techniques are discussed in more detail in sections 1.4 and 1.5.
It is estimated that more than 100,000 binding motifs exist in the human proteome, with many being uncharacterized (Tompa et al. 2014). The discovered motifs are categorized into six classes based on their biological roles: ligand-binding sites, modification sites, targeting signals, degrons, docking sites and cleavage sites. Modification motifs include PTM sites such as phosphorylation sites. Targeting signals like nuclear localization signals (NLS) are involved in protein trafficking to specific cellular compartments. Ligand-binding motifs interact with binding partners to form transient signaling complexes. Docking motifs facilitate substrate recognition by enzymes without affecting the active site of these enzymes. Cleavage motifs are recognized by proteases that cleave the protein at the cleavage site (Van Roey et al. 2014). Another functional type of motif is the degron. Degrons, such as the one found in the protein AFF4 (Figure 1.7), are important for protein regulation. Specific ubiquitin ligases like SIAH1 recognize these motifs and tag the target proteins with ubiquitin molecules. This tagging process, known as ubiquitination, marks the protein for degradation by the proteasome system (Oliver et al. 2004).

Figure 1.7: Degron motif on AFF4. The top panel shows the schematic domain organization of AFF4 (not drawn to scale). The numbers above and below the boxes denote the boundaries of domains. AFF4 contains a degron motif that is recognized by ubiquitin ligase. The bottom panel shows the putative full-length structure of AFF4 as predicted by AlphaFold 2. Domains and motifs in the structure are colored according to their colors in the top panel. CHD stands for C-terminal homology domain.

Moreover, SLiMs mediate transient regulatory and signaling interactions involved in biological processes like cell signaling, protein homeostasis and the cell cycle.
For instance, the 14-3-3 binding motif facilitates the interaction of diverse proteins with 14-3-3 domains, thereby regulating their subcellular localization and activity. Another example is the SH3-binding motif (PXXP), found in numerous signaling proteins, which mediates interactions with SH3 domains of other proteins, facilitating the assembly of signaling complexes in response to extracellular stimuli (Davey et al. 2012; Van Roey et al. 2014).

Additionally, viruses can use SLiM mimicry to interfere with the host cellular machinery and thereby repurpose the host cell for pathogen reproduction (Davey et al. 2011; Uyar et al. 2014). For example, the Nsp3 protein of Eastern equine encephalitis virus (EEEV) contains the motif LITFD, which mimics the classical clathrin box motif. This mimicry allows Nsp3 to interact with the beta-propeller repeat of the N-terminal domain of clathrin (CLTC). This interaction disrupts clathrin-mediated receptor trafficking and interferes with signaling processes, potentially suppressing antiviral signaling or altering cellular functions to create a more favorable environment for viral replication (Mihalič et al. 2023).

As opposed to globular domains, SLiMs are short functional peptides and take up very little sequence space. Consequently, IDRs can be densely packed with multiple SLiMs, which can sometimes overlap and act as regulatory switches. There are different switch mechanisms. One of them switches the specificity of a protein for its binding partners through modification-dependent modulation of the intrinsic affinity of the motif. Integrin beta 3, for example, is a cell surface receptor involved in cell adhesion and cell signaling (Tadokoro et al. 2003). The NPxY motif in the disordered tail of the integrin beta 3 subunit preferentially interacts with the PTB domain of talin, which also binds the membrane-proximal region of the tail, as is necessary for integrin activation (Wegener et al. 2007).
However, phosphorylation of the motif, particularly at position Tyr747, switches the specificity to the PTB domain of Dok1. Dok1 binds exclusively to the central motif and does not interact with the membrane-proximal region of the integrin tail that is necessary for activation. This mechanism therefore ensures control over integrin-mediated cellular processes (Oxley et al. 2008).

Computational and experimental studies have shown that pathogenic mutations in disordered regions often affect SLiMs. Uyar and colleagues (2014) performed a proteome-wide analysis of disease-associated mutations with a focus on SLiMs. They utilized mutation data from healthy individuals and patients reported in databases such as the 1000 Genomes Project, the Catalogue of Somatic Mutations in Cancer (COSMIC) and Online Mendelian Inheritance in Man (OMIM) (Consortium et al. 2015; Forbes et al. 2011; Hamosh et al. 2005). Next, they mapped these mutations onto experimentally derived SLiMs and onto putative SLiMs obtained with the help of the IUPred tool, and compared the distribution of pathogenic and neutral mutations. The analysis revealed that disease-related mutations are significantly enriched in SLiMs within intrinsically disordered regions (Uyar et al. 2014). Additionally, mutations within SLiMs can disrupt motifs or create new ones. One study showed experimentally that pathogenic mutations created dileucine motifs, which often lead to clathrin binding that underlies disease aetiology (Meyer et al. 2018).

This accumulated evidence highlights the importance of SLiMs as a key aid to understanding the molecular mechanisms of diseases and underscores the need to integrate SLiM analysis into variant characterization studies.

Motif databases

While Pfam and SMART are valuable for predicting domain-involving interfaces, motif databases can help identify potential SLiM-mediated interfaces.
The Eukaryotic Linear Motif (ELM) resource is a comprehensive database developed by Toby Gibson and colleagues in the early 2000s. The ELM database provides researchers with a catalog of manually curated, experimentally validated SLiMs and with tools for motif prediction, with the main focus on the annotation and detection of SLiMs (Puntervoll et al. 2003). Each record provides extensive information on the motif sequence pattern, functional role, interaction partners, the biological processes it influences and the experimental evidence. Additionally, the database has a search interface that allows users to query motifs based on sequence pattern, protein identifier and species.

The ELM database categorizes motifs into functional types, classes and instances. There are 6 functional types of SLiMs: ligand-binding (LIG, e.g. WW1 binding motif), modification (MOD, e.g. CK1 phosphorylation site), targeting (TRG, e.g. NLS classical nuclear localization signal), docking (DOC, e.g. USP7-binding motif), degradation (DEG, e.g. Siah binding motif) and cleavage (CLV, e.g. NRD cleavage site). These types are grouped into 356 ELM classes based on the binding domain of a partner, specific sequence characteristics, targeted subcellular localization and other functional properties (Kumar et al. 2024). These classes incorporate 4283 individual ELM instances manually curated from 4274 scientific publications and 2749 motif-partner interactions (Kumar et al. 2024). Each instance is annotated with details on the evidence, such as the experimental method used to determine and characterize the discovered motif (M. Gouw et al. 2017). ELM curators systematically described each ELM class using a regular expression (RegEx) format to define the key residues important for the binding affinity and specificity of the motif (Davey et al. 2011).
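The idea of describing a motif class by a regular expression and scanning sequences for matches can be sketched as follows. This is a simplified illustration only: the pattern is a hypothetical stand-in written for this example (a PxAxVxP-like consensus with "." for any amino acid) and is not copied from an ELM class definition, and the function name and toy sequence are made up.

```python
import re

# Hypothetical, simplified motif pattern; "." matches any amino acid.
DEGRON_LIKE = re.compile(r"P.A.V.P")


def find_motif_hits(sequence, pattern):
    """Return (start, end, matched_peptide) for every match, 1-based and inclusive.

    Wrapping the pattern in a lookahead lets the scan report overlapping
    matches, which matters because SLiMs can overlap within IDRs.
    """
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    hits = []
    for match in lookahead.finditer(sequence):
        start = match.start(1) + 1   # convert 0-based index to 1-based position
        end = match.end(1)           # exclusive 0-based end equals inclusive 1-based end
        hits.append((start, end, match.group(1)))
    return hits


toy_sequence = "MKTPSAAVMPLLG"  # made-up sequence for illustration
for start, end, peptide in find_motif_hits(toy_sequence, DEGRON_LIKE):
    print(f"putative motif {peptide} at residues {start}-{end}")
```

A pattern match alone only nominates a candidate site. Because such short patterns occur frequently by chance, hits are usually filtered further, for example by requiring that they lie in a disordered region.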
These regular expressions also capture the conservation pattern of different motif types and can therefore be used in the prediction of putative motifs.

1.3 Domain-motif interfaces

Current understanding of protein-protein interaction interfaces

Protein interaction interfaces are formed through the interaction of protein modules, mainly globular domains and motifs. For example, the binding between two globular domains is termed a domain-domain interface (DDI). DDIs involve multiple contacts and are characterized by a high binding affinity, which contributes to the stability of protein interactions (Nooren 2003). DDIs aid in stabilizing protein complexes and are often involved in enzymatic activity, cell signaling, cell adhesion and other cellular events.

Later, researchers found that, in addition to forming DDIs, protein domains can recognize SLiMs, forming a domain-motif interface or DMI (Dyson 2005; Babu et al. 2012; Tompa 2012; Davey et al. 2012). DMI-mediated interactions are weaker and more transient, and play a role in major biological processes such as signal transduction, protein targeting to cellular compartments and protein homeostasis (Schreiber et al. 2009; Zhou 2012). Maintaining these interactions is therefore crucial, as their disruption can potentially lead to disease (Arimura et al. 2000; Uyar et al. 2014). Despite their importance, DMIs remain significantly underrepresented. Because these interactions are transient, they are hard to detect using the traditional experimental approaches described in section 1.5. Tompa et al. (2014) estimated the number of motifs to be in the hundreds of thousands or even millions. Accordingly, the last two decades have seen a tremendous rise of interest in SLiM-mediated PPIs in research fields such as structural biology, systems biology and bioinformatics.
In my thesis, I will focus only on the systematic prediction of DMIs followed by experimental validation and will use this information in comparative PPI profiling as a strategy for efficient variant characterization.

Functional significance of Domain-Motif interfaces in cellular processes and disease

In this section, I describe several examples highlighting the functional role of DMIs in biological processes and their implications for disease.

A notable example is the degron motif with the pattern PxAxVxP (where x represents any amino acid), which is found in the target protein AFF4. This protein plays a critical role in transcription regulation and chromatin remodeling and is a core component of the super elongation complex (SEC). The SEC facilitates the efficient synthesis of mRNA transcripts by RNA polymerase II (RNAPII) during transcription elongation (Lin et al. 2010; C. Luo L. et al. 2012). AFF4 also helps to recruit RNAPII to gene promoters and to overcome transcriptional pausing. This activity is crucial for ensuring proper gene expression profiles and supporting cellular function (Lin et al. 2010). The degron motif of AFF4 is recognized by the substrate-binding domain (SBD) of the E3 ubiquitin ligase SIAH1. SIAH1 is the central component of a multiprotein E3 ubiquitin ligase complex and is essential for the regulation of protein levels within the cell. It has been implicated in the regulation of programmed cell death. In some studies, it has been identified as a tumor suppressor, as it can degrade oncogenic proteins, which helps to prevent tumor formation and progression. The recognition of AFF4 by SIAH1 has previously been functionally annotated (Oliver et al. 2004). Upon binding, the motif forms a beta strand parallel to the beta-sandwich fold of the SBD of SIAH1. This interaction is known as the beta augmentation mechanism.
When the SBD of SIAH1 contacts the degron of AFF4, it facilitates the ubiquitination of AFF4. The tagged protein is then degraded by the proteasome complex (Figure 1.8). This biological process is important for maintaining cellular homeostasis by removing damaged and misfolded proteins and regulating protein levels within the cell (Santelli et al. 2005).

While the mechanism of the interface between these proteins has been annotated, the exact mechanism underlying the development of the associated disorders is poorly understood, and many mutations found on these interfaces remain uncharacterized. One example is the Met260Thr variant, in which methionine is mutated to threonine within the motif of the previously mentioned AFF4. The mutation was found in a patient with a rare NDD called CHOPS syndrome and is reported in ClinVar as a VUS. However, the diagnosis of CHOPS syndrome caused by this rare mutation is complicated. The limited number of documented cases makes it difficult to establish diagnostic criteria and develop personalized treatment. Using our approach, we know that the mutation lies within the motif of AFF4 and might perturb the interaction with SIAH1. The disruption of this interaction may lead to the stabilization and accumulation of AFF4 and cause the developmental abnormalities that characterize the disease.

Another example of a domain-motif mediated interaction is the interaction between 14-3-3 domain proteins and phosphorylated ligand motifs on their target proteins (Figure 1.9) (Grozinger et al. 2000; M. J. Wang K. et al. 2000). YWHAG (14-3-3 protein gamma) is one of the proteins possessing a 14-3-3 domain, which recognizes phosphorylated serine residues within the RAQSSP, RTQSAP and RKTASEP consensus motifs of histone deacetylase 4 (HDAC4). Motif binding to 14-3-3 proteins was first described in 1997 by Yaffe et al. YWHAG is an adaptor protein localized in the cytoplasm.
It belongs to the 14-3-3 protein family involved in signal trans- 27 Figure 1.8: The mechanism of interaction between SIAH1 and its target partner AFF4. Substrate-binding domain (SBD) binds to the degron motif on AFF4, which leads to the ubiquitination of AFF4. Tagged protein is further degraded by the proteasomal system (Oliver et al., 2004). CHD stands for C-terminal homology domain. The structure of the interface is shown as predicted by AF2. duction, protein localization, cell apoptosis and cell cycle. This protein plays a crucial role in signaling pathways by binding to the phosphory- lated motifs of its interacting partners. One of these proteins is HDAC4, a transcriptional regulator, which deacetylates lysines at the N-terminal region of the core histones H2A, H2B, H3, and H4 in the nucleus. The previous studies described the mechanism of interaction and regulation of HDAC4 and HDAC5 by YWAHG. In the inactive state, phospho- rylated deacetylases are located in the cytoplasm, where they bind to the 14-3-3 domain of YWHAG via three phosphorylated sites. These interactions lead to the sequestration of HDAC4/5 to the cytoplasm (Grozinger et al. 2000; M. J. Wang K. et al. 2000). This keeps HDAC4 from entering the nucleus and repressing the transcription of genes important for different functions like neuron development (M.-S. Kim et al. 2012; Pennington et al. 2018). YWHAG is linked to a type of developmental and epileptic encephalopathy that is character- ized by neurodevelopmental impairment and the onset of seizures lead- ing to delays in cognitive and motor development, whereas mutations in HDAC4 are found in patients with neurodevelopmental disorder with central hypotonia and dysmorphic facies (NEDSHF), brachydactyly and intellectual disability. 
To illustrate how understanding the interaction mechanism can be informative about variant impact and the potential cause of disease, consider the Glu247Gly mutation within the RKTASEP motif of HDAC4, which is associated with NEDSHF (Wakeling et al. 2021). This mutation is documented as a pathogenic missense variant in the ClinVar database. It is not reported in gnomAD and has been determined to be a de novo mutation. In a functional study, immunoprecipitation of HDAC4 carrying the Glu247Gly mutation in HEK293 cells demonstrated reduced binding to another 14-3-3 protein, YWHAB (Wakeling et al. 2021). As the PPI interface is the same as with YWHAG, we can assume that this mutation might also disrupt the interaction with the 14-3-3 domain of YWHAG. Knowing the mechanism of interaction, we can hypothesize that reduced binding to, or loss of interaction with, YWHAG may impair the nuclear export of HDAC4, causing abnormal gene expression and contributing to the disorder.

Figure 1.9: Model of the regulation of HDAC4 activity through the interaction with a 14-3-3 domain protein. Upon phosphorylation of HDAC4, the phosphorylated ligand motif is recognized by the 14-3-3 domain. This domain-motif interaction leads to the sequestration of HDAC4 and HDAC5 in the cytoplasm, preventing them from downregulating gene transcription (Grozinger et al., 2000).

1.4 Predicting the known occurrence of DMIs in protein interactions using sequence-based approaches

The most efficient way to characterize DMIs would involve sequence-based analysis combined with structural modeling. This combined approach includes two steps: 1) using sequence-based predictions to identify potential contact residues between proteins, and 2) structural modeling to visualize and pinpoint inter-atomic interactions at the interface.
Furthermore, the predicted structural model of the putative interface can aid experimental validation by guiding the design of mutations expected to perturb binding between the interacting regions. I will discuss this part in more detail in Section 2, Article II.

One way to predict DMIs is to identify instances of known DMI types. Databases like ELM contain a catalog of high-quality DMI types manually curated based on experimental evidence. As ELM employs regular expression patterns (see section 1.2.3) and HMMs of the corresponding binding domains, it can help find occurrences of known domains and motifs in the protein interactome (Weatheritt et al. 2012; Edwards et al. 2014; Gouw et al. 2018).

iELM (interaction of Eukaryotic Linear Motif) is a web server that employs the annotated motifs from ELM together with PPI data to identify putative SLiM-mediated interactions among PPIs extracted from the STRING database (Weatheritt et al. 2012). iELM first checks for domain-domain interfaces using 3did, a DDI database (Mosca et al. 2014). If a DDI is found, the search stops. If no such interface is found, iELM predicts motifs by applying the regular expressions of the ELM resource and aligning the sequence of the queried protein with its orthologs. Predictions are scored using the SLiMSearch algorithm based on motif conservation (Davey et al. 2011). Next, putative motifs and their flanking regions are evaluated for intrinsic disorder propensity by the IUPred tool (Dosztányi et al. 2005). Concurrently, motif-binding domains are detected via HMMSearch, optionally using Pfam HMMs (J. Finn M. et al. 2010). The E-value derived from the HMM match, together with the conservation and disorder scores of the identified motifs, is used to train a support vector machine that evaluates putative DMIs. If templates for the putative DMIs are available, structural modeling is performed by PepSite, which scores the biophysical feasibility of the modeled DMIs (Petsalaki et al. 2009).
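The iELM benchmark discussed in the next paragraph was run on a test set in which negatives outnumber positives almost 30-fold. A short calculation illustrates why sensitivity and specificity alone can overstate performance on such data. The sketch below assumes exactly 30 negatives per positive (the text states "almost 30-fold") and uses hypothetical function and variable names; the resulting precision is an illustration and not a figure reported by Weatheritt et al.

```python
def precision_from_rates(sensitivity, specificity, negatives_per_positive):
    """Expected precision (positive predictive value) for a given class ratio.

    Counts are expressed per one positive example in the test set.
    """
    true_positives = sensitivity * 1.0
    false_positives = (1.0 - specificity) * negatives_per_positive
    return true_positives / (true_positives + false_positives)


# Reported benchmark rates; the 30:1 ratio is an assumption (see lead-in).
ppv = precision_from_rates(sensitivity=0.848,
                           specificity=0.865,
                           negatives_per_positive=30)
print(f"Expected precision: {ppv:.3f}")
```

Under these assumptions, only about 17 % of predicted DMIs would be true positives, even though both reported rates exceed 84 %.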
In benchmarking, iELM achieved a sensitivity of 84.8 % and a specificity of 86.5 % on its test set (Weatheritt et al. 2012). Despite this good performance, iELM was evaluated on an imbalanced dataset in which negative data points outnumber positive ones almost 30-fold. In addition, iELM halts the search for potential DMIs if any domain-domain interface type is found. Since DMIs and DDIs are not mutually exclusive and can act synergistically in interactions, this approach may overlook potential DMIs. Moreover, iELM builds HMMs tailored to specific motif-binding domains using hand-curated sets of known sequences. This carries the risk of overfitting, as HMMs can become too specialized for a narrow set of domain sequences. Finally, iELM has not been updated and is no longer in use. These limitations motivated my former colleague to develop a DMI predictor tool, which I applied and whose putative DMIs I validated experimentally. The workflow of the tool and its application will be covered in Chapter 3.

While DMI interface predictions can be made, they still have to be validated systematically by experiment. Below, I discuss various large-scale methods and suggest a suitable assay for the proposed strategy.

1.5 Systematic experimental validation of putative DMIs

Today, various high-throughput methods for the systematic discovery of PPIs are available. Putative interfaces can be validated using PPI assays that quantify the effects of mutations on PPIs. When mutations that were designed to validate predicted interfaces, or that were identified in patients, reduce or eliminate binding compared to the wild type, this suggests that the interface is involved in the interaction.
However, a mutation can disrupt an interaction for other reasons, such as partial misfolding or complete unfolding, which destabilizes the protein or leads to its degradation. Alternatively, the mutation can cause mislocalization of the protein to another subcellular compartment and/or lead to its degradation. Therefore, it is essential to use a method that allows monitoring of protein expression levels and provides a quantitative score indicating the binding strength of interactions.

In this section, I describe different in vitro and cell-based methods capable of identifying PPIs, and potential assays suitable for the experimental validation of putative DMIs.

PPI methods are broadly classified into binary methods and co-complex methods (Table 1, 1-2). For example, AP-MS is known for its scalability in systematic interaction mapping (see section 1.1). Because the method is designed to detect protein associations rather than direct PPIs, it would not be effective for interface validation. Moreover, this assay may fail to detect transient or weak interactions, which can be lost during the lysis and washing steps.

On the other hand, in vitro methods like ITC, SPR, FP and MST (see Table 1, 3-6) detect likely direct interactions and provide real-time information on the binding affinity of these PPIs (Ward2001; Stahelin2013; Pierce et al. 1999). While these methods are quantitative and can assess the effect of mutations on interactions, they require purified proteins, whose production can be time-consuming, as well as expensive equipment, which makes these assays less scalable. Due to complications in the purification step, only the potentially binding protein fragments are used, making it unclear how the interaction occurs in a full-length context. Additionally, since these assays operate outside the native cellular context, whether the domain-motif interactions (DMIs) occur in cells remains uncertain.
Another method, cross-linking mass spectrometry (XL-MS), can be performed both in vitro and in cell-based systems (see Table 1, 7). It is valuable for discovering new interfaces, as it captures residues in close proximity and thereby provides structural insights. However, it is less suited for interface validation. For instance, DMI-driven PPIs may be lost during the washing step, and inefficient cross-linkers may capture intra-protein contacts, complicating the analysis. Designing mutations for validation can be challenging, as cross-linkers target specific residues. Furthermore, this method does not allow the binding affinity of mutant proteins to be compared with that of the wild-type proteins.

| Assay name | Type | Assay is based on… | Assay detects… | Scalable? | Able to test potential effect of mutation on specific PPI? | Able to study the effect of mutation on binding affinity of specific PPI? | Able to measure protein expression levels? | Able to check protein localization? | Comments |
|---|---|---|---|---|---|---|---|---|---|
| Affinity Purification (AP)-Mass Spectrometry (MS) | In vitro | Affinity purification* of a bait protein along with its prey (associated) partners, followed by MS | Protein complexes | Yes | No | No | No | No | *Protein purification might be time-consuming and large sample amounts are required |
| Co-immunoprecipitation (Co-IP)-Mass Spectrometry (MS) | In vitro | The use of specific antibodies* to pull down a target protein along with its prey (associated) partners, followed by MS | Protein complexes | No | No | No | No | No | *Expensive (e.g. due to the need for specific antibodies) |
| Isothermal Titration Calorimetry (ITC) | In vitro | Measuring heat changes if two proteins interact | Likely direct PPI | No | Yes | Yes | No | No | |
| Surface Plasmon Resonance (SPR) | In vitro | Measuring changes in refractive index to quantify binding | Likely direct PPI | Yes | Yes | Yes | No | No | |
| Microscale Thermophoresis (MST) | In vitro | Measuring the thermophoretic movement of molecules in a temperature gradient to quantify binding | Likely direct PPI | No | Yes | Yes | No | No | |
| Fluorescence Polarization (FP) | In vitro | Measuring changes in the polarization of fluorescent light emitted by a fluorophore | Likely direct PPI | No | Yes | Yes | No | No | |
| Cross-linking Mass Spectrometry (XL-MS) | In vitro / Cell-based | Using chemical cross-linkers to capture protein-protein interactions, followed by mass spectrometry to identify cross-linked peptides | Likely direct PPI | Yes | No* | No | No** | No** | *Not suitable for this (or the mutation design is quite complicated, as cross-linkers recognise specific residues, therefore the mutation has to be done or occur on them). **Possible in a cell-based assay where proteins are tagged and the tag signal (e.g. fluorescence) is measured |
| Yeast Two-Hybrid (Y2H) | Cell-based* | DNA-binding and activation domains fused to interacting proteins | Likely direct PPI | Yes | Yes | No | No** | No** | *Proteins are forced to be in the nucleus of the yeast. **Possible if the proteins are tagged prior to transformation and checked by flow cytometry and microscopy |
| Protein Fragment Complementation Assay (PCA) | In vitro / Cell-based | The reconstitution of a reporter protein when two proteins of interest interact | Likely direct PPI | Yes | Yes | Yes | No | No | |
| Proximity Ligation Assay (PLA) | Cell-based | Using proximity-dependent ligation of oligonucleotide-conjugated antibodies to create a signal that is amplified and quantified if the proteins are within close proximity | Likely direct PPI | No | Yes | No | No | No | |
| Fluorescence Resonance Energy Transfer (FRET) | In vitro / Cell-based | Measuring energy transfer between two fluorophores | Likely direct PPI | Yes | Yes | Yes | Yes | No* | *The localisation can be checked if combined with imaging |
| Bioluminescence Resonance Energy Transfer (BRET) | In vitro / Cell-based | Detecting the energy transfer between a bioluminescent donor and a fluorescent acceptor when they are in close proximity | Likely direct PPI | Yes | Yes | Yes | Yes | No* | *The localisation can be checked if combined with imaging |
| MAPPIT (Mammalian Protein-Protein Interaction Trap) | Cell-based | Reconstituting the JAK/STAT signaling pathway through the interaction of bait and prey proteins, leading to reporter gene activation | Likely direct PPI | Yes | Yes | Yes | No | No | |
| Luminescence-based two-hybrid assay (LuTHy) | Cell-based followed by cell-free | BRET-based assay followed by Co-IP | Likely direct PPI | Yes | Yes | Yes | Yes | No* | *The localisation can be checked if combined with imaging |

Table 1: Overview of different in vitro and cell-based methods to detect PPIs.

In parallel, cell-based methods have been developed. Cell-based binary methods detect PPIs mostly based on the co-expression of genetically tagged proteins. If these proteins interact, their tags come into proximity, producing a readout that indicates a PPI. Common readouts include the reconstitution, activation or expression of reporter proteins. A well-known example is the Yeast Two-Hybrid (Y2H) assay, in which a DNA-binding domain is fused to a bait protein and a transcription activation domain is fused to a prey protein (Chien et al. 1991; Fields et al. 1989). When bait and prey interact, the transcription factor is reconstituted and activates the reporter gene. The interaction is thus indicated by reporter gene activation and the growth of the yeast.
While Y2H has been used for interaction profiling and for studying the effects of mutations on PPIs (see section 1.1), it cannot directly indicate whether reduced yeast growth is due to partial misfolding or unfolding of the proteins, as Y2H does not allow protein expression to be monitored in real time. Additional techniques such as western blotting are needed for validation. Alternatively, fluorescent tagging of proteins before transformation followed by flow cytometry can be used to check protein levels, but this requires additional steps, costs and expertise. Overall, while Y2H is a simple and useful assay for detecting PPIs, its limitations hinder its ability to fully characterize interaction interfaces.

Following principles similar to Y2H, many binary methods have subsequently been developed to mitigate its shortcomings. Examples include the Protein Fragment Complementation Assay (PCA) and the Mammalian Protein-protein Interaction Trap (MAPPIT) assay (see Table 1, 9-10). In PCA, a reporter protein (e.g. GFP) is split into two non-functional fragments. Each fragment is genetically fused to one of the two proteins of interest. When the two proteins interact, the fragments come into proximity, allowing the reporter protein to reassemble and regain its functional state, which serves as a readout for the interaction. The advantage of this assay is that it detects likely direct PPIs in living mammalian cells, thereby providing a more suitable cellular context for testing interactions between human proteins. Similarly, MAPPIT is based on the reconstitution of the JAK-STAT signaling pathway, a key pathway involved in cytokine-mediated signal transduction. In MAPPIT, a mutated cytokine receptor fused to a bait protein recruits JAK upon interaction with a prey protein, leading to STAT activation and reporter gene transcription (Lievens et al. 2011).
Due to the involvement of this pathway, the assay is limited to PPIs that occur near the plasma membrane. Additionally, steric hindrance might interfere with potential interactions. Both methods provide a mammalian context for the tested interactions, but they do not allow protein expression to be monitored, which is essential for the characterization of putative DMIs.

As an alternative to methods that rely on the reconstitution of a reporter protein, methods like Förster resonance energy transfer (FRET) and bioluminescence resonance energy transfer (BRET) offer more direct readouts based on physical proximity. These assays detect PPIs through non-radiative energy transfer between a donor and an acceptor molecule, which occurs only when the two are in close proximity. In FRET, proteins are fused to donor and acceptor fluorophores, and upon interaction, energy is transferred from the donor to the acceptor, generating a detectable fluorescent signal (Sekar et al. 2003; S. S. Vogel et al. 2006; Grünberg et al. 2013).

In BRET, a luciferase is used as the donor and a fluorescent protein acts as the acceptor. The donor is not excited with monochromatic light at its specific excitation wavelength. Instead, the luciferase donor is activated by a chemical substrate, such as coelenterazine-h. Oxidation of this substrate by the luciferase leads to the emission of light (Xu et al. 1999). For example, the NanoLuc luciferase tag emits light with a maximum wavelength of 460 nm when coelenterazine-h is used (Hall et al. 2012). Upon addition and oxidation of the substrate, energy is transferred from the donor to the acceptor when the proteins are in proximity. The emitted luminescence is commonly detected at the short wavelength of the donor and at the long wavelength of the acceptor. The ratio of acceptor emission over donor emission is the BRET ratio, which indicates a potential interaction (Pfleger et al. 2006).
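The BRET ratio described above, and the background-corrected ratio (cBRET) used later in Chapter 2, can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up readings. In particular, subtracting the larger of the donor-only and acceptor-only control ratios is an assumption of this sketch, since the text only states that a control ratio is subtracted.

```python
def bret_ratio(acceptor_emission, donor_emission):
    """BRET ratio: long-wavelength (acceptor) over short-wavelength (donor) signal."""
    if donor_emission <= 0:
        raise ValueError("donor emission must be positive")
    return acceptor_emission / donor_emission


def corrected_bret(pair_ratio, control_ratios):
    """cBRET: pair ratio minus the background estimated from control wells.

    Assumption of this sketch: the largest control ratio is used as background.
    """
    return pair_ratio - max(control_ratios)


# Hypothetical plate-reader readings (arbitrary units).
pair = bret_ratio(acceptor_emission=500.0, donor_emission=1000.0)
donor_only = bret_ratio(acceptor_emission=250.0, donor_emission=1000.0)
acceptor_only = bret_ratio(acceptor_emission=125.0, donor_emission=1000.0)

cbret = corrected_bret(pair, [donor_only, acceptor_only])
```

Keeping the correction in a separate function makes it easy to swap in a different background definition if the assay protocol specifies one.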
Both methods allow transient PPIs to be studied in real time in living mammalian cells. They also allow protein expression levels to be monitored, which is crucial for characterizing interaction interfaces and understanding why an interaction may fail. To quantify binding affinity, saturation experiments can be performed with both methods, in which the quantity of one interaction partner is kept constant while the amount of the other is increased (Pfleger et al. 2006; Trepte et al. 2018).

Along with monitoring protein expression levels, it is possible to check the localization of mutated proteins relative to their wild-type counterparts. This can be achieved, for example, with bioluminescence imaging (BLI) using high-content screening (HCS) microscopy (J. Kim et al. 2024). In a high-content screening system, a plate containing cells that co-express the tagged proteins is imaged. The HCS system is equipped with high-sensitivity cameras and appropriate filters to detect the specific wavelengths of light emitted by the tags. First, the fluorescence of the expressed proteins is captured, and then luminescence is measured upon addition of the substrate.

The main limitations lie in the sensitivity of the assay and the orientation of the tags. As the assay relies on proximity, it may detect indirect interactions between proteins that are part of the same complex. Conversely, a real PPI might not be detected because of steric hindrance by the tags, leading to false negatives.

Compared with FRET, BRET offers several advantages:
• the use of luminescence and a substrate in BRET excludes acceptor cross-excitation and donor photobleaching, which simplifies data analysis
• reduced autofluorescence
• high sensitivity of the luciferase due to increased signal-to-background ratios
• lower amounts of DNA are sufficient due to the high sensitivity

Hence, these advantages make BRET a suitable approach for the validation of putative DMI interfaces.
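The saturation experiments mentioned above are often summarized with a hyperbolic one-site binding curve. The sketch below assumes this model, which is not spelled out in the text; `bret_max` (the plateau) and `bret50` (the acceptor/donor ratio giving the half-maximal signal) are illustrative parameter names, and all numbers are hypothetical.

```python
def saturation_bret(acc_don_ratio, bret_max, bret50):
    """Hyperbolic saturation model: BRET = bret_max * x / (bret50 + x)."""
    if acc_don_ratio < 0 or bret50 <= 0:
        raise ValueError("ratio must be >= 0 and bret50 must be > 0")
    return bret_max * acc_don_ratio / (bret50 + acc_don_ratio)


# At x == bret50 the signal is half of the plateau.
wild_type = saturation_bret(1.0, bret_max=0.5, bret50=1.0)

# A hypothetical interface mutant with weaker apparent affinity (larger bret50)
# gives a lower signal at the same acceptor/donor ratio.
mutant = saturation_bret(1.0, bret_max=0.5, bret50=4.0)
```

In such a model, a larger `bret50` at an unchanged plateau would be read as weaker apparent binding, which is why the acceptor/donor expression ratio has to be measured alongside the BRET signal.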
In a recent study, Wanker and colleagues combined BRET and Co-IP with a luminescence-based readout in one method (Trepte et al. 2018). This method, named luminescence-based two-hybrid assay (LuTHy), provides a double readout for PPI detection, which enhances the confidence in PPIs identified in a high-throughput manner.

Overall, given these advantages, the BRET assay might be the optimal choice for the proposed strategy to tackle the questions addressed in my study. The Wanker lab kindly provided us with the necessary donor and acceptor vectors, as well as the controls for our study described in Chapters II and III.

1.6 Aims of the thesis

Despite advances in sequencing technologies, most genetic variants remain poorly understood, hindering our grasp of disease mechanisms and complicating clinical diagnosis and treatment (see subsection 1.1.2). Edgotyping has been proposed as a strategy to address this challenge (see subsection 1.1.2). While several studies have attempted to apply this strategy using Y2H, an entirely experimental approach is laborious and expensive and therefore less efficient. To address this problem, the goal of my study is to propose a systematic approach that enables the characterization of variant effects by predicting DMIs and using this information for PPI profiling. This approach will include both computational and experimental methods.

First, I aimed to build an experimental pipeline to validate putative DMI interfaces. To achieve this, I need binary PPI data that can serve as a resource for discovering new interfaces. The HuRI dataset, described in section 1.1.2, is the largest dataset of binary protein-protein interactions (Luck et al. 2020). In our lab, we have full access to the HuRI open reading frame (ORF) collection. However, the ORFeome collection currently exists in a single copy, while multiple copies for use and storage are essential.
Since cloning procedures and site-directed mutagenesis are necessary for mutating proteins of interest and testing PPIs, I will first test the cloning in tube format and then adapt it to the plate format. Moreover, as the BRET assay has not yet been established in our lab, I will also assess its sensitivity.

The second aim is to employ a data-driven approach to select PPIs with predicted DMIs that are suitable for experimental validation. First, DMIs need to be predicted. My colleague developed a DMI tool, used it to generate predictions and mapped the putative DMIs onto the HuRI PPI dataset. To obtain mutations that may fall into predicted DMIs, another colleague processed the mutations from the largest patient database, ClinVar, and overlapped them with the mapped interface predictions. To further select PPIs that are mapped with DMIs, overlap with mutation data and are suitable for experimental validation, I need to know which ORFs and which isoforms are available in the ORFeome collection, and how many of those are cloneable and present in a full-length context. Moreover, predicted DMIs will be manually inspected for biological relevance. To do this, proteins will be annotated with experimental and biological information. Furthermore, for the experimental validation of the selected PPIs, controls will be chosen and included in the study: known DMI-mediated PPIs will serve as positive controls, and PPIs mediated by different interfaces, such as DDIs, will serve as negative controls.

Chapter 2

The development of the medium-throughput cloning and the BRET assay pipeline for the experimental validation of predicted DMIs

2.1 Preparation of the wild-type human ORFeome collection

As stated in Aim 1, the availability of a comprehensive ORFeome collection is essential for my project.
This collection provides access to GATEWAY-compatible clones for most wild-type proteins from the HuRI dataset, which are necessary for cloning into the LuTHy expression vectors that will be used in interaction profiling.

My supervisor Katja Luck brought one copy of a human ORFeome collection from her postdoctoral lab, comprising ORFs for around 17,500 human protein-coding genes. These ORFs are stored as GATEWAY-compatible clones, allowing them to be transferred to the destination vectors carrying the fluorescence and luminescence reporter tags needed for the BRET assay. As this collection came in only one copy, for maintenance and safety reasons, my colleague and I adapted and optimized a protocol for making three copies of the ORFeome using the Rainin Liquidator 96 manual pipetting system, kindly provided by the Khmelinskii group (Figure 2.1). The first copy serves as a working collection, the second copy will be kept as a backup, and the final copy is intended to be given to the media lab as an open resource for the IMB community. Overall, the 2-day protocol allows sixteen 96-well plates to be handled for making three copies. Original plates are thawed, and fresh plates with media are inoculated and incubated overnight. The incubation can be challenging because substantial evaporation reduces the volume needed to make three copies on the second day. Therefore, to optimize this step, we tested different incubators, materials to seal the plates, boxes to cover the plates in the incubator, and the well volumes we could use. In two months, we successfully copied 238 plates.

Figure 2.1: The scheme of the cloning pipeline followed by the BRET assay. The ORFeome collection was copied, where one copy was a backup, the second was made for the IMB community and the third served as a working copy. The ORFs selected for cloning were incubated in 96 deep-well plates overnight, and the DNA was verified by sequencing on the next day. Next, the clones were shuttled from the donor GATEWAY vector to the destination vector using the LR reaction. The mutant constructs were generated by site-directed mutagenesis and sequence verified. Upon cloning, the BRET assay was performed.

2.2 The assessment of the sensitivity of BRET assay

To evaluate the sensitivity of the assay chosen for the experimental validation, I needed to adapt the cloning of GATEWAY vectors from single-tube to plate format and to adapt the mutagenesis protocol to a medium-throughput pipeline. To do this, I used the open reading frames (ORFs), mutations, PPIs and controls from my collaborative project with the Koenig (IMB) and Sattler (Institute of Structural Biology, Helmholtz German Research Center for Environmental Health) groups.

The Koenig lab has recently established Far Upstream Element Binding Protein 1 (FUBP1) as a novel regulator of mRNA splicing. Our aim in this project is to aid in the identification of PPIs between FUBP1 and known protein components of the 5’ and 3’ splice sites as well as of the branchpoint on the mRNA, and to delineate the corresponding interaction interfaces. For this project, I generated 64 different constructs with the cloning pipeline and tested them in the BRET assay (Figure 2.1).

To test the sensitivity of the assay, I transfected donor and acceptor constructs at different DNA ratios (1:10 ng, 1:20 ng, 1:50 ng, 1:100 ng, 1:200 ng) into HEK293 mammalian cells for co-expression. Along with these pairs, I also included the standard controls. The Wanker group, developers of the LuTHy method, kindly provided us with these standard controls: empty vector controls to rule out background effects from the vector, donor-only (NanoLuc) and acceptor-only (mCitrine) constructs to ensure that interactions require both constructs, and non-interacting protein pairs to check for false positives.
A positive control pair with the known protein-protein interaction BAD-BCL2L1 was included to validate the system’s functionality.

Additionally, I used random protein pair controls for each tested pair, consisting of proteins not expected to interact, such as those with different cellular localizations (e.g., nuclear proteins paired with cytoplasmic proteins). As the proteins of interest are localized in the nucleus and the proteins of the positive control pair are found in the cytoplasm, we paired each tested protein of interest with one protein from the positive control pair.

I tested all interactions together with the controls and quantified BRET. The corrected BRET (cBRET) ratio is calculated by subtracting the BRET ratio of the controls (donor-only (i.e. NanoLuc) and acceptor-only (i.e. mCitrine) constructs) from the BRET ratio of the studied interaction of interest. We found that the cBRET values for weak interactions were close to the cBRET values of the random pairs. This suggested that a high amount of transfected cDNA might generate false-positive data. Therefore, we questioned those findings and evaluated assay specificity by testing a range of different DNA ratios for the previously tested interactions. I found that 1:50 ng appears to be a good ratio for discriminating significant from non-significant cBRET signals (Figure 2.2).

Figure 2.2: The evaluation of BRET’s sensitivity with a 1:50 ng donor:acceptor DNA ratio. The plot represents calculated cBRET ratios for tested FUBP1 (orange) and U2AF2 (red) interactions, positive controls (green) and random pairs (gray) as a function of the acceptor to donor (acc/don) protein expression ratio. All values are the mean +/- s.d. from two technical replicates.

In summary, copying the ORFeome collection and testing BRET assay sensitivity enabled quick access to the ORFs and helped define the ratio of tested constructs needed for transfection to avoid false positives. This allowed us to apply the BRET assay in interaction profiling to further explore protein-protein interactions (PPIs) involved in mRNA splicing.

2.3 Article I: FUBP1 is a general splicing factor facilitating 3’ splice site recognition and splicing of long introns

Summary

The splicing of pre-mRNA plays a crucial role in gene regulation and the expansion of the proteome in eukaryotes. However, it is not understood in detail how splice sites are recognized and paired during spliceosome assembly. This project focused on understanding the role of FUBP1 in RNA splicing, particularly its function in the recognition and processing of 3’ splice sites and in the splicing of long introns. Using in vivo iCLIP analysis, we found that FUBP1 binds to 91.3 % of 3’ splice sites in a pattern similar to that of core splicing factors like SF1, U2AF2 and SF3B1. Further investigation showed that FUBP1 recognizes a cis-regulatory RNA motif located upstream of the branch point (BP) in pre-mRNA. Through EMSA and ITC experiments, we demonstrated that FUBP1 binds GU-rich sequences. This was further validated by NMR and in vivo iCLIP data showing that the KH domains of FUBP1 independently recognize these motifs. Moreover, kinetic modeling and transcriptional profiling demonstrated that FUBP1 is required for efficient splicing of long introns, which represent 80 % of human introns.

Next, we explored the interactions of FUBP1 with other splicing factors. First, we studied FUBP1 interactions with components of spliceosome complexes. Here, NMR analysis provided insights into the interaction between FUBP1 and U2AF2, a key component of the 3’ splice site complex.
The preliminary structure from the NMR study suggests that the second RRM domain of U2AF2 and the N-terminal N-box of FUBP1 represent the minimal binding regions. Furthermore, using recombinant fragments of FUBP1 and U2AF2, we found that the amino acid change from alanine to aspartate at residue 38 (A38D), located in the N-box of FUBP1, disrupts the interaction. These data were supported in a full-length context by BRET experiments in mammalian cells (Article I, Figure 3, C & J). In the BRET data, the mutation significantly increased the distance between the proteins, but the interaction was not completely disrupted. We hypothesize that mutated FUBP1 and U2AF2 still associate because they bind to the same mRNA, but that the contact between them is much weaker because the direct interface is lost. With BRET we also confirmed the known interaction interface between FUBP1 and SF1, a protein of the U2 complex at the 5’ splice site (Article I, Figure 3, C & D).

Furthermore, we tested the interactions of FUBP1 with U1 snRNP-associated proteins, including SNRPA, SNRPC, TIAL1, PRPF40B and SNRPB, as well as TCERG1 and KHDRBS1. We confirmed these interactions of FUBP1 with U1-associated proteins by BRET and/or NMR. Interestingly, NMR analysis suggested that FUBP1’s A/B boxes interact with proline-rich regions of SNRPB. The observed changes were less pronounced with SNRPA and PRPF40B, which contain similar proline-rich stretches (Article I, Figure 6, G).

Overall, this study provided a comprehensive analysis demonstrating the global role of FUBP1 in pre-mRNA splicing. Our key findings suggest that FUBP1 acts as a general splicing regulator at the 3’ splice site. Moreover, many of the tested interactions mediated by domain-motif interfaces could be detected by BRET, and the disruption of these interfaces by point mutations established this assay as a valuable system to validate predicted DMIs.
Article

FUBP1 is a general splicing factor facilitating 3′ splice site recognition and splicing of long introns

Authors
Stefanie Ebersberger, Clara Hipp, Miriam M. Mulorz, ..., Katja Luck, Michael Sattler, Julian König

Correspondence
k.luck@imb-mainz.de (K.L.), michael.sattler@helmholtz-munich.de (M.S.), j.koenig@imb-mainz.de (J.K.)

In brief
Ebersberger et al. identify the RNA-binding protein FUBP1 as a key splicing factor that binds to a hitherto unknown cis-regulatory motif at 3′ splice sites. Multivalent interactions of FUBP1 with splice site components support spliceosome assembly at multiple stages and ensure efficient splicing of long introns.

Highlights
• FUBP1 recognizes a ubiquitous cis-regulatory RNA motif upstream of the branch point
• Multivalent interactions in disordered FUBP1 regions support spliceosome assembly
• FUBP1 affects long introns, which are prevalent in humans and altered in cancer
• Kinetic modeling and protein interactions implicate FUBP1 in splice site bridging

Ebersberger et al., 2023, Molecular Cell 83, 2653–2672
August 3, 2023 © 2023 The Author(s). Published by Elsevier Inc.
https://doi.org/10.1016/j.molcel.2023.07.002

Stefanie Ebersberger,1,12 Clara Hipp,2,3,12 Miriam M. Mulorz,1,12 Andreas Buchbender,1 Dalmira Hubrich,1 Hyun-Seo Kang,2,3 Santiago Martínez-Lumbreras,2,3 Panajot Kristofori,4 F.X. Reymond Sutandy,1 Lidia Llacsahuanga Allcca,1,13 Jonas Schönfeld,1 Cem Bakisoglu,5 Anke Busch,1 Heike Hänel,1 Kerstin Tretow,1 Mareen Welzel,1 Antonella Di Liddo,1 Martin M. Möckel,1 Kathi Zarnack,5,6 Ingo Ebersberger,7,8,9 Stefan Legewie,10,11 Katja Luck,1,* Michael Sattler,2,3,* and Julian König1,14,*

1 Institute of Molecular Biology (IMB) gGmbH, 55128 Mainz, Germany
2 Institute of Structural Biology, Helmholtz Center Munich, 85764 Neuherberg, Germany
3 Bavarian NMR Center, Department of Bioscience, School of Natural Sciences, Technical University of Munich, 85747 Garching, Germany
4 Department of Systems Biology, Institute for Biomedical Genetics (IBMG), University of Stuttgart, 70569 Stuttgart, Germany
5 Buchmann Institute for Molecular Life Sciences & Institute of Molecular Biosciences, Goethe University Frankfurt, 60438 Frankfurt am Main, Germany
6 CardioPulmonary Institute (CPI), 35392 Gießen, Germany
7 Applied Bioinformatics Group, Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, 60438 Frankfurt am Main, Germany
8 Senckenberg Biodiversity and Climate Research Center (S-BIK-F), 60325 Frankfurt am Main, Germany
9 LOEWE Center for Translational Biodiversity Genomics (TBG), 60325 Frankfurt am Main, Germany
10 Department of Systems Biology, Institute for Biomedical Genetics (IBMG), University of Stuttgart, 70569 Stuttgart, Germany
11 Stuttgart Research Center for Systems Biology (SRCSB), University of Stuttgart, 70569 Stuttgart, Germany
12 These authors contributed equally
13 Present address: University of California, Berkeley, CA 94720, USA
14 Lead contact
* Correspondence: k.luck@imb-mainz.de (K.L.), michael.sattler@helmholtz-munich.de (M.S.), j.koenig@imb-mainz.de (J.K.)

SUMMARY

Splicing of pre-mRNAs critically contributes to gene regulation and proteome expansion in eukaryotes, but our understanding of the recognition and pairing of splice sites during spliceosome assembly lacks detail. Here, we identify the multidomain RNA-binding protein FUBP1 as a key splicing factor that binds to a hitherto unknown cis-regulatory motif.
By collecting NMR, structural, and in vivo interaction data, we demonstrate that FUBP1 stabilizes U2AF2 and SF1, key components at the 30 splice site, through multivalent binding in- terfaces located within its disordered regions. Transcriptional profiling and kinetic modeling reveal that FUBP1 is required for efficient splicing of long introns, which is impaired in cancer patients harboring FUBP1 mutations. Notably, FUBP1 interacts with numerous U1 snRNP-associated proteins, suggesting a unique role for FUBP1 in splice site bridging for long introns. We propose a compelling model for 30 splice site recognition of long introns, which represent 80% of all human introns. INTRODUCTION tide,12,13 polypyrimidine (Py) tract,14–16 and branch point (BP) site, respectively (Figure 1A).9,17 In the resulting A complex, U2 Splicing is a crucial step in eukaryotic mRNA processing, and its snRNP is recruited to the BP and stabilized by SF3A and dysregulation is a hallmark of many cancers.1–3 Splicing is cata- SF3B, and SF1 is released.18,19 Subsequent snRNP recruitment lyzed by the spliceosome, a megadalton machinery comprising and further rearrangements (formation of B and C complexes) five small nuclear ribonucleoprotein (snRNP) complexes named mediate intron excision and exon ligation to form the U1, U2, U4, U5, and U6.4–7 During early spliceosome assembly mature mRNA. (E complex formation), the 50 and 30 splice sites are recognized: Strikingly, mechanistic details of splice site recognition by U1 binds at the 50 splice site, whereas U2 auxiliary factor 1 multidomain splicing factors during early spliceosome assembly (U2AF1), U2AF2, and splicing factor 1 (SF1) assemble at the 30 are lacking.20,21 U2AF2 binding is central to the early definition of splice site,6–11 where they specifically recognize AG dinucleo- splice sites and is subject to layers of regulation including direct Molecular Cell 83, 2653–2672, August 3, 2023 ª 2023 The Author(s). Published by Elsevier Inc. 
2653 This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). 45 ll OPEN ACCESS Article A B C D E Figure 1. FUBP1 binds upstream of the branch point at 30 splice sites during early spliceosome assembly in vivo (A) Schematic of spatial RBP assembly at the 30 splice site in the ‘‘commitment’’ E complex and the pre-spliceosomal A complex. BP, branch point. (B) iCLIP in HeLa cells. Distribution of binding sites across transcript regions for FUBP1 (n = 854,404), U2AF2 (n = 914,221), SF1 (n = 99,305), SF3B1 (n = 1,694,991), and PTBP1 (n = 127,450). 30 and 50 splice sites (ss) refer to 100 nt upstream/downstream of exons, respectively. CDS, coding sequence; UTR, untranslated region. (C) Metaprofiles of cross-link events of FUBP1, U2AF2, SF1, SF3B1, and PTBP1 relative to the BP. (D) Genome browser view of an internal exon in the CPS1 mRNA displaying the iCLIP data for FUBP1, U2AF2, SF1, and SF3B1 from HeLa cells. (E) Saturation analysis showing the percentage of bound 30 splice sites for each RBP in each quantile. competition, cooperative recruitment, change of RNA secondary FUBP1was initially characterized as a transcriptional regulator structure, dynamic conformational states, and autoinhibi- of the proto-oncogene c-myc through binding to AT-rich DNA tion.15,22–30 Despite the pivotal role of U2AF2, the precise contri- elements and interaction with PUF60, also known as the bution of cofactors and multivalent interactions are yet to be FUBP-interacting repressor (FIR).31–34 However, more recently, elucidated. 
Recently, we reported how U2AF2 achieves speci- FUBP1 has also been reported to bind RNA and to influence ficity despite the degeneracy of its pyrimidine-rich RNA-binding translation or splicing of specific transcripts.35–38 Similar to its motif.28 In this study, we found that the RNA-binding protein DNA-binding specificity, FUBP1 exhibits a general preference (RBP) far upstream binding protein 1 (FUBP1) promotes U2AF2 for AU- and GU-rich RNA31 that is expected to derive from its binding to RNA. four K homology (KH) domains.39 Notably, cancer-associated 2654 Molecular Cell 83, 2653–2672, August 3, 2023 46 ll Article OPEN ACCESS A B C D E F G H I Figure 2. FUBP1 binds a hitherto unknown cis-regulatory motif upstream of the BP (A) Genome browser view of an internal exon in the VPS13D mRNA displaying the iCLIP data for FUBP1, U2AF2, SF1, and SF3B1. (B) Domain architecture of FUBP1. KH, K homology domain; P-rich, proline-rich stretch. (C) Agarose gel (left) and quantification with fitted curve (right) from an EMSA experiment with recombinant FUBP1N-box+KH (50–6,400 nM) and a fluorescently labeled 132-nt RNA fragment of VPS13D (100 nM). Measurements were performed in duplicates and data are represented as mean ± standard deviation (SD). (D) Binding affinity for the interaction of FUBP1KH with VPS13D RNA determined by ITC. ITC measurements were performed in triplicates and data are repre- sented as mean ± SD. (legend continued on next page) Molecular Cell 83, 2653–2672, August 3, 2023 2655 47 ll OPEN ACCESS Article loss-of-function mutations within FUBP1 have been connected phoretic mobility shift assays (EMSAs) with a 132-nt RNA frag- to global splicing changes in low-grade glioma,1,40–42 suggesting ment upstream of the prototypical 30 splice site of exon 43 of an RNA-regulatory role in these processes. Here, we reveal a the VPS13D mRNA (VPS13D) and a shortened fragment (36 nt) global role for FUBP1 in pre-mRNA splicing. 
Our results suggest with the region showing the most FUBP1 binding in iCLIP that FUBP1 functions as a general splicing factor at the 30 splice (VPS13Dshort; Figure 2A). We observed strong binding of site, with a crucial role in promoting efficient splicing of long in- FUBP1 (FUBP1N-box+KH, aa 1–457) to both RNAs in the low nano- trons, whichmake up over 80%of human pre-mRNA transcripts. molar range (Figures 2B, 2C, and S1C). Isothermal titration calo- rimetry (ITC) with VPS13D yielded a similar result (Figure 2D; RESULTS Table S2), confirming the high-affinity binding at this region. FUBP1 harbors four KH domains, which are expected to bind FUBP1 is a core component of 30 splice site recognition single-stranded RNA and DNA32,52 and can act either indepen- Todissect the role of FUBP1 in splicing,weexamined the footprint dently or synergistically53–55 to recognize extended regions of of FUBP1 and other splicing factors on pre-mRNA in HeLa cells pre-mRNA. We used nuclear magnetic resonance (NMR) spec- using in vivo individual-nucleotide resolution UV cross-linking troscopy to investigate the modular arrangement of the four and immunoprecipitation (in vivo iCLIP; Figures 1B and S1A; FUBP1 KH domains. Superimposition showed that the NMR Table S1).43,44 As expected, large proportions of the binding sites spectrum of FUBP1KH (aa 86–457) containing KH1–4 was virtu- of SF1, U2AF2, and SF3B1 are located at 30 splice sites (10%, ally identical to those of the individual KH domains, indicating 17%, and 22%, respectively). Interestingly, FUBP1 shows a that the KH domains are structurally independent (Figure S1D). similar preference for 30 splice sites (19%). 
By contrast, for the Furthermore, NMR secondary structure analysis revealed that more restricted splicing regulator PTBP1, which is known to act FUBP1 contains KH domains with a typical type I fold that are on a subset of exons, only 1% of binding sites are located at 30 connected by flexible linkers (Figure S1E).56 We conclude that splicesites.Weconfirmed thatU2AF2bindsat thePy tract located the KH domains of FUBP1 are not preformed into an RNA-bind- between theBPand30 splice site,45,46whereasSF1bindingpeaks ing platform but rather can be considered like beads on a string. at the BP, with a reduced signal at the BP adenine itself,9,17 pre- To characterize the individual RNA-binding preferences of the sumably owing to the lower cross-linking efficiency of adenine four KH domains, we performed a scaffold-independent analysis (Figures 1C and 1D).47 Consistent with a previous report,48 (SIA), which is based on changes in NMR chemical shifts upon SF3B1 binds in a clamp-wise manner up- and downstream of titration with short oligonucleotide motifs (Figure S2A).57 Initial the BP. Strikingly, FUBP1 also shows a pronounced footprint at binding experiments were performed using randomized pools the BP (Figures 1C and 1D). Its binding peaks at a location 34 nu- of 5-mer DNA, followed by verification of the identified motifs us- cleotides (nt) upstream of the BP and tails for up to 100 nt. In com- ing RNA oligonucleotides (Figure S2B). SIA identified well- parison, PTBP1 does not display such a ubiquitous positioning at defined consensus motifs for KH1 (UUUG) and KH2 (UUGU) 30 splice sites (Figure 1C).49,50Next,weaddressedwhat fractionof and more loosely defined motifs for KH3 (YBKK, where Y = C 30 splice sites is bound using a saturation-basedanalysis that con- or U; B = C, G, or U; K = G or U) and KH4 (YUKK). 
Hence, all trols for splice site usage and transcript abundance.51 We found four KH domains exhibit a preference for GU-rich sequences that FUBP1 binds the same percentage of 30 splice sites as (Figure 2E). The affinities of the individual KH domains to the final U2AF2 and SF3B1, which are both universally present at 30 splice motifs, as determined by NMR spectroscopy, are in the high sites (91.3%, 95.4%, and 99.6%, respectively; Figure 1E). By micromolar range (Figures S2C–S2F). Combinations of two KH contrast, SF1 and PTBP1 are associated with 27.3% and 3.1% domains and motifs show strong binding avidity: the ITC- of 30 splice sites, respectively (Figures 1E and S1B). Overall, these measured affinities for tandem domains were in the high nano- data suggest that FUBP1 functions as a general splicing factor in molar to low micromolar range (Figures S2G–S2I; Table S2). early spliceosome assembly. This suggests that specificity and high affinity are achieved by avidity and multivalent interactions between the four KH do- FUBP1 binds a cis-regulatory RNAmotif upstream of the mains and RNA with multiple binding motifs (Figure 2F). Indeed, branch point EMSA and ITC experiments confirmed that multiple FUBP1 Given the prevalence of FUBP1 upstream of the BP, we investi- binding motifs in the VPS13D mRNA fragment increase FUBP1 gated its RNA-binding preferences. First, we performed electro- binding to nanomolar affinity (Figures 2A, 2C, and 2D). (E) Scaffold-independent analysis (SIA)-derived binding motifs for individual FUBP1 KH domains. Preferred bases are highlighted in white. Y, pyrimidine (T or C); B, not A (C, G, or T); K, keto (G or T). (F) KD values of individual and tandem KH domains with their optimal DNA target (KH1, TTTTG; KH2, TTTGT; KH3, TCTGT; KH4, TTTTG; KH1-2, TTTGTAAAATTTTG; KH2-3, TCTGTAAAATTTGT; KH3-4, TTTTGAAAATCTGT) determined by NMR or ITC, respectively (Figures S2C–S2I; Table S2). ITC measurements were performed in triplicates. 
For NMR, the KD values of eight selected residues were calculated. Data are represented as mean ± SD. (G) Motif enrichment in the in vivo FUBP1 iCLIP data. Disjunct 4-mer frequencies were calculated for the top vs bottom 20%of binding sites based on expression- normalized iCLIP signals. (H) Positional enrichment of FUBP1 binding motifs and control motifs relative to the BP. UUU+A/G/C, i.e., 4-mers containing UUU interspersed at any position with A/G/C. NNNN, 100,000 sets of random combinations of four 4-mers. 4-mer frequencies were calculated position-wise upstream of the BP and compared with the average 4-mer frequencies in an intronic control region. Top: Metaprofile of normalized FUBP1 and SF3B1 iCLIP cross-link events at the same 30 splice sites is shown for comparison. (I) Abundance of FUBP1 binding motifs at 30 splice sites of human introns. Background distribution for all possible 4-mers (mean ± 1 SD) is shown in gray. 2656 Molecular Cell 83, 2653–2672, August 3, 2023 48 ll Article OPEN ACCESS A C B E D F G H I J (legend on next page) Molecular Cell 83, 2653–2672, August 3, 2023 2657 49 ll OPEN ACCESS Article Interrupting the U-rich motifs of VPS13Dshort with cytidines C-terminal region of SF1. Consistently, the SF1-FUBP1 interac- severely reduces the binding affinity, underlining the specificity tion detected by BRET is reduced upon deletion or mutation of of the FUBP1-RNA interaction (Figure S1C). the A/B boxes (Figures 3B–3D and S2M). In addition, 1H-15N cor- To validate the interaction between FUBP1 and RNA motifs in relation NMR spectra of the FUBP1A/B box region show specific cells, we compared 4-mer motifs in the sites of strongest FUBP1 chemical shift perturbations (CSPs) upon titration with the pro- binding over background in the in vivo iCLIP data. In line with the line-rich region of SF1, indicating direct binding (Figure 3E). 
SIA, we found a strong preference for uridine-rich motifs at To map the interacting regions in FUBP1 and U2AF2, FUBP1 binding sites (Figures 2G and S2J). For in vivo binding, we performed NMR titration experiments using 15N-labeled these motifs can be interspersed at any position by adenine or, U2AF2RRM12 14,15,30 and unlabeled full-length FUBP1. Large to a lesser extent, by guanine. Consistent with the omnipresence CSPs and line broadening in the 1H-15N correlation spectra of FUBP1 at 30 splice sites, we observed a striking enrichment of exclusively map to the U2AF2 RRM2 domain, especially to the FUBP1 binding motifs (‘‘UUU+A’’ and ‘‘UUU+G,’’ i.e., three uri- two a helices on the backside of the b sheets that mediate dines interspersed at any position with adenine or guanine) up- RNA binding (Figures 3B, 3F, S2N, and S3A). Moreover, a stream of the BP, where they coincide with FUBP1 binding (Fig- construct comprising the N-terminal region of FUBP1 ure 2H). Conversely, both ‘‘UUU+C,’’ accounting for general (FUBP1N74, aa 1–74) recapitulates the CSPs observed with full- uridine richness, and random motif sets are enriched closer to length FUBP1,whereas a construct lacking theN-terminal region the 30 splice site but not in the main region of FUBP1 binding. (FUBP1DN, aa 75–644) does not yield any evident CSPs Importantly, enriched FUBP1 motifs upstream of the BP are a (Figures 3F and S3A). Complementary NMR titrations with 15N- common feature across all annotated introns (Figures 2H, 2I, labeled FUBP1 constructs identify the U2AF2 RRM2 domain S2K, and S2L), indicating that we identified a previously un- and a short peptide motif in the N-terminal region of FUBP1 known cis-regulatory RNA motif in splicing regulation. (aa 27–52), referred to as N-box, as the minimal binding regions (Figures S3B–S3F). 
The U2AF2 RRM2-FUBP1 N-box interaction FUBP1 directly interacts with U2AF2 and SF1 exhibits micromolar affinity by NMR titrations (Figures 3G, S3D, Given the prevalence of FUBP1 at functional 30 splice sites, we S3G, and S3H; Table S2). examined whether FUBP1 interacts with key early 30 splice site To provide a high-resolution view, we determined the NMR- components in cells using bioluminescence resonance energy derived solution structure of the U2AF2 RRM2-FUBP1 N-box transfer (BRET) (Figure 3A).58 Interaction signals in the BRET complex (Figures 3H and S3I–S3K; Table 1). This structure assay are indicative of direct contacts or close proximities. As shows a well-defined U2AF2 RRM2 domain and a more mobile a proof-of-concept, we confirmed the known U2AF2-SF1 inter- helical FUBP1 N-box and reveals that the FUBP1N-box forms an action.8,10,18,58 Importantly, we also observed interactions of a helix, which is recognized by helices a1 and a2 and the b4 FUBP1with U2AF2 andSF1 (Figures 3B, 3C, and S2M), suggest- strand of U2AF2RRM2. Hydrophobic interactions dominate at ing that FUBP1 is in close or direct contact with these core this interface, where four alanines in FUBP1N-box (A30, A34, splicing factors inside cells. A38, and A42) are aligned along the extended hydrophobic inter- To investigate whether FUBP1 directly interacts with SF1 (Fig- face, with A38 positioned centrally. Additional contacts involving ure 3C), we focused on the C-terminal region of FUBP1, which bulkier side chains, that is, R37 and I41 in FUBP1 and L278 and harbors the A and B boxes (A/B boxes). Thesemotifs are specific M323 in U2AF2, further stabilize the binding interface. 
The to the FUBP family of proteins and have been shown to mediate recognition of the FUBP1 N-box resembles the interaction be- binding to a proline-rich region of snRNP-U1-70K in fruit tween FUBP1 N-box and PUF60,34 consistent with structural flies.60,59 Similar proline-rich regions are also present in the similarities between PUF60 and U2AF2 RRM2 (Figure S4A).34 Figure 3. FUBP1 directly interacts with SF1 and U2AF2 via its C-terminal A/B boxes and N-terminal N-box (A) Schematic of BRET assay. Energy transfer between the substrate oxidized by NanoLuc luciferase (donor, Don) andmCitrine (acceptor, Acc) occurs if proteins X and Y interact. (B) Domain architecture of U2AF2 (UniProt: P26368) and SF1 (UniProt: Q15637). ULM, U2AF ligand motif; RRM, RNA-recognition motif; UHM, U2AF homology motif family; Qua2, quaking homology 2 domain; ZF, zinc finger. (C) BRET values for tested interaction pairs and controls. Two biological replicates are shown. Error bars represent SD of technical triplicates. Trp-to-Arg mutations in the A/B boxes were rationalized based on disrupting the hydrophobic contacts as previously reported.59 (D) BRET saturation curves for combinations of FUBP1 variants and wild-type SF1. Trp-to-Arg mutations in the A/B boxes or their deletion significantly lowered the maximal BRET signal, although changes in the BRET50 (acceptor/donor ratio at which half-maximal BRET signal is reached) were not significant. Amounts of acceptor and donor proteins were estimated by fluorescence and total luminescence, respectively, in intact cells. Two biological replicates are shown. Error bars represent SD of technical triplicates. (E) NMR titration of FUBP1A/B with SF1P-rich. Significant chemical shift changes are highlighted by boxes. (F) Binding interface mapping based on NMR titration of U2AF2RRM12 with full-length FUBP1, FUBP1N74, and FUBP1DN (Figure S3A). (G) Binding affinity for the interaction of FUBP1N-box and U2AF2RRM2 from NMR titrations. 
Chemical shift differences of four exemplary residues of FUBP1N-box (Figures S3D and S3G) are fitted to binding isotherm to estimate the KD. Data are represented as mean ± SD of calculated KD values of eight selected residues. (H) NMR-derived structure of the complex of U2AF2RRM2 (green) and FUBP1N-box (brown) (Figure S3K; Table 1, PDB: 8P25). (I) Comparison of NMR titrations of FUBP1N-box WT and mutant FUBP1N-box+A38D with U2AF2RRM2. (J) BRET saturation curves for wild-type FUBP1 andmutant FUBP1A38D against U2AF2. Two biological replicates are shown. Error bars represent SD of technical triplicates. 2658 Molecular Cell 83, 2653–2672, August 3, 2023 50 ll Article OPEN ACCESS Table 1. Statistics for structure calculation of the U2AF2RRM2/ U2AF2 RRM2 (Figures S4C and S4D). A significant weakening FUBP1N-box chimera, related to Figures 3H and S3K, PDB: 8P25a of the U2AF2-FUBP1 interaction by A38D in the full-length Experimental restraints context was also confirmed in cells using BRET (Figures 3C, 3J, and S2M). Here, some residual binding between Distance restraints FUBP1A38D and U2AF2 was observed, probably because both Total NOE 2,147 proteins remain in proximity through binding to the same pre- Short range, |i–j| % 1 1,047 mRNAs. As expected, A38D does not affect FUBP1-SF1 bind- Medium range, 1 < |i–j| < 5 392 ing, which occurs via the A/B boxes (Figures 3C, S4H, and Long range, |i–j| R 5 708 S4I). In summary, our experiments demonstrate that FUBP1 in- Dihedral angle restraints (from TALOS) teracts directly with U2AF2 and SF1 via its N-terminal N-box and C-terminal A/B boxes, respectively. The former interaction F 82 is severely impaired by a cancer-associated mutation in FUBP1. 
J 86 Structure statistics FUBP1 promotes U2AF2 binding to 30 splice sites RMSD from experimental restraints (mean and SD) To investigate the impact of FUBP1 on E complex formation, we Distance restraints (Å), no violation > 0.5 Å 0.013 ± 0.007 monitored U2AF2 binding to RNA using in vitro iCLIP. 28 To this Dihedral angle restraints (#, no violation > 0.5#) 0.19 ± 0.04 end, we designed a pool of short RNA transcripts (182 nt) repre- senting !2,000 natural 30 splice sites from human transcripts, Deviations from idealized geometry which we mixed with recombinant U2AF2RRM12 (see STAR Bond lengths (Å) 0.004 ± 0.0001 Methods). Remarkably, addition of recombinant full-length Bond angles (#) 0.60 ± 0.01 FUBP1 (FUBP1FL) results in stronger binding of U2AF2RRM12 to Impropers (#) 1.31 ± 0.04 virtually all 30 splice sites in the transcript pool (Figures 4A, 4B, Average pairwise coordinate RMSD (Å) and S5A–S5C; Table S1). The in vivo pattern of U2AF2 binding Backbone 0.92 ± 0.30 can thereby be reproduced in vitro in the presence of full-length FUBP1 (Figure 4C). The widespread effects are in contrast to Heavy atoms 1.41 ± 0.22 a those of our previous findings using in vitro-translated FUBP1,Pairwise coordinate root-mean-square deviation (RMSD) was calcu- which affected only a few U2AF2 binding sites.28 Hence, our up- lated for the 10 lowest-energy structures (regions 250–336 in U2AF2RRM2 and 31–43 in FUBP1N-box) after water refinement. Rama- dated experiments indicate that FUBP1 acts globally to stabilize chandran plot: 93.1%, 6.1%, 0.3%, and 0.4% of residues (regions 250– U2AF2 binding. We find that this effect is dependent on FUBP1 336 in U2AF2RRM2 and 31–43 in FUBP1N-Box) are found in the most concentration and is directly linked to the number of FUBP1 favored, additionally allowed, generously allowed, and disallowed binding motifs upstream of the BP (Figure 4D). To confirm these regions. 
findings in longer transcripts, we repeated the experiment with a pool of eight in vitro transcripts (2.0–5.7 kb; Figures S5D and S5E; Table S1). Indeed, addition of recombinant full-length Interestingly, both FUBP1 N-box-RRM interfaces show only FUBP1 increases the strength of U2AF2RRM12 binding at 30 splice limited interdigitation of the hydrophobic side chains, consistent sites (Figures 4E and S5F) and thereby reproduces the in vivo with the modest binding affinity in the micromolar range. binding pattern of U2AF2 (Figure 4F). Notably, this effect is In a recent survey of The Cancer Genome Atlas (TCGA), considerably reduced with FUBP1DN (impaired U2AF2 interac- FUBP1 was noted for its particularly high rate of non-synony- tion), and it is completely abolished with FUBP1N74 (lacking KH mousmutations in low-grade gliomas.1 To learn about themech- domains). This highlights the importance of the N-box in anistic impact of such mutations, we systematically searched FUBP1 for directly interacting with U2AF2 as well as of cancer mutation databases and identified 26 disease-related FUBP1’s RNA binding for the stabilization of U2AF2 single-nucleotide variants (SNVs) within the FUBP1 N-box (Fig- (Figures 4F and S5F). Together, this indicates that the interaction ure S4B). Five candidate mutations (A38D, A43E, K44R, I45F, of FUBP1 with both pre-mRNA and U2AF2 globally promotes and G47C) were selected by considering the magnitude of U2AF2 binding at the 30 splice site during early spliceosomal chemical shift changes occurring in the NMR titration of assembly. FUBP1N-box with U2AF2RRM2 (Figures S3B–S3D and S4B). In addition, we included L35V, which has been shown to weaken FUBP1 is critical for the splicing of long introns the FUBP1-PUF60 interaction.61 NMR analysis revealed that To investigate the impact of FUBP1 on splicing, we generated a A38D strongly impairs U2AF2 binding (Figures 3I and S4C– FUBP1 knockout (KO) RPE1 cell line using CRISPR-Cas9 S4G). 
This is consistent with our structure in which A38 forms genome engineering (Figures 5A and S5G) and performed the core of the hydrophobic binding interface between FUBP1 RNA-seq. MYC gene expression was unaltered, suggesting N-box and U2AF2. A bulkier negatively charged side chain in that it is not controlled by FUBP1 in RPE1 cells (Figure S5H). this position is expected to introduce steric and electrostatic Next, we examined transcriptome-wide splicing and found repulsion at the binding interface. Residue A38 in FUBP1 was 1,041 significant splicing changes, including 399 cassette exons also required for binding to PUF60 in a mutational study,61 (Figure 5B; Tables S1 and S3). Consistent with a role in splice site whereas L35V, which also affected the FUBP1-PUF60 interac- recognition, FUBP1KOpreferentially leads to exon skipping (276 tion in that study, did not impair the interaction of FUBP1 with [69%] with delta percent spliced in [DPSI] < "0.1). Molecular Cell 83, 2653–2672, August 3, 2023 2659 51 ll OPEN ACCESS Article A B C D E F Figure 4. FUBP1 stabilizes U2AF2 binding at 30 splice sites in vitro (A) Overview of FUBP1 protein variants used in in vitro iCLIP experiments. (B) Scatterplot of in vitro iCLIP signal in U2AF2 binding sites of U2AF2RRM12 alone and upon addition of full-length FUBP1 on a pool of 1,998 in vitro transcripts. (C) Genome browser view of LARP4mRNA displaying in vivo iCLIP for FUBP1 and U2AF2 and in vitro iCLIP on the respective in vitro transcript for U2AF2 alone and after addition of full-length FUBP1. (D) Number of FUBP1 bindingmotifs upstream of the BP (["100 nt;"26 nt]) in relation to the log2-transformed fold change of U2AF2RRM12 binding upon addition of full-length FUBP1 for 1,504 30 splice sites in the in vitro transcripts. (E) Metaprofile of U2AF2 binding at 30 splice sites from in vitro iCLIP with long in vitro transcripts28 and U2AF2RRM12 alone and after addition of FUBP1FL, FUBP1N74, or FUBP1DN. 
iCLIP signals were normalized by spike-in and averaged per nucleotide over all introns (n = 21).
(F) Genome browser view of C4BPB mRNA displaying in vivo iCLIP for FUBP1 and U2AF2 and in vitro iCLIP for U2AF2RRM12 alone and after addition of FUBP1FL, FUBP1N74 or FUBP1ΔN.

Molecular Cell 83, 2653–2672, August 3, 2023

A closer inspection revealed that the fate of an exon is related to the length of the flanking introns: decreased inclusion in FUBP1 KO cells is typically observed for exons that are flanked by longer introns, compared with exons with increased or unchanged inclusion (Figure 5B, top). Most affected exons are alternative exons, but we observed the same effect for regulated constitutive exons (Figure S5I). Importantly, the effect on long introns can be recapitulated in ENCODE62,63 data on FUBP1 knockdown cells (Figure 5B, bottom). To test whether this depends on the interaction with U2AF2, we generated a FUBP1-Nboxmut mutant with a targeted deletion of A38 and neighboring amino acids in the endogenous FUBP1 gene in RPE1 cells (Figures 5A and S5G). Although overall fewer cassette exons are regulated in this mutant (n = 81), exons are predominantly skipped (n = 45), and these are flanked by longer introns (Figure S5J). Together, these data reveal that FUBP1 is important for the splicing of long introns and suggest a functional role for the N-box in this process.

To investigate whether FUBP1 mutations in tumor cells affect splicing, we analyzed data from glioma patients.1 Intriguingly, we found that skipped exons in patients with FUBP1 loss-of-function mutations have longer adjacent introns than exons dysregulated in patients harboring other splicing factor mutations (Figures 5C and S6A). The effect is also evident upon FUBP1 knockdown in the glioblastoma cell line U87MG from the same study (Figure 5C). Together, these data strongly suggest that FUBP1 plays a role in the efficient splicing of long introns, thereby affecting the inclusion of adjacent exons.

To validate the role of FUBP1 for long introns, we constructed a minigene for the alternative exon 18 in the MPDZ transcript, which is skipped upon FUBP1 KO in RPE1 cells. The minigene comprises the alternative exon with the flanking constitutive exons and intervening long introns (>2.4 kb). In vivo iCLIP data show that FUBP1 binds at both 3′ splice sites, which was confirmed in vitro by EMSA with FUBP1N-box+KH (aa 1–457; Figures S6B and S6C). We observed a marked decrease of alternative exon inclusion from the MPDZ minigene in FUBP1 KO (16% inclusion) and an intermediate effect (25%) in FUBP1-Nboxmut cells, compared with wild-type (WT) cells (31% inclusion; Figures 5D, S6B, and S6D). Upon mutation of the FUBP1 binding sites, the exon showed reduced inclusion (7%) and did not change in the FUBP1 KO. If the introns were shortened but the FUBP1 binding sites retained, the effect of FUBP1 KO or mutation was reduced, albeit still present, consistent with the notion that the intron is still perceived as long due to the presence of FUBP1 binding site. By contrast, if the FUBP1 binding sites were also removed, exon inclusion no longer responded to FUBP1 KO or FUBP1-Nboxmut, highlighting that FUBP1 binding is specifically required for the long-intron variant.

Intriguingly, the changes at long introns are linked to FUBP1 binding. We found a substantial increase in FUBP1 binding at the 3′ splice sites of longer introns, both in absolute terms and relative to other splicing factors (Figures 5E and 5F). Differential FUBP1 binding was not observed for other exon-intron-related features, such as splice site, Py tract, and BP strength (Figures S6E–S6H). Furthermore, longer introns exhibit a marked enrichment of FUBP1 motifs upstream of the BP (Figures 5G and 5H). By contrast, random motif occurrences or splice site strength are independent of intron length (Figures S6I and S6J). Moreover, long introns were previously observed to preferentially locate to the nuclear periphery and exhibit a differential GC content architecture.64,65 Indeed, we found that the occurrence of FUBP1 binding motifs correlates with the GC content architecture (Figures S6K–S6M). Furthermore, FUBP1 binds stronger to introns located in the nuclear periphery (Figure S6N) and to splice sites of exons with differential GC content architecture (Figures S7A–S7C). Further analysis indicated that both intron length and differential GC content architecture affect FUBP1 binding (Figure S7D).

Although splicing is an ancient molecular mechanism, gene architecture and especially intron length are subject to substantial evolutionary change (Figure S7E). We hypothesized that FUBP1 is present throughout Eukaryota and that lineage-specific losses or modifications of FUBP1 are accompanied by changes in average intron length. Indeed, we find overall that FUBP1 is well conserved. Although losses do occur, they are mostly observed in taxa with short introns such as protozoa and fungi (Figures 5I and 5J). Species with FUBP1 consistently harbor more FUBP1 motifs at their 3′ splice sites (Figure 5K). By contrast, U-rich motifs interspersed with C, which do not accumulate in the region of

Figure 5. FUBP1 binds stronger to long introns and regulates exons flanked by long introns
(A) Western blot of FUBP1 in wild-type (WT), FUBP1-Nboxmut mutant, and FUBP1 KO RPE1 cells (Figure S5G). Vinculin acts as loading control.
(B) Minimum adjacent intron length for cassette exons more or less included upon FUBP1 KO in RPE1 cells (n = 123/276) and FUBP1 knockdown in K562 cells (n = 30/143) compared to unchanged control exons (RPE1, n = 10,301; K562, n = 1,910). ***p < 0.001, ****p < 0.0001, n.s., not significant.
(C) Junction length for less-included exons in RNA-seq from glioma patients with FUBP1 loss-of-function (LoF) mutations, from a FUBP1 siRNA knockdown in U87MG cells, and from SF3B1/U2AF1/SRSF2 hotspot mutations and RBM10 LoF mutation in different cancer patient samples. ***p < 0.001.
(D) Changes of exon inclusion (n = 3) in FUBP1 WT, FUBP1-Nboxmut, and FUBP1 KO RPE1 cell lines upon intron shortening and/or removal of FUBP1 binding sites in the MPDZ minigene (Figure S6B). Data are represented as mean ± SD. Significance was determined by a two-sided Student’s t-test with Benjamini-Hochberg correction. Red dots represent FUBP1 binding sites. *p < 0.05, **p < 0.01, ***p < 0.001, n.s., not significant.
(E) Metaprofile showing FUBP1 cross-link events relative to branch point for various intron lengths. iCLIP signals were normalized for expression and averaged per nucleotide over all introns.
(F) Quantification of binding signal based on area-under-the-curve (AUC) in main binding regions (see STAR Methods for details). Binding enrichment is defined as log2 fold change of AUC over AUC of introns with length in (100 nt, 400 nt).
(G) Positional enrichment of FUBP1 binding motifs and control motifs relative to branch point and for various intron lengths. UUU+A/G/C, sets of four 4-mers containing UUU interspersed at any position with A/G/C. NNNN, 100,000 sets of random combinations of four 4-mers.
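As a reading aid for panels (F) and (G), the short Python sketch below spells out two of the quantities named in the legend: the log2 binding enrichment of an AUC over a reference AUC, and one plausible reading of a "UUU+A/G/C" set (UUU with a single A, G, or C inserted at each of the four possible positions). The function names and the example sequence are hypothetical and chosen only for illustration; the actual procedure is the one described in the STAR Methods.

```python
import math


def log2_enrichment(auc, auc_reference):
    """log2 fold change of a binding AUC over the AUC of the reference intron class."""
    return math.log2(auc / auc_reference)


def interspersed_uuu_set(base):
    """Four 4-mers made of UUU plus one `base` inserted at each possible position."""
    return ["UUU"[:i] + base + "UUU"[i:] for i in range(4)]


def count_motif_hits(sequence, motifs):
    """Count overlapping occurrences of any 4-mer from `motifs` in an RNA sequence."""
    motif_set = set(motifs)
    return sum(sequence[i:i + 4] in motif_set for i in range(len(sequence) - 3))


if __name__ == "__main__":
    # One motif set per interspersed base (assumed reading of "UUU+A/G/C").
    for base in "AGC":
        print(base, interspersed_uuu_set(base))
    # Made-up example sequence, not taken from the data.
    print(count_motif_hits("UUUAUUU", interspersed_uuu_set("A")))
```

Counting overlapping windows is a design choice of this sketch; a position-wise frequency profile, as used for the figure, would record where each hit starts rather than only how many hits there are.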
4-mer frequencies were calculated position-wise upstream of the BP and compared with average 4-mer frequencies in intronic control region.
(H) Number of FUBP1 binding motifs upstream of the BP ([−100 nt; −26 nt]) for various intron lengths ([500, 1,000), n = 24,564 introns; [1,000, 2,000], n = 32,251 introns; [2,000, 4,000], n = 31,734 introns; [4,000, 17,000], n = 38,692).
(I) Phylogenetic profile of FUBP1. Tree indicates taxonomic range scanned for presence of FUBP1 orthologs. Fractions of species harboring ortholog to human FUBP1 (left) and carrying the A/B boxes (right) are shown.
(J) FUBP1 presence compared to median intron length per species. ***p < 0.001.
(K) Percentage of introns with at least one FUBP1 motif or control motifs present in 25-nt window located 25 nt upstream of the 3′ splice site.

FUBP1 binding (Figure 5G), are least enriched in species with FUBP1. Comparing FUBP1’s domain architecture across eukaryotic evolution, we find that C-terminal A/B boxes are an animal-specific innovation. Their appearance in evolution is associated with an overall increase in intron length in animals compared with other eukaryotes (Figure 5I). Together, this suggests that FUBP1 binding to its RNA motifs and its protein-protein interfaces play important roles in the splicing of long introns.

FUBP1 interacts with both splice sites suggesting a function in cross-intron bridging
To decipher the molecular mechanism of FUBP1 action, we developed a kinetic model of cassette exon splicing using ordinary differential equations (Figures 6A and S7F; Table S4). In line with our previous work,66 we considered a scenario for "exon definition" in which the U1 and U2 snRNPs recognize the 5′ and 3′ regions flanking an exon as functional units. The subsequent splice site pairing by U1/U2 snRNP interaction across the intron, that is, intron definition, triggers splicing catalysis, which results in either cassette exon inclusion, skipping, or intron retention in the model. We first simulated the loss of FUBP1 in a model in which FUBP1 solely acts on initial U1/U2 snRNP binding to exons (exon definition). However, our simulations argue against a pure exon definition effect, as the model cannot recapitulate the splicing changes that occur upon FUBP1 KO (Figure 6B; model 1). According to our experimental data, exons flanked by two long introns are typically skipped upon FUBP1 KO, whereas exons flanked by at least one short intron tend to show slightly increased inclusion. Surprisingly, the experimental data are more consistent with an alternative model in which FUBP1 enhances the pairing of splice sites across long (but not short) introns during intron definition. The model predicts reduced exon inclusion upon FUBP1 KO specifically for exons flanked by two long introns, whereas exons flanked by one short intron moderately increase, irrespective of whether it is located upstream or downstream (Figure 6B; model 2). These results also hold true in a modified model, in which exons are not defined as functional units, and intron splicing solely requires U1 and U2 binding to flanking splice sites (Figure S7G, "intron definition model"). Taken together, the experimental observations are consistent with the kinetic model, which assumes that FUBP1 differentially affects long introns by promoting splice site pairing and the formation of catalytically active spliceosomes across long introns.

To test this prediction, we investigated the cross-linking of FUBP1 to snRNAs, indicative of its presence at different stages of splicing. First, FUBP1 showed substantial cross-linking to U2 snRNA, consistent with FUBP1 binding upstream of the BP where the U2 snRNP replaces SF1, indicating that FUBP1 is present during A-complex formation (Figure 6C). More importantly, FUBP1 also cross-links to U1 snRNA, which binds to the 5′ splice site, suggesting that FUBP1 is present during the bridging of the 3′ and 5′ splice sites, either during initial exon definition or also at later stages of intron definition. The latter is further supported by the cross-linking of FUBP1 to U6 snRNA, which replaces U1 snRNA at the 5′ splice site prior to lariat formation (Figure 6C). Hence, FUBP1 might be involved in intron bridging throughout the splicing cycle. We next searched our iCLIP datasets for evidence that FUBP1 is still bound in the spliceosomal C complex when the lariat has formed after the first splicing reaction. It has been shown that reads from the lariat truncate at the position where the 5′ splice site is covalently linked to the BP and is detected as a single-nucleotide-wide peak at the 5′ splice site (Figure 6D).68,69 Indeed, we observed a strong peak in read truncations for FUBP1 at the 5′ splice site, whereas there was almost no signal for the other splicing factors tested (Figure 6E). This suggests that FUBP1 is present from the early stages of spliceosome assembly until at least the first catalytic step of the splicing reaction.

To further investigate whether FUBP1 is actively involved in splice-site bridging, we searched available binary protein-protein interaction data from yeast two-hybrid screens.67 These data confirmed that FUBP1 binds to U2AF2 (Figure 6F). We also found evidence for FUBP1 interacting with several U1-associated proteins (SNRPA, SNRPC, TIAL1, and PRPF40B) as well

Figure 6.
FUBP1 interacts with U1 snRNP components
(A) Kinetic model of FUBP1’s effects on alternative splicing quantitatively describes steady-state abundance of splice products for a three-exon gene in control and FUBP1 KO conditions. Two model variants were analyzed, in which FUBP1 affects the initial exon definition step near long introns (model 1), and the subsequent splicing reaction, promoting the excision of long introns (model 2). See STAR Methods for details.
(B) Simulated splicing changes upon FUBP1 KO reflect transcriptome-wide RNA-seq data assuming that FUBP1 affects splicing catalysis (model 2). To reflect the heterogeneity of exons in the human transcriptome, kinetic parameters of the model were chosen at random, giving rise to an ensemble of 10,000 in silico exons. FUBP1 KO was simulated for each in silico exon, assuming that FUBP1 either enhances exon definition (model 1) or the rate of splicing (model 2) for long (but not short) introns (see STAR Methods for details). In the data, significantly regulated cassette exons were classified based on flanking intron lengths (<400 nt = short, ≥400 nt = long).
(C) Fraction of total reads mapping to snRNAs using custom reference consisting of snRNAs (n = 10), tRNAs (n = 22), and rRNAs (n = 6).
(D) Schematic description of three-way junction of intron lariats. cDNAs can truncate not at the original protein-RNA interactions site but rather at the three-way junction. These cDNAs either start from the intron end and truncate at the BP or, alternatively, start downstream of the 5′ splice site and truncate at the first nucleotide of the intron.
(E) Metaprofiles showing cross-link events of FUBP1, U2AF2, SF3B1, SF1, and PTBP1 relative to the 5′ splice site. iCLIP signals were normalized for expression and averaged per nucleotide.
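To make the idea behind panels (A) and (B) more concrete, the following Python sketch integrates a deliberately small ODE system for a three-exon gene and compares percent-spliced-in (PSI) with and without FUBP1. It is a toy illustration only and not the model from the STAR Methods or Table S4: the reaction scheme (two partially spliced intermediates, a competing exon-skipping reaction, uniform decay), every rate constant, the assumption that the exon-skipping reaction is also FUBP1-dependent because it pairs splice sites across a long span, and all names such as simulate_psi and BOOST are assumptions made here.

```python
# Toy ODE sketch for exon1-intron1-exon2-intron2-exon3 (illustrative assumptions only).
BOOST = 3.0      # assumed fold-increase of pairing across long spans when FUBP1 is present
K_LONG = 0.5     # assumed basal splicing rate of a long intron (no FUBP1)
K_SHORT = 2.0    # assumed splicing rate of a short intron (FUBP1-independent)
K_SKIP = 0.3     # assumed basal rate of pairing exon 1 directly with exon 3
DECAY = 0.5      # assumed first-order decay of every RNA species
SYNTHESIS = 1.0  # assumed constant transcription rate


def intron_rate(is_long, fubp1_present):
    """Splicing rate of one intron; only long introns respond to FUBP1."""
    if not is_long:
        return K_SHORT
    return K_LONG * (BOOST if fubp1_present else 1.0)


def simulate_psi(up_long, down_long, fubp1_present, dt=0.01, steps=20000):
    """Integrate the ODEs with explicit Euler steps and return PSI at the final time."""
    k1 = intron_rate(up_long, fubp1_present)    # removal of the upstream intron
    k2 = intron_rate(down_long, fubp1_present)  # removal of the downstream intron
    ks = K_SKIP * (BOOST if fubp1_present else 1.0)  # exon skipping (assumed boosted)
    pre = a = b = inc = skip = 0.0
    for _ in range(steps):
        d_pre = SYNTHESIS - (k1 + k2 + ks + DECAY) * pre
        d_a = k1 * pre - (k2 + DECAY) * a       # upstream intron already removed
        d_b = k2 * pre - (k1 + DECAY) * b       # downstream intron already removed
        d_inc = k2 * a + k1 * b - DECAY * inc   # exon-included mRNA
        d_skip = ks * pre - DECAY * skip        # exon-skipped mRNA
        pre += dt * d_pre
        a += dt * d_a
        b += dt * d_b
        inc += dt * d_inc
        skip += dt * d_skip
    return inc / (inc + skip)


if __name__ == "__main__":
    for label, up, down in [("long/long", True, True), ("short/long", False, True)]:
        wt = simulate_psi(up, down, fubp1_present=True)
        ko = simulate_psi(up, down, fubp1_present=False)
        print(f"{label}: PSI WT = {wt:.3f}, PSI KO = {ko:.3f}")
```

With these arbitrary numbers, the toy system gives a lower PSI in the simulated KO when both flanking introns are long and a higher PSI when one flanking intron is short, which is the qualitative pattern the legend attributes to model 2. Sampling the rate constants at random, as described for the 10,000 in silico exons, would be the natural next step, but the sampling ranges are not given in this section and are therefore left out.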
(F) Comprehensive interaction network of FUBP1 based on NMR, BRET, and published yeast two-hybrid data.67
(G) BRET measurements between FUBP1 and subunits of the U1 snRNP complex as well as U1 snRNP-associated proteins along with positive and negative control pairs. Biological replicates are shown. Error bars represent SD of technical triplicates.
(H) NMR titration of FUBP1A/B with SNRPBP-rich up to a molar ratio of 1:2. Significant chemical shift changes are highlighted by boxes.
(I) Percent-spliced-in (PSI) of MPDZ minigene upon transfection of WT and FUBP1 KO RPE1 cells with different FUBP1 constructs. Data are represented as mean ± SD. Significance was determined by a two-sided Student’s t-test with Benjamini-Hochberg correction. *p < 0.05, **p < 0.01, ***p < 0.001, n.s., not significant.

Figure 7. FUBP1 acts at multiple steps of early spliceosomal assembly
(A) The multiple roles of FUBP1 during spliceosomal complex assembly at the 3′ splice site.
(B) FUBP1 directly interacts with U2AF2, SF1, and additional U1/U2 snRNP components via distinct disordered interaction interfaces.

as with SNRPB, which is a member of the Sm protein ring in all snRNPs (Figure 6F). These and further interactions of FUBP1 with U1-associated proteins (TCERG1 and KHDRBS1) were confirmed using the BRET assay and/or NMR (Figures 6F, 6G, and S7H). Interestingly, several of the U1 snRNP-associated proteins harbor proline-rich regions, which potentially interact with the A/B boxes in FUBP1, similar to the FUBP1-SF1 interaction discussed above. Indeed, we observed significant changes in the NMR spectrum of FUBP1A/B upon the addition of a proline-rich peptide from SNRPB (Figure 6H), which were less pronounced with SNRPA and PRPF40B derivates (Figures S7I and S7J). This correlates well with the proline-rich region in SNRPB being much larger than in SNRPA or PRPF40B and thus avidity effects perhaps enhance the binding.

Finally, to confirm the importance of the FUBP1 A/B boxes and their role in splice-site bridging, we performed a complementation assay by expressing full-length GFP-FUBP1 and different mutants in both WT and FUBP1 KO RPE1 cells. Effects on splicing were monitored using the co-transfected MPDZ minigene. As expected, GFP-FUBP1 complements the FUBP1 KO cells and rescues MPDZ exon inclusion close to WT levels (Figures 6I, S7K, and S7L). Importantly, expression of GFP-FUBP1W586,615R (mutations in the A/B boxes) or FUBP1ΔC (complete deletion of the C terminus) impairs complementation in FUBP1 KO cells. The same was also observed if the interaction with U2AF2 is perturbed by expressing either FUBP1A38D (N-box mutation) or FUBP1ΔN (complete deletion of the N terminus). Overall, these data demonstrate that both the A/B boxes and the N-box in FUBP1, which mediate the interactions with factors at the 5′ and 3′ splice sites, respectively, are functionally relevant for splicing.

DISCUSSION

FUBP1 is a general component of 3′ splice site definition
The recognition and pairing of splice sites, especially for the many long introns in the human transcriptome, are not well understood. In this study, we identified FUBP1 as a key component in 3′ splice site definition. We found that FUBP1 recognizes clustered U-rich elements interspersed by A or G that are present at virtually all 3′ splice sites and are most abundant for longer introns. Until now, four conserved intron-defining sequence motifs were known: the 5′ splice site motif, the BP sequence, the Py tract, and the 3′ splice site motif.6 We propose the FUBP1 binding motif as a sequence signature that is relevant for spliceosomal assembly at long introns, which represent >80% of all human introns. Consistent with such a general role in splicing, FUBP1 has been detected in purified spliceosomes using mass spectrometry.70–72

We show that the four KH domains of FUBP1 recognize clustered arrays of binding motifs upstream of the BP. Multivalent interactions enhance binding affinity by avidity and enable the recognition of cis-elements in RNAs of variable length by combining individual KH-RNA motif interactions where multiple clustered RNA motifs may be separated by variable nucleotide linkers.54 We find that the four KH domains are connected by flexible linkers, which facilitates scanning of extended RNA regions. The recognition of clustered RNA motifs by multidomain RBPs has been observed in IMP proteins and also involves four KH domains.55 This suggests that KH domains working in concert might be a common mechanism for specifically recognizing clustered RNA motifs in extended RNA regions.

FUBP1 engages in multivalent interactions with 3′ and 5′ splice site components
We characterized two interfaces in FUBP1 that mediate protein-protein interactions: the N-box and the A/B boxes that are embedded in the intrinsically disordered N- and C-terminal regions of FUBP1, respectively. The N-box has been shown to interact with the RRM domain of PUF60 for regulation of transcription.33,73,74 Here, we found that the FUBP1 N-box also binds to the RRM2 domain of U2AF2 and thereby mediates a functional interaction during pre-mRNA splicing. The N-box binds RRM2 opposite its RNA-binding surface, and thus, RNA binding and FUBP1 binding do not compete. Notably, we have previously shown that the U2AF2 tandem domains adopt closed conformations and that RNA binding selects open arrangements.15,29,75 Thus, binding of FUBP1 to the helical face of U2AF2 RRM2 might enhance RNA binding not only by stabilizing U2AF2 on the RNA but also by shifting the tandem RRM arrangements of U2AF2.

The A/B boxes of FUBP1 interact with intrinsically disordered proline-rich sequences within several U1 and U2 snRNP-associated proteins. This matches observations on the A/B boxes of the FUBP1 ortholog PSI in Drosophila melanogaster, which have been shown to bind to a proline-rich region in snRNP-U1-70K.59 However, this region is not conserved in the human ortholog SNRNP70, and our BRET studies detected no such interaction between FUBP1 and SNRNP70. In general, linear motifs in proline-rich regions are recognized by structured regions such as WW or SH3 domains.76 These interactions are generally weak but often enhanced by multivalent interactions.77–81 Interestingly, the A/B boxes are unique to the FUBP family and appear to be unstructured regions in the ortholog PSI.59 It will be interesting to learn how prevalent such an atypical mode of proline-rich sequence binding is and how it impacts cellular function.

FUBP1 contributes to spliceosome formation and guides the splicing of long introns
One important question is why FUBP1 is particularly relevant for long introns. Clearly, the splicing of long introns is difficult to achieve. For instance, it has been reported that exons flanking long introns are less included,82,83 and that the splice sites of longer introns are stronger.84,85 Consequently, longer introns require more complex regulation, such as the switch from initial exon definition to cross-intron spliceosomal complexes.84,86 During exon definition, splice sites are recognized and paired across the exon, which is thereby defined as a functional unit. During the subsequent switch to intron definition, the complex shifts to a cross-intron pairing of splice sites (Figure 7). Our data suggest that FUBP1 acts at both steps. We propose that during exon definition, FUBP1 stabilizes U2AF2 and SF1 at the 3′ splice site. FUBP1 can thus strengthen the initial recognition of 3′ splice sites via its multivalent interactions with U2AF2, SF1, and pre-mRNA. The stabilization by FUBP1 and its interactions with the U1 snRNP across the exon might thus contribute to splice site recognition during exon definition.86,87

The interactions between FUBP1 and U1 snRNP components might also be relevant after the switch from exon definition to cross-intron pairing. Consistent with this model, we found that FUBP1 is still present at splice sites until the lariat is formed. In fact, FUBP1 forms cross-links to the U6 snRNA, which replaces U1 snRNA at the 5′ splice site. This indicates a role for FUBP1 in intron bridging during spliceosomal B-complex formation, particularly for long introns, as our experimental data and kinetic modeling suggest.

Several mechanisms and contributions to splice site bridging have been suggested, for example, the interactions between U1 and U2 snRNP proteins and RNA components88–90 and the U2AF-associated RNA helicase UAP56.91 It is conceivable that multiple contact sites act in concert to generate sufficient avidity to bring the splice sites together. Our data suggest that FUBP1, through multivalent interactions with pre-mRNA, proteins, and snRNAs located at the 5′ and 3′ splice sites, adds to these contacts throughout the splicing cycle. This is most pertinent for long introns harboring multiple FUBP1 cis-regulatory motifs.

In conclusion, we identify FUBP1 as a general splicing factor that ubiquitously binds at 3′ splice sites by means of a hitherto unknown cis-regulatory RNA sequence motif. The binding of FUBP1 and its interactions with multiple U1 and U2 snRNP components are pertinent to the efficient splicing of long introns.

Limitations of the study
Uridines are particularly prone to UV cross-linking, which can introduce bias to motif identification by iCLIP. However, we observed similar motifs using methods that do not involve UV cross-linking (NMR spectroscopy, ITC, and EMSA); therefore, we are confident that our conclusions in this regard are valid.

Upon depletion of FUBP1 in our KO or knockdown cell lines, other factors (such as the close paralog KHSRP) might, to some extent, take on the role of FUBP1. Together with cellular quality control mechanisms that degrade mis-spliced transcripts, this might reduce the effects of FUBP1 perturbation that we observed in our RNA-seq analysis. We might clarify such effects in the future by combining acute depletion of FUBP1 by means of degron tags with analysis of nascent RNA.

U2AF2 RRM2 and FUBP1 N-box interact with weak affinity in the micromolar range. Although it is likely that the simultaneous binding of U2AF2 and FUBP1 to the RNA further stabilizes this interaction, we cannot exclude the involvement of other factors.

In general, introns may be characterized by a multitude of features, among which length is just one. For example, intron length is known to correlate with elevated differential GC content and overall lower intron and exon GC content.65 In addition, genes with longer introns have been shown to preferentially localize to the nuclear periphery,64 and their transcripts therefore might interact with different splicing factors than for genes at the nuclear center. The question of whether these attributes rather complement each other or are causally related remains to be answered.

AUTHOR CONTRIBUTIONS
S.E., C.B., A. Busch, and A.D.L. performed the bioinformatic analyses. C.H., H.-S.K., and S.M.-L. performed the structural, biophysical, and biochemical experiments and analyses. M.M. Mulorz, A. Buchbender, F.X.R.S., L.L.A., H.H., K.T., and M.M. Möckel performed the functional genomics, in vitro iCLIP, and minigene reporter experiments. D.H., J.S., and M.W. performed the BRET experiments. P.K. and S.L. performed the mathematical modeling. I.E. performed the evolutionary analysis. S.E., C.H., M.M. Mulorz, K.Z., K.L., M.S., and J.K. designed the study and wrote the manuscript. All authors read and commented on the manuscript.

STAR★METHODS
Detailed methods are provided in the online version of this paper and include the following:
● KEY RESOURCES TABLE
● RESOURCE AVAILABILITY
  ○ Lead contact
  ○ Materials availability
  ○ Data and code availability
● EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS
  ○ RPE1 cell lines and culture conditions
  ○ HeLa cell line and culture conditions
  ○ HEK cell line and culture conditions
  ○ Recombinant protein expression
● METHOD DETAILS
  ○ Establishing FUBP1 KO/Nboxmut cell lines
  ○ Immunoblotting
  ○ RPE1 RNA-seq
  ○ HeLa RNA-seq
  ○ Semi-quantitative RT-PCR
  ○ In vivo iCLIP
  ○ In vitro iCLIP
  ○ Protein expression and purification
  ○ NMR spectroscopy
  ○ In vitro binding assays
  ○ BRET
● QUANTIFICATION AND STATISTICAL ANALYSIS
  ○ Preprocessing of RNA-seq data
  ○ Preprocessing of in vivo iCLIP data
  ○ Metaprofiles for in vivo iCLIP data
  ○ iCLIP binding site definition (peak calling)
  ○ Saturation analysis
  ○ Motif enrichment for in vivo iCLIP
  ○ Motif enrichment upstream of branch points
  ○ Abundance of FUBP1 motif at 3′ splice sites
  ○ Analysis of in vitro iCLIP data
  ○ Intron length analyses of RNA-seq data
  ○ ENCODE data analysis
  ○ Splicing changes upon FUBP1 LoF mutations
  ○ Mutations in FUBP1 in cancer patients
  ○ Scoring of splice site features
  ○ Evolutionary analyses
  ○ Analysis of RBP crosslinking to snRNAs
  ○ Subnuclear distribution of FUBP1-bound genes
  ○ Mathematical modeling

SUPPLEMENTAL INFORMATION
Supplemental information can be found online at https://doi.org/10.1016/j.molcel.2023.07.002.

ACKNOWLEDGMENTS
We thank all the members of the Luck, Sattler, and König labs for their help and discussion. We thank Malgorzata Rogalska and Juan Valcárcel for discussions and comments on the manuscript, Philipp Trepte and the Wanker group for sharing protocols and reagents and for help in setting up BRET assays, Christian Schäfer for help with BRET assays, Eric Schumbera for help with BRET data processing, Fridolin Kielisch for help with statistical analyses, Mario Keller for bioinformatics advice, André Mourão for SNRPBP-rich plasmid, Sam Asami and Gerd Gemmecker for support with NMR experiments, Manuel Kaulich for reagents, and Chris Smith and Jernej Ule for PTBP1-RB40 antibody and resequencing. We thank Adrian Neal for editing and commenting on the manuscript. We thank the Core Facilities at IMB, in particular Protein Production, Microscopy, Bioinformatics, Genomics, and Flow Cytometry.
We acknowledge IMB Genomics Core Facility and its NextSeq 500 sequencer (funded by the Deutsche Forschungsgemeinschaft [DFG, German Research Foundation] INST 247/870-1 FUGG) and access to NMR spectrometers at Bavarian NMR Center. This work was supported by DFG grants to K.L. (LU 2568/1-1; SFB1551 Project no. 464588647), J.K. (SPP1935 Project no. 273941853, KO4566/2-1, SFB1551 Project No. 464588647, TRR 319 Project no. 439669440, and GRK2526/1 Project no. 407023052), K.Z. (SPP1935 Project no. 273941853), S.L. (LE 3473/2–3), and M.S. (SPP1935 Project no. 273941853, SA823/10-1, and SFB1035 Project no. 201302640). C.H. acknowledges the Fonds der Chemischen Industrie for Kekulé fellowship, and S.M.-L. acknowledges EU Horizon 2020 Research and Innovation program under the Marie Skłodowska-Curie grant agreement No. 792692. J.S. acknowledges a PhD stipend from IMB’s collaborative research initiative.

DECLARATION OF INTERESTS
The authors declare no competing interests.

Received: January 4, 2023
Revised: May 19, 2023
Accepted: July 3, 2023
Published: July 27, 2023

REFERENCES
1. Seiler, M., Peng, S., Agrawal, A.A., Palacino, J., Teng, T., Zhu, P., Smith, P.G., Cancer Genome Atlas Research Network, Buonamici, S., and Yu, L. (2018). Somatic mutational landscape of splicing factor genes and their functional consequences across 33 cancer types. Cell Rep. 23, 282–296.e4. https://doi.org/10.1016/j.celrep.2018.01.088.
2. Bonnal, S.C., López-Oreja, I., and Valcárcel, J. (2020). Roles and mechanisms of alternative splicing in cancer – implications for care. Nat. Rev. Clin. Oncol. 17, 457–474. https://doi.org/10.1038/s41571-020-0350-x.
3. Gebauer, F., Schwarzl, T., Valcárcel, J., and Hentze, M.W. (2021). RNA-binding proteins in human genetic disease. Nat. Rev. Genet. 22, 185–198. https://doi.org/10.1038/s41576-020-00302-y.
4. Shi, Y. (2017). Mechanistic insights into precursor messenger RNA splicing by the spliceosome. Nat. Rev. Mol. Cell Biol. 18, 655–670. https://doi.org/10.1038/nrm.2017.86.
5. Wilkinson, M.E., Charenton, C., and Nagai, K. (2020). RNA splicing by the spliceosome. Annu. Rev. Biochem. 89, 359–388. https://doi.org/10.1146/annurev-biochem-091719-064225.
6. Wahl, M.C., Will, C.L., and Lührmann, R. (2009). The spliceosome: design principles of a dynamic RNP machine. Cell 136, 701–718. https://doi.org/10.1016/j.cell.2009.02.009.
7. Papasaikas, P., and Valcárcel, J. (2016). The spliceosome: the ultimate RNA chaperone and sculptor. Trends Biochem. Sci. 41, 33–45. https://doi.org/10.1016/j.tibs.2015.11.003.
8. Berglund, J.A., Abovich, N., and Rosbash, M. (1998). A cooperative interaction between U2AF65 and mBBP/SF1 facilitates branchpoint region recognition. Genes Dev. 12, 858–867. https://doi.org/10.1101/gad.12.6.858.
9. Liu, Z., Luyten, I., Bottomley, M.J., Messias, A.C., Houngninou-Molango, S., Sprangers, R., Zanier, K., Krämer, A., and Sattler, M. (2001). Structural basis for recognition of the intron branch site RNA by splicing factor 1. Science 294, 1098–1102. https://doi.org/10.1126/science.1064719.
10. Selenko, P., Gregorovic, G., Sprangers, R., Stier, G., Rhani, Z., Krämer, A., and Sattler, M. (2003). Structural basis for the molecular recognition between human splicing factors U2AF65 and SF1/mBBP. Mol. Cell 11, 965–976. https://doi.org/10.1016/s1097-2765(03)00115-1.
11. Kielkopf, C.L., Rodionova, N.A., Green, M.R., and Burley, S.K. (2001). A novel peptide recognition mode revealed by the X-ray structure of a core U2AF35/U2AF65 heterodimer. Cell 106, 595–605. https://doi.org/10.1016/s0092-8674(01)00480-9.
12. Wu, S., Romfo, C.M., Nilsen, T.W., and Green, M.R. (1999). Functional recognition of the 3′ splice site AG by the splicing factor U2AF35. Nature 402, 832–835. https://doi.org/10.1038/45590.
13. Merendino, L., Guth, S., Bilbao, D., Martínez, C., and Valcárcel, J. (1999). Inhibition of msl-2 splicing by Sex-lethal reveals interaction between U2AF35 and the 3′ splice site AG. Nature 402, 838–841. https://doi.org/10.1038/45602.
14. Agrawal, A.A., Salsi, E., Chatrikhi, R., Henderson, S., Jenkins, J.L., Green, M.R., Ermolenko, D.N., and Kielkopf, C.L. (2016). An extended U2AF(65)–RNA-binding domain recognizes the 3′ splice site signal. Nat. Commun. 7, 10950. https://doi.org/10.1038/ncomms10950.
15. Mackereth, C.D., Madl, T., Bonnal, S., Simon, B., Zanier, K., Gasch, A., Rybin, V., Valcárcel, J., and Sattler, M. (2011). Multi-domain conformational selection underlies pre-mRNA splicing regulation by U2AF. Nature 475, 408–411. https://doi.org/10.1038/nature10171.
16. Zamore, P.D., and Green, M.R. (1989). Identification, purification, and biochemical characterization of U2 small nuclear ribonucleoprotein auxiliary factor. Proc. Natl. Acad. Sci. USA 86, 9243–9247. https://doi.org/10.1073/pnas.86.23.9243.
17. Berglund, J.A., Chua, K., Abovich, N., Reed, R., and Rosbash, M. (1997). The splicing factor BBP interacts specifically with the pre-mRNA branchpoint sequence UACUAAC. Cell 89, 781–787. https://doi.org/10.1016/s0092-8674(00)80261-5.
18. Crisci, A., Raleff, F., Bagdiul, I., Raabe, M., Urlaub, H., Rain, J.-C., and Krämer, A. (2015). Mammalian splicing factor SF1 interacts with SURP domains of U2 snRNP-associated proteins. Nucleic Acids Res. 43, 10456–10473. https://doi.org/10.1093/nar/gkv952.
19. Wahl, M.C., and Lührmann, R. (2015). SnapShot: spliceosome dynamics I. Cell 161, 1474–1474e1. https://doi.org/10.1016/j.cell.2015.05.050.
20. Tholen, J., and Galej, W.P. (2022). Structural studies of the spliceosome: bridging the gaps. Curr. Opin. Struct. Biol. 77, 102461. https://doi.org/10.1016/j.sbi.2022.102461.
21. Ule, J., and Blencowe, B.J. (2019). Alternative splicing regulatory networks: functions, mechanisms, and evolution. Mol. Cell 76, 329–345. https://doi.org/10.1016/j.molcel.2019.09.017.
22. Zuo, P., and Maniatis, T. (1996). The splicing factor U2AF35 mediates critical protein-protein interactions in constitutive and enhancer-dependent splicing. Genes Dev. 10, 1356–1368. https://doi.org/10.1101/gad.10.11.1356.
23. Saulière, J., Sureau, A., Expert-Bezançon, A., and Marie, J. (2006). The polypyrimidine tract binding protein (PTB) represses splicing of exon 6B from the beta-tropomyosin pre-mRNA by directly interfering with the binding of the U2AF65 subunit. Mol. Cell. Biol. 26, 8755–8769. https://doi.org/10.1128/MCB.00893-06.
24. Soares, L.M.M., Zanier, K., Mackereth, C., Sattler, M., and Valcárcel, J. (2006). Intron removal requires proofreading of U2AF/3′ splice site recognition by DEK. Science 312, 1961–1965. https://doi.org/10.1126/science.1128659.
25. Warf, M.B., Diegel, J.V., von Hippel, P.H., and Berglund, J.A. (2009). The protein factors MBNL1 and U2AF65 bind alternative RNA structures to regulate splicing. Proc. Natl. Acad. Sci. USA 106, 9203–9208. https://doi.org/10.1073/pnas.0900342106.
26. Tavanez, J.P., Madl, T., Kooshapur, H., Sattler, M., and Valcárcel, J. (2012). hnRNP A1 proofreads 3′ splice site recognition by U2AF. Mol. Cell 45, 314–329. https://doi.org/10.1016/j.molcel.2011.11.033.
27. Zarnack, K., König, J., Tajnik, M., Martincorena, I., Eustermann, S., Stévant, I., Reyes, A., Anders, S., Luscombe, N.M., and Ule, J. (2013). Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of Alu elements. Cell 152, 453–466. https://doi.org/10.1016/j.cell.2012.12.023.
28. Sutandy, F.X.R., Ebersberger, S., Huang, L., Busch, A., Bach, M., Kang, H.-S., Fallmann, J., Maticzka, D., Backofen, R., Stadler, P.F., et al. (2018). In vitro iCLIP-based modeling uncovers how the splicing factor U2AF2 relies on regulation by cofactors. Genome Res. 28, 699–713. https://doi.org/10.1101/gr.229757.117.
29. Voith von Voithenberg, L., Sánchez-Rico, C., Kang, H.-S., Madl, T., Zanier, K., Barth, A., Warner, L.R., Sattler, M., and Lamb, D.C. (2016). Recognition of the 3′ splice site RNA by the U2AF heterodimer involves a dynamic population shift. Proc. Natl. Acad. Sci. USA 113, E7169–E7175. https://doi.org/10.1073/pnas.1605873113.
30. Kang, H.-S., Sánchez-Rico, C., Ebersberger, S., Sutandy, F.X.R., Busch, A., Welte, T., Stehle, R., Hipp, C., Schulz, L., Buchbender, A., et al. (2020). An autoinhibitory intramolecular interaction proof-reads RNA recognition by the essential splicing factor U2AF2. Proc. Natl. Acad. Sci. USA 117, 7140–7149. https://doi.org/10.1073/pnas.1913483117.
31. Debaize, L., and Troadec, M.-B. (2019). The master regulator FUBP1: its emerging role in normal cell function and malignant development. Cell. Mol. Life Sci. 76, 259–281. https://doi.org/10.1007/s00018-018-2933-6.
32. Duncan, R., Bazar, L., Michelotti, G., Tomonaga, T., Krutzsch, H., Avigan, M., and Levens, D. (1994). A sequence-specific, single-strand binding protein activates the far upstream element of c-myc and defines a new DNA-binding motif. Genes Dev. 8, 465–480. https://doi.org/10.1101/gad.8.4.465.
33. Liu, J., Kouzine, F., Nie, Z., Chung, H.-J., Elisha-Feil, Z., Weber, A., Zhao, K., and Levens, D. (2006). The FUSE/FBP/FIR/TFIIH system is a molecular machine programming a pulse of c-myc expression. EMBO J. 25, 2119–2130. https://doi.org/10.1038/sj.emboj.7601101.
34. Cukier, C.D., Hollingworth, D., Martin, S.R., Kelly, G., Díaz-Moreno, I., and Ramos, A. (2010). Molecular basis of FIR-mediated c-myc transcriptional control. Nat. Struct. Mol. Biol. 17, 1058–1064. https://doi.org/10.1038/nsmb.1883.
35. Li, H., Wang, Z., Zhou, X., Cheng, Y., Xie, Z., Manley, J.L., and Feng, Y. (2013). Far upstream element-binding protein 1 and RNA secondary structure both mediate second-step splicing repression. Proc. Natl. Acad. Sci. USA 110, E2687–E2695. https://doi.org/10.1073/pnas.1310607110.
36. Hwang, I., Cao, D., Na, Y., Kim, D.-Y., Zhang, T., Yao, J., Oh, H., Hu, J., Zheng, H., Yao, Y., and Paik, J. (2018). Far upstream element-binding protein 1 regulates LSD1 alternative splicing to promote terminal differentiation of neural progenitors. Stem Cell Reports 10, 1208–1221. https://doi.org/10.1016/j.stemcr.2018.02.013.
37. Jacob, A.G., Singh, R.K., Mohammad, F., Bebee, T.W., and Chandler, D.S. (2014). The splicing factor FUBP1 is required for the efficient splicing of oncogene MDM2 pre-mRNA. J. Biol. Chem. 289, 17350–17364. https://doi.org/10.1074/jbc.M114.554717.
38. Miro, J., Laaref, A.M., Rofidal, V., Lagrafeuille, R., Hem, S., Thorel, D., Méchin, D., Mamchaoui, K., Mouly, V., Claustres, M., and Tuffery-Giraud, S. (2015). FUBP1: a new protagonist in splicing regulation of the DMD gene. Nucleic Acids Res. 43, 2378–2389. https://doi.org/10.1093/nar/gkv086.
39. Ni, X., Knapp, S., and Chaikuad, A. (2020). Comparative structural analyses and nucleotide-binding characterization of the four KH domains of FUBP1. Sci. Rep. 10, 13459. https://doi.org/10.1038/s41598-020-69832-z.
40. Wang, H., Zhang, R., Li, E., Yan, R., Ma, B., and Ma, Q. (2022). Pan-cancer transcriptome and immune infiltration analyses reveal the oncogenic role of far upstream element-binding protein 1 (FUBP1). Front. Mol. Biosci. 9, 794715. https://doi.org/10.3389/fmolb.2022.794715.
41. Elman, J.S., Ni, T.K., Mengwasser, K.E., Jin, D., Wronski, A., Elledge, S.J., and Kuperwasser, C. (2019). Identification of FUBP1 as a long tail cancer driver and widespread regulator of tumor suppressor and oncogene alternative splicing. Cell Rep. 28, 3435–3449.e5. https://doi.org/10.1016/j.celrep.2019.08.060.
42. Wang, J., Schultz, P.G., and Johnson, K.A. (2018). Mechanistic studies of a small-molecule modulator of SMN2 splicing. Proc. Natl. Acad. Sci. USA 115, E4604–E4612. https://doi.org/10.1073/pnas.1800260115.
43. König, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., Turner, D.J., Luscombe, N.M., and Ule, J. (2010). iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat.
mammalian cells. Mol. Syst. Biol. 14, e8071.
https://doi.org/10.15252/ Struct. Mol. Biol. 17, 909–915. https://doi.org/10.1038/nsmb.1838. msb.20178071. 44. Buchbender, A., Mutter, H., Sutandy, F.X.R., Körtel, N., Ha€nel, H., Busch, 59. Ignjatovic, T., Yang, J.-C., Butler, J., Neuhaus, D., and Nagai, K. (2005). A., Ebersberger, S., and König, J. (2020). Improved library preparation Structural basis of the interaction between P-element somatic inhibitor with the new iCLIP2 protocol. Methods 178, 33–48. https://doi.org/10. and U1-70k essential for the alternative splicing of P-element transpo- 1016/j.ymeth.2019.10.003. sase. J. Mol. Biol. 351, 52–65. https://doi.org/10.1016/j.jmb.2005. 04.077. 45. Valcárcel, J., Gaur, R.K., Singh, R., and Green, M.R. (1996). Interaction of U2AF65 RS region with pre-mRNA branch point and promotion of base 60. Labourier, E., Adams, M.D., and Rio, D.C. (2001). Modulation of pairing with U2 snRNA [corrected]. Science 273, 1706–1709. https:// P-element pre-mRNA splicing by a direct interaction between PSI and doi.org/10.1126/science.273.5282.1706. U1 snRNP 70K protein. Mol. Cell 8, 363–373. https://doi.org/10.1016/ s1097-2765(01)00311-2. 46. Singh, R., Valcárcel, J., and Green, M.R. (1995). Distinct binding specific- ities and functions of higher eukaryotic polypyrimidine tract-binding 61. Chung, H.-J., Liu, J., Dundr, M., Nie, Z., Sanford, S., and Levens, D. proteins. Science 268, 1173–1176. https://doi.org/10.1126/science. (2006). FBPs are calibrated molecular tools to adjust gene expression. 7761834. Mol. Cell. Biol. 26, 6584–6597. https://doi.org/10.1128/MCB.00754-06. 47. Sugimoto, Y., König, J., Hussain, S., Zupan, B., Curk, T., Frye, M., and 62. ENCODE Project Consortium (2012). An integrated encyclopedia of DNA Ule, J. (2012). Analysis of CLIP and iCLIP methods for nucleotide-resolu- elements in the human genome. Nature 489, 57–74. https://doi.org/10. tion studies of protein-RNA interactions. Genome Biol. 13, R67. https:// 1038/nature11247. doi.org/10.1186/gb-2012-13-8-r67. 
63. Luo, Y., Hitz, B.C., Gabdank, I., Hilton, J.A., Kagda,M.S., Lam, B., Myers, 48. Gozani, O., Potashkin, J., and Reed, R. (1998). A potential role for U2AF- Z., Sud, P., Jou, J., Lin, K., et al. (2020). New developments on the SAP 155 interactions in recruiting U2 snRNP to the branch site. Mol. Cell. Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Biol. 18, 4752–4760. https://doi.org/10.1128/MCB.18.8.4752. Res. 48. D882–D889. https://doi.org/10.1093/nar/gkz1062. 64. Tammer, L., Hameiri, O., Keydar, I., Roy, V.R., Ashkenazy-Titelman, A., 49. Xue, Y., Zhou, Y., Wu, T., Zhu, T., Ji, X., Kwon, Y.-S., Zhang, C., Yeo, G., Custódio, N., Sason, I., Shayevitch, R., Rodrı́guez-Vaello, V., Rino, J., Black, D.L., Sun, H., et al. (2009). Genome-wide analysis of PTB-RNA in- et al. (2022). Gene architecture directs splicing outcome in separate nu- teractions reveals a strategy used by the general splicing repressor to clear spatial regions. Mol. Cell 82, 1021–1034.e8. https://doi.org/10. modulate exon inclusion or skipping. Mol. Cell 36, 996–1006. https:// 1016/j.molcel.2022.02.001. doi.org/10.1016/j.molcel.2009.12.003. 65. Amit, M., Donyo, M., Hollander, D., Goren, A., Kim, E., Gelfman, S., Lev- 50. Llorian, M., Schwartz, S., Clark, T.A., Hollander, D., Tan, L.-Y., Spellman, Maor, G., Burstein, D., Schwartz, S., Postolsky, B., et al. (2012). R., Gordon, A., Schweitzer, A.C., de la Grange, P., Ast, G., and Smith, Differential GC content between exons and introns establishes distinct C.W.J. (2010). Position-dependent alternative splicing activity revealed strategies of splice-site recognition. Cell Rep. 1, 543–556. https://doi. by global profiling of alternative splicing events regulated by PTB. Nat. org/10.1016/j.celrep.2012.03.013. Struct. Mol. Biol. 17, 1114–1123. https://doi.org/10.1038/nsmb.1881. 66. Enculescu, M., Braun, S., Thonta Setty, S., Busch, A., Zarnack, K., König, 51. 
Shao, C., Yang, B.,Wu, T., Huang, J., Tang, P., Zhou, Y., Zhou, J., Qiu, J., 0 J., and Legewie, S. (2020). Exon definition facilitates reliable control ofJiang, L., Li, H., et al. (2014). Mechanisms for U2AF to define 3 splice alternative splicing in the RON proto-oncogene. Biophys. J. 118, 2027– sites and regulate alternative splicing in the human genome. Nat. 2041. https://doi.org/10.1016/j.bpj.2020.02.022. Struct. Mol. Biol. 21, 997–1005. https://doi.org/10.1038/nsmb.2906. 67. Luck, K., Kim, D.-K., Lambourne, L., Spirohn, K., Begg, B.E., Bian, W., 52. Valverde, R., Edwards, L., and Regan, L. (2008). Structure and function of Brignall, R., Cafarelli, T., Campos-Laborie, F.J., Charloteaux, B., et al. KH domains. FEBS J. 275, 2712–2726. https://doi.org/10.1111/j.1742- (2020). A reference map of the human binary protein interactome. 4658.2008.06411.x. Nature 580, 402–408. https://doi.org/10.1038/s41586-020-2188-x. 53. Fukumura, K., Yoshimoto, R., Sperotto, L., Kang, H.-S., Hirose, T., Inoue, 68. Briese, M., Haberman, N., Sibley, C.R., Faraway, R., Elser, A.S., K., Sattler, M., and Mayeda, A. (2021). SPF45/RBM17-dependent, but Chakrabarti, A.M., Wang, Z., König, J., Perera, D., Wickramasinghe, not U2AF-dependent, splicing in a distinct subset of human short introns. V.O., et al. (2019). A systems view of spliceosomal assembly and branch- Nat. Commun. 12, 4910. https://doi.org/10.1038/s41467-021-24879-y. points with iCLIP. Nat. Struct. Mol. Biol. 26, 930–940. https://doi.org/10. 54. Mackereth, C.D., and Sattler, M. (2012). Dynamics in multi-domain pro- 1038/s41594-019-0300-4. tein recognition of RNA. Curr. Opin. Struct. Biol. 22, 287–296. https:// 69. Cordiner, R.A., Dou, Y., Thomsen, R., Bugai, A., Granneman, S., and doi.org/10.1016/j.sbi.2012.03.013. Heick Jensen, T. (2023). Temporal-iCLIP captures co-transcriptional 55. Schneider, T., Hung, L.-H., Aziz, M., Wilmen, A., Thaum, S., Wagner, J., RNA-protein interactions. Nat. Commun. 14, 696. https://doi.org/10. 
Janowski, R., Mu€ller, S., Schreiner, S., Friedhoff, P., et al. (2019). 1038/s41467-023-36345-y. Combinatorial recognition of clustered RNA elements by themultidomain 70. Rappsilber, J., Ryder, U., Lamond, A.I., andMann,M. (2002). Large-scale RNA-binding protein IMP3. Nat. Commun. 10, 2266. https://doi.org/10. proteomic analysis of the human spliceosome. Genome Res. 12, 1231– 1038/s41467-019-09769-8. 1245. https://doi.org/10.1101/gr.473902. 56. Siomi, H., Matunis, M.J., Michael, W.M., and Dreyfuss, G. (1993). The 71. Makarov, E.M., Owen, N., Bottrill, A., and Makarova, O.V. (2012). pre-mRNA binding K protein contains a novel evolutionarily conserved Functional mammalian spliceosomal complex E contains SMN complex motif. Nucleic Acids Res. 21, 1193–1198. https://doi.org/10.1093/nar/ proteins in addition to U1 and U2 snRNPs. Nucleic Acids Res. 40, 2639– 21.5.1193. 2652. https://doi.org/10.1093/nar/gkr1056. 57. Beuth, B., Garcı́a-Mayoral, M.F., Taylor, I.A., and Ramos, A. (2007). 72. Sharma, S., Kohlstaedt, L.A., Damianov, A., Rio, D.C., and Black, D.L. Scaffold-independent analysis of RNA-protein interactions: the Nova-1 (2008). Polypyrimidine tract binding protein controls the transition from KH3-RNA complex. J. Am. Chem. Soc. 129, 10205–10210. https://doi. exon definition to an intron defined spliceosome. Nat. Struct. Mol. Biol. org/10.1021/ja072365q. 15, 183–191. https://doi.org/10.1038/nsmb.1375. 58. Trepte, P., Kruse, S., Kostova, S., Hoffmann, S., Buntru, A., 73. Hsiao, H.-H., Nath, A., Lin, C.-Y., Folta-Stogniew, E.J., Rhoades, E., and Tempelmeier, A., Secker, C., Diez, L., Schulz, A., Klockmeier, K., et al. Braddock, D.T. (2010). Quantitative characterization of the interactions (2018). LuTHy: a double-readout bioluminescence-based two-hybrid among c-myc transcriptional regulators FUSE, FBP, and FIR. technology for quantitative mapping of protein-protein interactions in Biochemistry 49, 4620–4634. https://doi.org/10.1021/bi9021445. 
Molecular Cell 83, 2653–2672, August 3, 2023 2669 61 ll OPEN ACCESS Article 74. Liu, J., He, L., Collins, I., Ge, H., Libutti, D., Li, J., Egly, J.M., and Levens, pre-mRNA splicing. RNA Biol. 18, 2576–2593. https://doi.org/10.1080/ D. (2000). The FBP interacting repressor targets TFIIH to inhibit activated 15476286.2021.1932360. transcription. Mol. Cell 5, 331–341. https://doi.org/10.1016/s1097- 92. Linares, A.J., Lin, C.-H., Damianov, A., Adams, K.L., Novitch, B.G., and 2765(00)80428-1. Black, D.L. (2015). The splicing regulator PTBP1 controls the activity of 75. Huang, J.-R., Warner, L.R., Sanchez, C., Gabel, F., Madl, T., Mackereth, the transcription factor Pbx1 during neuronal differentiation. ELife 4, C.D., Sattler, M., and Blackledge, M. (2014). Transient electrostatic inter- e09268. https://doi.org/10.7554/eLife.09268. actions dominate the conformational equilibrium sampled by multido- 93. Delaglio, F., Grzesiek, S., Vuister, G.W., Zhu, G., Pfeifer, J., and Bax, A. main splicing factor U2AF65: a combined NMR and SAXS study. (1995). NMRPipe: a multidimensional spectral processing system based J. Am. Chem. Soc. 136, 7068–7076. https://doi.org/10.1021/ja502030n. on UNIX pipes. J. Biomol. NMR 6, 277–293. https://doi.org/10.1007/ 76. Macias, M.J., Wiesner, S., and Sudol, M. (2002). WW and SH3 domains, BF00197809. two different scaffolds to recognize proline-rich ligands. FEBS Lett. 513, 94. Lee, W., Tonelli, M., and Markley, J.L. (2015). NMRFAM-SPARKY: 30–37. https://doi.org/10.1016/s0014-5793(01)03290-2. enhanced software for biomolecular NMR spectroscopy. Bioinformatics 77. Ball, L.J., Ku€hne, R., Schneider-Mergener, J., and Oschkinat, H. (2005). 31, 1325–1327. https://doi.org/10.1093/bioinformatics/btu830. Recognition of proline-rich motifs by protein-protein-interaction do- 95. Gu€ntert, P. (2009). Automated structure determination from NMR mains. Angew. Chem. Int. Ed. Engl. 44, 2852–2869. https://doi.org/10. spectra. Eur. Biophys. J. 38, 129–143. 
https://doi.org/10.1007/s00249- 1002/anie.200400618. 008-0367-z. 78. Zarrinpar, A., Bhattacharyya, R.P., and Lim, W.A. (2003). The structure 96. Shen, Y., Delaglio, F., Cornilescu, G., and Bax, A. (2009). TALOS+: a and function of proline recognition domains. Sci. STKE 2003, RE8. hybrid method for predicting protein backbone torsion angles from https://doi.org/10.1126/stke.2003.179.re8. NMR chemical shifts. J. Biomol. NMR 44, 213–223. https://doi.org/10. 79. Kofler, M.M., and Freund, C. (2006). The GYF domain. FEBS J. 273, 1007/s10858-009-9333-z. 245–256. https://doi.org/10.1111/j.1742-4658.2005.05078.x. 97. Rieping, W., Habeck, M., Bardiaux, B., Bernard, A., Malliavin, T.E., and 80. Sudol, M. (1996). Structure and function of the WW domain. Prog. Nilges, M. (2007). ARIA2: automated NOE assignment and data integra- Biophys. Mol. Biol. 65, 113–132. https://doi.org/10.1016/s0079- tion in NMR structure calculation. Bioinformatics 23, 381–382. https:// 6107(96)00008-9. doi.org/10.1093/bioinformatics/btl589. 81. Mayer, B.J. (2001). SH3 domains: complexity in moderation. J. Cell Sci. 98. Laskowski, R.A., Rullmannn, J.A., MacArthur, M.W., Kaptein, R., and 114, 1253–1263. https://doi.org/10.1242/jcs.114.7.1253. Thornton, J.M. (1996). Aqua and PROCHECK-NMR: programs for check- 82. Bell, M.V., Cowper, A.E., Lefranc, M.P., Bell, J.I., and Screaton, G.R. ing the quality of protein structures solved by NMR. J. Biomol. NMR 8, (1998). Influence of intron length on alternative splicing of CD44. Mol. 477–486. https://doi.org/10.1007/BF00228148. Cell. Biol. 18, 5930–5941. https://doi.org/10.1128/MCB.18.10.5930. 99. Bhattacharya, A., Tejero, R., andMontelione, G.T. (2007). Evaluating pro- 83. Fox-Walsh, K.L., Dou, Y., Lam, B.J., Hung, S.-P., Baldi, P.F., and Hertel, tein structures determined by structural genomics consortia. Proteins 66, K.J. (2005). The architecture of pre-mRNAs affects mechanisms 778–795. https://doi.org/10.1002/prot.21165. of splice-site pairing. Proc. Natl. 
Acad. Sci. USA 102, 16176–16181. 100. Koradi, R., Billeter, M., and Wu€thrich, K. (1996). MOLMOL: A program for https://doi.org/10.1073/pnas.0508489102. display and analysis of macromolecular structures. J. Mol. Graph. 14, 84. Dewey, C.N., Rogozin, I.B., and Koonin, E.V. (2006). Compensatory rela- 51–55. https://doi.org/10.1016/0263-7855(96)00009-4. tionship between splice sites and exonic splicing signals depending on 101. Schrödinger, L., and DeLano, W. (2020). PyMOL. http://www.pymol. the length of vertebrate introns. BMC Genomics 7, 311. https://doi.org/ org/pymol. 10.1186/1471-2164-7-311. 102. Schindelin, J., Arganda-Carreras, I., Frise, E., Kaynig, V., Longair, M., 85. Gelfman, S., Burstein, D., Penn, O., Savchenko, A., Amit, M., Schwartz, Pietzsch, T., Preibisch, S., Rueden, C., Saalfeld, S., Schmid, B., et al. S., Pupko, T., and Ast, G. (2012). Changes in exon-intron structure during (2012). Fiji: an open-source platform for biological-image analysis. Nat. vertebrate evolution affect the splicing pattern of exons. Genome Res. Methods 9, 676–682. https://doi.org/10.1038/nmeth.2019. 22, 35–50. https://doi.org/10.1101/gr.119834.110. 103. Coleman, T., Branch, M.A., and Grace, A. (1999). Optimization Toolbox. 86. De Conti, L., Baralle, M., and Buratti, E. (2013). Exon and intron definition For Use with MATLAB. User’s guide. The MathWorks Inc, Ver. 2. in pre-mRNA splicing. Wiley Interdiscip. Rev. RNA 4, 49–60. https://doi. 104. R Core Team (2016). R: A Language and Environment for Statistical org/10.1002/wrna.1140. Computing. R Foundation for Statistical Computing. http://www.R- 87. Schneider, M., Will, C.L., Anokhina, M., Tazi, J., Urlaub, H., and project.org/. Lu€hrmann, R. (2010). Exon definition complexes contain the tri-snRNP 105. Vaquero-Garcia, J., Barrera, A., Gazzara, M.R., González-Vallinas, J., and can be directly converted into B-like precatalytic splicing complexes. Lahens, N.F., Hogenesch, J.B., Lynch, K.W., and Barash, Y. (2016). A Mol. 
Cell 38, 223–235. https://doi.org/10.1016/j.molcel.2010.02.027. new view of transcriptome complexity and regulation through the lens 88. Sharma, S., Wongpalee, S.P., Vashisht, A., Wohlschlegel, J.A., and of local splicing variations. ELife 5, e11752. https://doi.org/10.7554/ Black, D.L. (2014). Stem-loop 4 of U1 snRNA is essential for splicing eLife.11752. and interacts with the U2 snRNP-specific SF3A1 protein during spliceo- 106. Dosch, J., Bergmann, H., Tran, V., and Ebersberger, I. (2023). FAS: as- some assembly. Genes Dev. 28, 2518–2531. https://doi.org/10.1101/ sessing the similarity between proteins using multi-layered feature archi- gad.248625.114. tectures. Bioinformatics 39, btad226. https://doi.org/10.1093/bioinfor- 89. Martelly, W., Fellows, B., Senior, K., Marlowe, T., and Sharma, S. (2019). matics/btad226. Identification of a noncanonical RNA binding domain in the U2 107. Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., snRNP protein SF3A1. RNA 25, 1509–1521. https://doi.org/10.1261/ Batut, P., Chaisson, M., and Gingeras, T.R. (2013). STAR: ultrafast uni- rna.072256.119. versal RNA-seq aligner. Bioinformatics 29, 15–21. https://doi.org/10. 90. Plaschka, C., Lin, P.-C., Charenton, C., and Nagai, K. (2018). 1093/bioinformatics/bts635. Prespliceosome structure provides insights into spliceosome assembly 108. Martin, M. (2011). Cutadapt removes adapter sequences from high- and regulation. Nature 559, 419–422. https://doi.org/10.1038/s41586- throughput sequencing reads. EMBnet J. 17, 10–12. https://doi.org/10. 018-0323-8. 14806/ej.17.1.200. 91. Martelly, W., Fellows, B., Kang, P., Vashisht, A., Wohlschlegel, J.A., and 109. Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, Sharma, S. (2021). Synergistic roles for human U1 snRNA stem-loops in M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., and Li, 2670 Molecular Cell 83, 2653–2672, August 3, 2023 62 ll Article OPEN ACCESS H. (2021). 
Twelve years of SAMtools and BCFtools. Gigascience 10, 125. Zwahlen, C., Gardner, K.H., Sarma, S.P., Horita, D.A., Byrd, R.A., and giab008. https://doi.org/10.1093/gigascience/giab008. Kay, L.E. (1998). An NMR experiment for measuring methyl-methyl 110. Liao, Y., Smyth, G.K., and Shi, W. (2014). featureCounts: an efficient gen- NOEs in 13C-labeled proteins with high resolution. J. Am. Chem. Soc. eral purpose program for assigning sequence reads to genomic features. 120, 7617–7625. https://doi.org/10.1021/ja981205z. Bioinformatics 30, 923–930. https://doi.org/10.1093/bioinformatics/ 126. Marsh, J.A., Singh, V.K., Jia, Z., and Forman-Kay, J.D. (2006). Sensitivity btt656. of secondary structure propensities to sequence differences between alpha- and gamma-synuclein: implications for fibrillation. Protein Sci. 111. Roehr, J.T., Dieterich, C., and Reinert, K. (2017). Flexbar 3.0 - SIMD and 15, 2795–2804. https://doi.org/10.1110/ps.062465306. multicore parallelization. Bioinformatics 33, 2941–2942. https://doi.org/ 10.1093/bioinformatics/btx330. 127. Linge, J.P., Williams, M.A., Spronk, C.A.E.M., Bonvin, A.M.J.J., and Nilges, M. (2003). Refinement of protein structures in explicit solvent. 112. Krakau, S., Richard, H., and Marsico, A. (2017). PureCLIP: capturing Proteins 50, 496–506. https://doi.org/10.1002/prot.10299. target-specific protein-RNA interaction footprints from single-nucleotide CLIP-seq data. Genome Biol. 18, 240. https://doi.org/10.1186/s13059- 128. Bru€nger, A.T., Adams, P.D., Clore, G.M., DeLano,W.L., Gros, P., Grosse- 017-1364-2. Kunstleve, R.W., Jiang, J.S., Kuszewski, J., Nilges, M., Pannu, N.S., et al. (1998). Crystallography & NMR system: a new software suite for macro- 113. Lorenz, R., Bernhart, S.H., Höner Zu Siederdissen, C., Tafer, H., Flamm, molecular structure determination. Acta Crystallogr. D Biol. Crystallogr. C., Stadler, P.F., and Hofacker, I.L. (2011). ViennaRNA package 2.0. 54, 905–921. https://doi.org/10.1107/s0907444998003254. 
Algorithms Mol. Biol. 6, 26. https://doi.org/10.1186/1748-7188-6-26. 129. Messias, A.C., and Sattler, M. (2004). Structural basis of single-stranded 114. Huppertz, I., Attig, J., D’Ambrogio, A., Easton, L.E., Sibley, C.R., RNA recognition. Acc. Chem. Res. 37, 279–287. https://doi.org/10.1021/ Sugimoto, Y., Tajnik, M., König, J., and Ule, J. (2014). iCLIP: protein– ar030034m. RNA interactions at nucleotide resolution. Methods 65, 274–287. https://doi.org/10.1016/j.ymeth.2013.10.011. 130. Wiemann, S., Pennacchio, C., Hu, Y., Hunter, P., Harbers, M., Amiet, A., Bethel, G., Busse, M., Carninci, P., Dunham, I., et al. (2016). The 115. Spellman, R., Llorian, M., and Smith, C.W.J. (2007). Crossregulation and ORFeome Collaboration: a genome-scale human ORF-clone resource. functional redundancy between the splicing regulator PTB and its paral- Nature Methods 13, 191–192. ogs nPTB and ROD1. Mol. Cell 27, 420–434. https://doi.org/10.1016/j. 131. Frankish, A., Diekhans, M., Ferreira, A.-M., Johnson, R., Jungreis, I., molcel.2007.06.016. Loveland, J., Mudge, J.M., Sisu, C., Wright, J., Armstrong, J., et al. 116. Coelho, M.B., Attig, J., Bellora, N., König, J., Hallegger, M., Kayikci, M., (2019). GENCODE reference annotation for the human and mouse ge- Eyras, E., Ule, J., and Smith, C.W.J. (2015). Nuclear matrix protein nomes. Nucleic Acids Res. 47. D766–D773. https://doi.org/10.1093/ Matrin3 regulates alternative splicing and forms overlapping regulatory nar/gky955. networks with PTB. EMBO J. 34, 653–668. https://doi.org/10.15252/ 132. Busch, A., Bru€ggemann, M., Ebersberger, S., and Zarnack, K. (2020). embj.201489852. iCLIP data analysis: a complete pipeline from sequencing reads to 117. Grzesiek, S., and Bax, A. (1992). Correlating backbone amide and side RBP binding sites. Methods 178, 49–62. https://doi.org/10.1016/j. chain resonances in larger proteins by multiple relayed triple resonance ymeth.2019.11.008. NMR. J. Am. Chem. Soc. 114, 6291–6293. 
https://doi.org/10.1021/ 133. Paggi, J.M., and Bejerano, G. (2018). A sequence-based, deep learning ja00042a003. model accurately predicts RNA splicing branchpoints. RNA 24, 1647– 118. Sattler, M., Schleucher, J., and Griesinger, C. (1999). Heteronuclear 1658. https://doi.org/10.1261/rna.066290.118. multidimensional NMR experiments for the structure determination of 134. Hinrichs, A.S., Karolchik, D., Baertsch, R., Barber, G.P., Bejerano, G., proteins in solution employing pulsed field gradients. Prog. Nucl. Clawson, H., Diekhans, M., Furey, T.S., Harte, R.A., Hsu, F., et al. Magn. Reson. Spectrosc. 34, 93–158. https://doi.org/10.1016/s0079- (2006). The UCSC genome browser database: update 2006. Nucleic 6565(98)00025-9. Acids Res. 34. D590–D598. https://doi.org/10.1093/nar/gkj144. 119. Wishart, D.S., and Sykes, B.D. (1994). The 13C chemical-shift index: a 135. Green, C.J., Gazzara, M.R., and Barash, Y. (2018). MAJIQ-SPEL: web- simple method for the identification of protein secondary structure using tool to interrogate classical and complex splicing variations from RNA- 13C chemical-shift data. J. Biomol. NMR 4, 171–180. https://doi.org/10. Seq data. Bioinformatics 34, 300–302. https://doi.org/10.1093/bioinfor- 1007/BF00175245. matics/btx565. 120. Saitô, H. (1986). Conformation-dependent 13C chemical shifts: a new 136. Norton, S.S., Vaquero-Garcia, J., Lahens, N.F., Grant, G.R., and Barash, means of conformational characterization as obtained by high-resolution Y. (2018). Outlier detection for improved differential splicing quantifica- solid-state 13C NMR. Magn. Reson. Chem. 24, 835–852. https://doi.org/ tion from RNA-Seq experiments with replicates. Bioinformatics 34, 10.1002/mrc.1260241002. 1488–1497. https://doi.org/10.1093/bioinformatics/btx790. 121. Kjaergaard, M., and Poulsen, F.M. (2011). Sequence correction of 137. 
Zhang, J., Bajari, R., Andric, D., Gerthoffert, F., Lepsa, A., Nahal-Bose, random coil chemical shifts: correlation between neighbor correction H., Stein, L.D., and Ferretti, V. (2019). The International Cancer factors and changes in the Ramachandran distribution. J. Biomol. NMR Genome Consortium data portal. Nat. Biotechnol. 37, 367–369. https:// 50, 157–165. https://doi.org/10.1007/s10858-011-9508-2. doi.org/10.1038/s41587-019-0055-9. 122. Farrow, N.A., Muhandiram, R., Singer, A.U., Pascal, S.M., Kay, C.M., 138. Cerami, E., Gao, J., Dogrusoz, U., Gross, B.E., Sumer, S.O., Aksoy, B.A., Gish, G., Shoelson, S.E., Pawson, T., Forman-Kay, J.D., and Kay, L.E. Jacobsen, A., Byrne, C.J., Heuer, M.L., Larsson, E., et al. (2012). The cBio (1994). Backbone dynamics of a free and phosphopeptide-complexed cancer genomics portal: an open platform for exploring multidimensional Src homology 2 domain studied by 15N NMR relaxation. Biochemistry cancer genomics data. Cancer Discov. 2, 401–404. https://doi.org/10. 33, 5984–6003. https://doi.org/10.1021/bi00185a040. 1158/2159-8290.CD-12-0095. 123. Mulder, F.A., Schipper, D., Bott, R., and Boelens, R. (1999). Altered flex- 139. Gao, J., Aksoy, B.A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S.O., ibility in the substrate-binding site of related native and engineered high- Sun, Y., Jacobsen, A., Sinha, R., Larsson, E., et al. (2013). Integrative alkalineBacillus subtilisins. J. Mol. Biol. 292, 111–123. https://doi.org/10. analysis of complex cancer genomics and clinical profiles using 1006/jmbi.1999.3034. the cBioPortal. Sci. Signal. 6, pl1. https://doi.org/10.1126/scisignal. 124. Williamson, M.P. (2013). Using chemical shift perturbation to character- 2004088. ise ligand binding. Prog. Nucl. Magn. Reson. Spectrosc. 73, 1–16. 140. Karczewski, K.J., Francioli, L.C., Tiao, G., Cummings, B.B., Alföldi, J., https://doi.org/10.1016/j.pnmrs.2013.02.001. 
Wang, Q., Collins, R.L., Laricchia, K.M., Ganna, A., Birnbaum, D.P., Molecular Cell 83, 2653–2672, August 3, 2023 2671 63 ll OPEN ACCESS Article et al. (2020). Themutational constraint spectrum quantified from variation dence. Nucleic Acids Res. 46. D1062–D1067. https://doi.org/10.1093/ in 141,456 humans. Nature 581, 434–443. https://doi.org/10.1038/ nar/gkx1153. s41586-020-2308-7. 144. Yeo, G., and Burge, C.B. (2004). Maximum entropy modeling of short 141. Tate, J.G., Bamford, S., Jubb, H.C., Sondka, Z., Beare, D.M., Bindal, N., sequence motifs with applications to RNA splicing signals. J. Comput. Boutselakis, H., Cole, C.G., Creatore, C., Dawson, E., et al. (2019). Biol. 11, 377–394. https://doi.org/10.1089/1066527041410418. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids 145. Birikmen, M., Bohnsack, K.E., Tran, V., Somayaji, S., Bohnsack, M.T., Res. 47. D941–D947. https://doi.org/10.1093/nar/gky1015. and Ebersberger, I. (2021). Tracing eukaryotic ribosome biogenesis 142. Grossman, R.L., Heath, A.P., Ferretti, V., Varmus, H.E., Lowy, D.R., factors into the archaeal domain sheds light on the evolution of functional Kibbe, W.A., and Staudt, L.M. (2016). Toward a shared vision for cancer complexity. Front. Microbiol. 12, 739000. https://doi.org/10.3389/fmicb. genomic data. N. Engl. J. Med. 375, 1109–1112. https://doi.org/10.1056/ 2021.739000. NEJMp1607591. 146. Smith, T., Heger, A., and Sudbery, I. (2017). UMI-tools: modeling 143. Landrum, M.J., Lee, J.M., Benson, M., Brown, G.R., Chao, C., sequencing errors in Unique Molecular Identifiers to improve quantifica- Chitipiralla, S., Gu, B., Hart, J., Hoffman, D., Jang, W., et al. (2018). tion accuracy. Genome Res. 27, 491–499. https://doi.org/10.1101/gr. ClinVar: improving access to variant interpretations and supporting evi- 209601.116. 
STAR★METHODS

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Antibodies
Rabbit anti-FUBP1 | GeneTex | Cat# GTX104579; RRID: AB_11165485
Mouse anti-U2AF2 | Sigma-Aldrich | Cat# U4758; RRID: AB_262122
Mouse anti-SF3B1 | MBL | Cat# D221-3; RRID: AB_592712
Mouse anti-SF1 | Abnova | Cat# H00007536-M01A; RRID: AB_10774630
Rabbit anti-PTBP1 | Christopher Smith | Linares et al.92
Mouse anti-vinculin | Sigma-Aldrich | Cat# V9264; RRID: AB_10603627
Goat anti-rabbit IgG, HRP-linked | Cell Signaling | Cat# 7074S; RRID: AB_2099233
Horse anti-mouse IgG, HRP-linked | Cell Signaling | Cat# 7076S; RRID: AB_330924

Bacterial and virus strains
DH5alpha | Invitrogen | Cat# 18265017
MACH1 | Invitrogen | Cat# C862003
E. coli BL21-CodonPlus (DE3)-RIL | Agilent | Cat# 230245
E. coli BL21 (DE3) | Sigma-Aldrich | Cat# CMC0014

Chemicals, peptides, and recombinant proteins
FUGENE HD reagent | Promega | Cat# E2311
Lipofectamine CRISPRMAX reagent | Thermo Fisher | Cat# CMAX00001
Lipofectamine RNAimax | Thermo Fisher | Cat# 13778150
Lipofectamine 2000 | Invitrogen | Cat# 11668019
cOmplete Protease-Inhibitor Mix | Sigma-Aldrich | Cat# 4693159001
TURBO DNase | Thermo Fisher | Cat# AM2238
SuperSignal West PICO Chemiluminescent Substrate | Thermo Fisher | Cat# 15626144
4-thiouridine | Sigma-Aldrich | Cat# T4509-25MG
T4 RNA ligase | New England Biolabs | Cat# M0202S
T4 RNA ligase 1 | New England Biolabs | Cat# M0437M
pCp-Cy5 | Jena Bioscience | Cat# NU-1706-CY5
T7 RNA polymerase | Geerlof A., Protein Expression and Purification Facility, HMGU Munich | N/A
Pfu DNA Polymerase | Promega | Cat# M7741
OneTaq DNA Polymerase | New England Biolabs | Cat# M0480S
Phusion High-Fidelity DNA Polymerase | New England Biolabs | Cat# M0530S

Critical commercial assays
TranscriptAid Enzyme Mix | Thermo Fisher | Cat# K0441
GeneArt Genomic Cleavage Detection Assay | Thermo Fisher | Cat# A24372
Zero Blunt TOPO PCR Cloning Kit | Thermo Fisher | Cat# 451245
RNeasy PLUS Mini Kit | Qiagen | Cat# 74034
TruSeq library preparation Kit “Ribo-Zero Gold” | Illumina | Cat# 20040526
RevertAid First Strand cDNA Synthesis Kit | Thermo Fisher | Cat# 10161310
Q5 Site-Directed Mutagenesis Kit | New England Biolabs | Cat# E0552S
High Sensitivity D1000 ScreenTape | Agilent | Cat# 5067-5584
High Sensitivity RNA ScreenTape | Agilent | Cat# 5067-5579
NuPAGE 1 mm, 4-12% Bis-Tris Mini Protein Gel | Thermo Fisher | Cat# 12090156
HiScribe T7 High Yield RNA Synthesis Kit | New England Biolabs | Cat# E2040S
ProNex Dual Size-Selective Purification System | Promega | Cat# NG2002
BP clonase II mix kit | Invitrogen | Cat# 10348582
LR clonase technology | Invitrogen | Cat# 11791020

Deposited data
in vitro and in vivo iCLIP and RNA-Seq data | This study | GEO: GSE220186
Kinetic modeling of cassette exon splicing | This study | https://doi.org/10.5281/zenodo.8076768
Protein structure data | This study | PDB: 8P25
NMR data | This study | BMRB: 34816
Original Western blot, gel images and capillary electrophoresis images | This study, Mendeley Data | https://doi.org/10.17632/nj8ybm8vb2.1
RNA-Seq data: control and shRNA knockdown for FUBP1 in K562 cells | Luo et al.63, ENCODE Project Consortium62 | ENCODE: ENCSR260BQC (control) and ENCSR608IXR (FUBP1 KD)
Differentially spliced junctions in splicing factor mutations | Seiler et al.1 | Table S3 in Seiler et al.

Experimental models: Cell lines
human: HeLa | ATCC | Cat# CCL-2, RRID:CVCL_0030
human: RPE1 FUBP1 WT: hTERT-RPE1 NatNeo Cas9 Mono Puro sens | Manuel Kaulich | N/A
human: RPE1 FUBP1 KO: hTERT-RPE1 NatNeo Cas9 Mono Puro sens FUBP1 -/- | This study | N/A
human: RPE1 FUBP1 Nbox-mut: hTERT-RPE1 NatNeo Cas9 Mono Puro sens FUBP1 indel 31-40 | This study | N/A
human: HEK293 | DSMZ | ACC305

Oligonucleotides
See Table S5 (too many oligos to list here) | N/A

Recombinant DNA
See Table S6 (too many plasmids to list here) | N/A

Software and algorithms
Topspin 3.5 | Bruker | https://www.bruker.com/en/products-and-solutions/mr/nmr-software/topspin.html
NMRpipe | Delaglio et al.93 | https://www.ibbr.umd.edu/nmrpipe/index.html
NMRFAM-Sparky | Lee et al.94 | https://nmrfam.wisc.edu/nmrfam-sparky-distribution/
CYANA 3.98.13 | Güntert95 | https://cyana.org/wiki/Main_Page
TALOS+ | Shen et al.96 | https://spin.niddk.nih.gov/bax/software/TALOS/
ARIA2.3 | Rieping et al.97 | http://aria.pasteur.fr/
ProcheckNMR | Laskowski et al.98 | https://www.ebi.ac.uk/thornton-srv/software/PROCHECK/
PSVS | Bhattacharya et al.99 | https://montelionelab.chem.rpi.edu/PSVS/PSVS/
MolMol | Koradi et al.100 | https://sourceforge.net/p/molmol/wiki/Home/
PYMOL | Schrödinger and DeLano101 | https://pymol.org/2/
ImageJ 2.1.0 | Schindelin et al.102 | https://imagej.net/
MicroCal PEAQ-ITC Analysis software | Malvern Panalytical | https://www.malvernpanalytical.com/
Agilent TapeStation Software 5.1 | Agilent | https://www.agilent.com
Image Lab 6.0.1 build 34 | Bio-Rad | https://www.bio-rad.com/
MATLAB | Coleman et al.103 | https://www.mathworks.com/
R 4.1.1 | R Core Team104 | https://www.r-project.org/
MAJIQ v2.3 | Vaquero-Garcia et al.105 | https://majiq.biociphers.org/
FAS | Dosch et al.106 | https://github.com/BIONF/FAS
fDOG | N/A | https://github.com/BIONF/fDOG
STAR | Dobin et al.107 | https://github.com/alexdobin/STAR
Cutadapt 2.4 | Martin108 | https://cutadapt.readthedocs.io/en/stable/
Samtools v1.9 | Danecek et al.109 | http://www.htslib.org/
Subread tool suite v1.6.2 | Liao et al.110 | https://subread.sourceforge.net/
FastQC v0.11.8 | N/A | https://www.bioinformatics.babraham.ac.uk/projects/fastqc
FASTX-Toolkit v0.0.14 | N/A | http://hannonlab.cshl.edu/fastx_toolkit/
seqtk v1.3 | N/A | https://github.com/lh3/seqtk/
Flexbar v3.4.0 | Roehr et al.111 | https://github.com/seqan/flexbar
PureCLIP v1.3.1 | Krakau et al.112 | https://github.com/skrakau/PureCLIP
ViennaRNA Package 2.4.17 | Lorenz et al.113 | https://www.tbi.univie.ac.at/RNA/

RESOURCE AVAILABILITY

Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Julian König (j.koenig@imb-mainz.de).

Materials availability
All unique/stable reagents generated in this study are available from the lead contact.

Data and code availability
• RNA-seq, in vivo and in vitro iCLIP data have been deposited at GEO and are publicly available as of the date of publication. Accession numbers are listed in the key resources table. Protein structures have been deposited to the Protein Data Bank and are available under the accession number 8P25. NMR data used for structure calculation are deposited in the BMRB under the accession code 34816. Original Western blot, gel images and capillary electrophoresis images have been deposited at Mendeley Data and are publicly available as of the date of publication. The DOI is listed in the key resources table.
• This paper analyses existing, publicly available data. These accession numbers for the datasets are listed in the key resources table.
• All original code has been deposited at GitHub and is publicly available as of the date of publication at https://doi.org/10.5281/zenodo.8076768.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS

RPE1 cell lines and culture conditions
The hTERT-RPE1 NatNeo Cas9 Mono Puro sens cell line was a generous gift of the Kaulich lab at the Frankfurt CRISPR/Cas Screening Center (FCSC) and is derived from the original hTERT RPE1 cells (ATCC, CRL-4000). Cells were grown and maintained in Dulbecco’s modified Eagle’s medium (DMEM): Nutrient Mixture F-12 (DMEM/F-12; Thermo Fisher 11530566), supplemented with 10% fetal bovine serum (FBS; PAN-Biotech), 2 mM glutamine (Thermo Fisher), 1% penicillin–streptomycin (Thermo Fisher), and 20 μg/ml hygromycin B (Thermo Fisher). Cells were incubated at 37 °C with 5% CO2. Subcultivation was performed with 3 ml of 0.1% trypsin every 2–3 days for 20 passages. After that, new cells were thawed from stocks containing 1 × 10⁶ cells in 1 ml of growth medium, supplemented with 10% DMSO and 50% FBS. For semi-quantitative RT-PCR, 1 × 10⁵ RPE1 cells were seeded into one well of a six-well plate (Falcon) one day prior to transfection. DNA (2 μg) was diluted in 100 μl of OptiMEM and transfected with 6.4 μl of Fugene HD reagent (Promega). Cells were incubated at 37 °C with 5% CO2 for 24 h before harvesting. For RNA-seq, 1.5 × 10⁶ cells were seeded in a 10-cm cell culture dish (Corning) 48 h prior to isolation.

HeLa cell line and culture conditions
HeLa cells (ATCC CCL-2) were grown and maintained in DMEM (Thermo Fisher), supplemented with 10% FBS, 2 mM glutamine (Thermo Fisher) and 1% penicillin–streptomycin (Thermo Fisher). Cells were incubated at 37 °C with 5% CO2.
Subcultivation was performed with 3 ml of 0.1% trypsin every 2–3 days for 20 passages. After that, new cells were thawed from stocks containing 1 × 10⁶ cells in 1 ml of growth medium, supplemented with 10% DMSO (Sigma) and 50% FBS.

HEK cell line and culture conditions
HEK293 cells (DSMZ) were grown and maintained in DMEM (Thermo Fisher), supplemented with 10% fetal bovine serum (PAN-Biotech), 2 mM glutamine (Thermo Fisher) and 1% penicillin–streptomycin (Thermo Fisher). Cells were incubated at 37 °C with 5% CO2. Subcultivation was performed with 1 ml of 0.05% trypsin every 2–3 days for up to 15 passages. Then, new cells were thawed from stocks containing 2 × 10⁶ cells in 1 ml of growth medium, supplemented with 10% DMSO (Sigma) and 90% FBS.

Recombinant protein expression
Proteins were expressed in E. coli BL21 (DE3) cells grown at 37 °C in LB medium or M9 minimal medium supplemented with 1 g/l ¹⁵NH₄Cl and 2 g/l ¹³C-glucose (uniformly labeled). Protein expression was induced with 1.0 mM isopropyl β-D-1-thiogalactopyranoside (IPTG).

METHOD DETAILS

Establishing FUBP1 KO/Nboxmut cell lines
FUBP1 was mutated and knocked out using the CRISPR/Cas9 system in hTERT-RPE1 NatNeo mono puro sens cells. This cell line is puromycin sensitive and expresses Streptococcus pyogenes Cas9 under neomycin resistance. For the creation of the FUBP1 KO and FUBP1-Nboxmut RPE1 cell lines, cells were cultured as described above with the addition of neomycin (G418, InvivoGen) to preserve Cas9 expression. Guide RNA (gRNA) was amplified from oligos #54 and #55 (Table S5) with Phusion Polymerase (New England Biolabs) and in vitro transcribed with TranscriptAid Enzyme Mix (Thermo Fisher) according to the manufacturer’s protocol. Cells were then transfected with the resulting gRNA using Lipofectamine CRISPRMAX (Thermo Fisher) according to the manufacturer’s protocol and incubated for 48 h. To assess the general editing efficiency, a GeneArt Genomic Cleavage Detection Assay (Thermo Fisher) was performed.
Edited cells were then sorted by fluorescence-activated cell sorting (FACS), and each cell was cultured in a separate well of a 96-well plate (Corning). From each clonal cell line, genomic DNA (gDNA) was isolated and amplified by PCR. The successful disruption of the targeted site was validated by restriction digestion and Sanger sequencing (StarSEQ GmbH, Mainz, Germany) of the colonies. To obtain the novel sequence of the targeted site on both alleles, gDNA was also cloned into TOPO vectors using the Zero Blunt TOPO PCR Cloning Kit (Thermo Fisher), and the obtained plasmids were Sanger-sequenced. All Sanger sequencing reactions were performed with oligo #56 (Table S5). The edited sequences led to mutated protein products, as shown in Figure S5G.

Immunoblotting
For each hTERT RPE1-derived cell line, 1 × 10⁶ cells were seeded on a 10-cm cell culture dish (Corning) and harvested after incubation for 48 h at 37 °C, 5% CO2. Cells were lysed for 15 min on ice in modified RIPA buffer containing 50 mM Tris-HCl, 150 mM NaCl, 1 mM EDTA, 1% NP-40 (Sigma), and 0.1% sodium deoxycholate (Sigma), supplemented with cOmplete Protease Inhibitor Mix (Sigma) and TURBO DNase (Thermo Fisher). Cell debris was pelleted by centrifugation at 16,000 × g for 15 min at 4 °C. The cleared protein lysate was transferred into a new reaction tube (Eppendorf) and the concentration was measured with a BCA Protein Assay Kit (Thermo Fisher). 20 μg of protein lysate was mixed with 4× NuPAGE LDS Sample Buffer and heated to 70 °C for 10 min. Samples were loaded onto a NuPAGE 1 mm, 4–12% Bis-Tris Mini Protein Gel (Thermo Fisher) and electrophoresis was performed at 180 V, 400 mA for 50 min on a NuPAGE Novex Gel System (Invitrogen). Protein transfer to a nitrocellulose membrane (VWR International) was performed at 30 V, 400 mA over 60 min using the same gel system. The membrane was blocked in 5% milk diluted in PBS-T.
The primary antibody (key resources table) was incubated overnight at 4 °C, and the secondary antibody was incubated for 60 min at room temperature. All antibodies were diluted in 5% milk–PBS-T. Between the blocking, primary antibody, and secondary antibody steps, the membrane was washed three times with PBS-T. Detection was performed with SuperSignal West PICO Chemiluminescent Substrate (Thermo Fisher) and a BioRad GelDoc (BioRad).

RPE1 RNA-seq
For RPE1 RNA sequencing (ID: imb_koenig_2020_12) and semi-quantitative RT-PCR analysis, RPE1 cells were grown as described above. Cells were washed once with DPBS and harvested with a cell scraper in 1 ml of DPBS. Suspensions were centrifuged at 1,000 × g for 1 min at 4 °C. RNA was isolated from cell pellets using an RNeasy PLUS Mini Kit (Qiagen) according to the manufacturer’s protocol. For sequencing, RNA concentration was measured by Qubit RNA BR Assay and integrity of the RNA was confirmed by Bioanalyzer RNA Nano Assay (Agilent). Ribosomal RNA was removed and the remaining RNA was reverse transcribed into cDNA using the TruSeq library preparation kit with Ribo-Zero Gold (Illumina). The libraries were sequenced on an Illumina NextSeq 500 sequencer as 159-nt single-end reads.

HeLa RNA-seq
200,000 cells were seeded per well in a six-well dish 24 h prior to siRNA treatment. RNA-seq to assess intron splicing in HeLa cells (ID: imb_koenig_2018_18) was performed in four replicates. HeLa cells underwent a control knockdown (KD) with no-target siRNA. Oligos #40–#43 (Table S5) were delivered into cells using 3 μl of Lipofectamine RNAimax (Thermo Fisher) in 100 μl of OptiMEM to achieve a final siRNA concentration of 20 nM. Cells were harvested after incubation for 48 h. RNA was isolated from cell pellets using an RNeasy PLUS Mini Kit (Qiagen) according to the manufacturer’s protocol.
RNA concentration was measured by Qubit RNA BR Assay, and integrity of the RNA was confirmed by Bioanalyzer RNA Nano Assay (Agilent). Ribosomal RNA was removed and the remaining RNA was reverse transcribed into cDNA using the TruSeq library preparation kit with Ribo-Zero Gold (Illumina). RNA-seq samples were sequenced on an Illumina NextSeq 500 sequencer as 84-nt single-end reads.

Semi-quantitative RT-PCR
The MPDZ minigene was created from HeLa gDNA extracts by amplification of chr9:13,183,353-13,189,041 with Phusion High-Fidelity Polymerase (New England Biolabs). The PCR fragment was cloned into a pCR2.1 vector by Gibson assembly (IMB Protein Production Core Facility). MPDZ introns were shortened using a Q5 Site-Directed Mutagenesis Kit (New England Biolabs), resulting in MPDZΔintron, which lacks chr9:13,186,637-13,188,633 and chr9:13,183,736-13,186,120, MPDZΔBS, lacking chr9:13,186,494-13,186,618 and chr9:13,183,632-13,186,718, and MPDZΔintron+ΔBS, lacking chr9:13,186,494-13,188,633 and chr9:13,183,632-13,186,120 (Figure S6B). The open reading frames for GFP and the FUBP1 variants (FUBP1FL, FUBP1ΔN, FUBP1A38D, FUBP1ΔC, FUBP1W586,615R) used in the complementation assay were integrated in a pcDNA5 vector containing a CMV promoter and an N-terminal GFP tag, which was then used to transform DH5alpha cells (Invitrogen). All expression vectors and minigenes are described in Table S6. Plasmid purification was performed with the Qiaprep Spin Miniprep Kit (Qiagen) or the Qiaprep Plasmid Plus Midi Kit (Qiagen). Sequences were verified by Sanger sequencing. All hTERT RPE1 cell lines were seeded, transfected, and harvested as described in the section "RPE1 cell lines and culture conditions". For complementation, an equimolar amount of expression vector and minigene was used. RNA was isolated with the RNeasy Plus Mini Kit (Qiagen) and reverse transcribed using the RevertAid First Strand cDNA Synthesis Kit (Thermo Fisher).
The minigene cDNA was then amplified using OneTaq DNA Polymerase according to the manufacturer’s protocol and oligos #57 and #58 as primers (Table S5). Splicing products were assessed on a High Sensitivity D1000 ScreenTape (Agilent) (Figure S6D). The percent spliced-in (PSI) value for the alternative exon was determined using the following formula: PSI = Inclusion / (Inclusion + Skipping). PSI values in the complementation experiment were normalized to the mean of the wild-type (WT) within each condition. Statistical significance was assessed by Student’s t-test and multiple testing correction was performed using the false discovery rate (FDR).

In vivo iCLIP
In vivo iCLIP was used to study protein–RNA interactions with individual nucleotide resolution.43 For the U2AF2 in vivo iCLIP study, data from two iCLIP experiments were combined. The first U2AF2 and PTBP1 in vivo iCLIP experiments were performed as previously described.114 The second U2AF2 in vivo iCLIP experiment as well as in vivo iCLIP experiments on FUBP1, SF1, and SF3B1 were performed using the iCLIP2 protocol as previously described.44 In brief, HeLa cells were irradiated (150 mJ/cm²) in a CL1000 UV crosslinker (UVP) to covalently link the RNA-binding proteins to the bound nucleic acids. For in vivo iCLIP of FUBP1, crosslinking was achieved by 4-thiouridine (4sU)-mediated crosslinking (see section below). During subsequent cell lysis, the lysate was DNase-treated with TURBO DNase (Thermo Fisher) and RNA was partially digested to create 50–200-nt fragments. Immunoprecipitation of the investigated proteins was performed with antibodies listed in the key resources table. The anti-PTBP1 antibody was a kind gift from Christopher Smith.115 Radioactive labeling at the 3′ end of the precipitated RNA enabled visualization of the RNP complex by SDS-PAGE and transfer to a nitrocellulose membrane. After recovery of protein–RNA complexes from the membrane, proteinase K digestion resulted in protein-free RNA.
cDNA was synthesized by reverse transcription, which stops at the crosslinked site, leading to truncated reads in the sequencing. The cDNA was cleaned twice using MyONE Silane beads (Thermo Fisher). PCR amplification and ProNex size selection were performed to amplify and purify the library, respectively. In vivo iCLIP libraries (except PTBP1 libraries) were sequenced on an Illumina NextSeq 500 sequencer as 92-nt single-end reads including a 6-nt (or 4-nt in the case of the first U2AF2 iCLIP) sample barcode as well as 5+4-nt (or 3+2-nt) unique molecular identifiers (UMIs). PTBP1 iCLIP libraries were sequenced on an Illumina GA-II machine116 and then re-sequenced on an Illumina HiSeq 2000 machine as 50-nt single-end reads including a 4-nt sample barcode and 3+2-nt UMIs.

4-thiouridine crosslinking of FUBP1 in vivo iCLIP
For the FUBP1 in vivo iCLIP, HeLa cells were 4sU-labeled by adding 0.1 M 4sU in DMSO to a final concentration of 100 μM in a 10-cm cell culture dish. Cells were incubated for 16 h at 37 °C, 5% CO2, with the exclusion of light. After incubation, the cells were moved onto ice, shielded from light and irradiated at 365 nm, 800 mJ. Then, iCLIP was performed as described above.

In vitro iCLIP
In vitro iCLIP measures the intrinsic RNA-binding affinity of an RNA-binding protein (RBP).28 To that end, recombinant proteins and in vitro transcripts resembling long natural transcripts28 or a large-scale RNA pool transcribed from an oligonucleotide library were mixed and subjected to UV crosslinking and immunoprecipitation of the RBP of interest.

Production of recombinant proteins
N-terminally 6xHis-tagged U2AF2RRM12 was purified as previously described.28 In brief, a recombinant construct (Table S6) was expressed in E. coli BL21-CodonPlus (DE3)-RIL cells (Agilent) for 3–4 h at 37 °C using LB medium and 1 mM IPTG.
U2AF2RRM12 was purified using Ni Sepharose 6 Fast Flow beads (GE Healthcare) according to the manufacturer’s protocol, and concentrated with Spin-X UF 500 5K MWCO columns (Corning) to a concentration of 1.156 mg/ml before being flash-frozen in liquid nitrogen and stored at −80 °C. All three N-terminally 6xHis-tagged FUBP1 protein variants (FUBP1FL, FUBP1ΔN, FUBP1N74; Table S6) were expressed overnight at 16 °C using LB medium and 1 mM IPTG. Cells were lysed in lysis buffer (50 mM Tris-Cl, pH 8.0, 500 mM NaCl, 1 mM DTT, 5% glycerol, EDTA-free cOmplete protease inhibitor cocktail), using a CF1 Cell Disrupter (Constant Systems). Lysates were cleared by centrifugation (40,000 × g, 30 min, 4 °C). Recombinant proteins were affinity-purified from cleared lysates using an NGC Quest Plus FPLC system (Bio-Rad) and a HisTrap FF 5 ml column (Cytiva) according to the manufacturers’ protocols. Full-length FUBP1FL and FUBP1ΔN proteins were diluted 1:10 in heparin binding buffer (30 mM Na-HEPES, 20 mM NaCl, 5% glycerol, 1 mM DTT, pH 7.4), loaded onto a Heparin HP 5 ml column (Cytiva) and eluted over 15 column volumes using a linear gradient of 20–1000 mM NaCl in the heparin binding buffer. All FUBP1 variants were concentrated using Amicon 15 ml spin concentrators (Merck Millipore) and subjected to gel filtration (Superdex 200 16/60 pg in 30 mM Na-HEPES, 100 mM NaCl, 1 mM DTT, 5% glycerol, pH 7.4). Peak fractions containing the recombinant proteins after gel filtration were pooled, and protein concentration was determined by absorbance spectroscopy using the respective extinction coefficient at 280 nm, before aliquots were flash-frozen in liquid nitrogen and stored at −80 °C. For the detailed workflow, log files can be requested from Dr. Julian König.

Preparation of long in vitro transcripts
Long in vitro transcripts were prepared as described in Sutandy et al.28 Minigene and spike-in RNAs were created by PCR amplification of DNA templates using Phusion High-Fidelity DNA Polymerase (New England Biolabs) according to the manufacturer’s protocol. In vitro transcription of gel-purified PCR products was performed using the HiScribe T7 High Yield RNA Synthesis Kit (New England Biolabs) according to the manufacturer’s instructions. RNA was isolated with the RNeasy Plus Mini Kit (Qiagen), followed by DNA digestion with TURBO DNase and another RNA extraction. RNA quality was verified by capillary electrophoresis using High Sensitivity RNA ScreenTape (Agilent). RNA concentration was measured with a Qubit RNA HS Assay Kit (Thermo Fisher). Aliquots of equimolar mixes of all minigenes as well as spike-in aliquots were stored at −80 °C.

In vitro iCLIP with long in vitro transcripts
In vitro iCLIP with long in vitro transcripts (ID: imb_koenig_2018_01_sub16) was performed for U2AF2RRM12 alone or supplemented with different FUBP1 variants. The experiment was performed with a pool of eight in vitro transcripts (C4BPB, MPDZ, MYC, MYL6, NF1, TENT2, PCBP2, and PTBP2, see GEO record GSE220183) as previously described.28 The in vitro transcripts were preheated for 5 min at 70 °C to minimize RNA secondary structure. Then, in vitro transcripts at a final concentration of 2 nM were added to 50 nM U2AF2RRM12 either alone (three replicates) or supplemented with either 50 nM FUBP1FL (two replicates), 50 nM FUBP1ΔN (two replicates), or 50 nM FUBP1N74 (two replicates). The mixtures were incubated at 37 °C for 5 min before UV irradiation at 50 mJ/cm².
The in vitro iCLIP reaction was spiked with 10 μl of crosslinked mixture containing 250 nM U2AF2RRM12 and 6 nM NUP133 in vitro transcript for normalization.28 Partial RNase digestion and DNase treatment, followed by the standard iCLIP protocol, were performed as described in the section "In vivo iCLIP". After reverse transcription, the cDNA was purified and libraries were generated according to the iCLIP2 protocol.44

Preparation of oligo-derived transcripts
A total of 1,998 DNA oligonucleotides were chosen to represent 182-nt regions around 3′ splice sites, including the last 132 nt upstream of a 3′ splice site and the first 50 nt of the downstream exon, preceded by 18 nt of T7 promoter sequence for the reverse transcription. The genomic coordinates of all regions represented in the oligonucleotide library are listed in GEO record GSE220183. The DNA oligonucleotides were purchased from TWIST Bioscience (South San Francisco, CA). Before in vitro transcription, L3 adapter ligation was performed. This was achieved by resuspending the DNA pellet in T4 RNA ligase (New England Biolabs) mix containing a 1:10 oligo/adapter ratio for high ligation efficiency. This mixture was incubated overnight at 16 °C at 1300 rpm and then inactivated at 98 °C for 5 min. L3-ligated DNA oligonucleotide (2.6 ng) was amplified using the Phusion High-Fidelity DNA Polymerase (New England Biolabs) according to the manufacturer’s protocol. Amplicons were purified twice using the ProNex Dual Size-Selective Purification System (Promega) with an optimized bead/library ratio of first 1.1× and then 0.5. Capillary electrophoresis with a High Sensitivity D1000 ScreenTape (Agilent) was used for quality control. Then, in vitro transcription was performed for 4 h at 37 °C by following the HiScribe T7 (New England Biolabs) protocol for short transcripts.
Subsequently, RNA was treated with TURBO DNase I and isolated using Qiagen’s protocol for "Total RNA containing small RNA from cells" (RNeasy Plus Mini Handbook, Appendix E) with the reagents mentioned above.

In vitro iCLIP on oligo-derived transcripts
For in vitro iCLIP with an oligonucleotide-derived RNA pool (ID: imb_koenig_2018_01_sub12), the oligonucleotide-derived transcript pool at a concentration of 50 nM was preheated for 5 min at 70 °C and incubated with 50 nM U2AF2RRM12 alone or with either 50 or 300 nM FUBP1FL (three replicates each) for 10 min before UV irradiation at 50 mJ/cm². iCLIP was performed as described in the section "In vivo iCLIP", omitting the partial RNase digestion and L3 linker ligation steps as they do not apply here. The reaction was spiked with a mix of ten 150-nt-long spike-in oligonucleotides for normalization (oligos #44–#53; Table S5).

Sequencing and data preprocessing
In vitro iCLIP libraries were sequenced on an Illumina NextSeq 500 sequencer as 150-nt single-end reads including a 6-nt sample barcode as well as 5+4-nt UMIs. The reads were bioinformatically preprocessed as described for in vivo iCLIP samples. The numbers of uniquely mapped reads for all in vitro iCLIP samples are given in Table S1.

Protein expression and purification
All plasmids encoding sequences of FUBP1, U2AF2, chimeric U2AF2linker-RRM2/FUBP1N-box (linked by a 14 GS linker), SF1, SNRPA, SNRPB, and PRPF40B were cloned into the pETM11 vector or pET24 vector with a His tag, His-GB1 tag, or His-protein A tag, followed by a TEV cleavage site. The point mutants of FUBP1 were generated by site-directed mutagenesis. All constructs are listed in Table S6. Recombinant proteins were expressed in E. coli BL21 (DE3) cells in LB medium or M9 minimal medium supplemented with 1 g/l ¹⁵NH₄Cl and 2 g/l ¹³C-glucose (uniformly labeled).
After growth of the bacterial cells to an OD600 value of 0.8, protein expression was induced with 1.0 mM IPTG followed by overnight expression at 18 °C. After resuspension in 50 mM Tris, pH 8.0, 500 mM NaCl, 10 mM imidazole (supplemented with lysozyme, 1 mg/ml DNase, 2 mM MgSO4, and protease inhibitor), the cells were lysed using a French press. Cleared lysates were added to Ni–NTA resin, washed with 2 M NaCl and eluted with 500 mM imidazole. The His tag was cleaved with His-tagged TEV protease at 4 °C overnight. The cleaved His tag, uncleaved protein, and TEV protease were then separated from the desired protein on a second Ni–NTA column. All proteins were further purified by ion-exchange chromatography on RESOURCE S or RESOURCE Q columns (Cytiva) (20 mM Tris, pH 8.0 or 20 mM sodium phosphate, pH 6.5, gradient from 0 to 1 M NaCl in 10 column volumes) followed by size-exclusion chromatography on a HiLoad 16/600 Superdex 75 column (GE Healthcare) (20 mM sodium phosphate, pH 6.5, 150 mM NaCl).

NMR spectroscopy
All NMR samples (¹³C¹⁵N- or ¹⁵N-labeled, as appropriate) were measured at concentrations of 0.1–1 mM in NMR buffer (20 mM sodium phosphate, pH 6.5, 50 mM NaCl, 2 mM DTT) containing 10% (v/v) D2O at 25 °C on 900-, 800-, 600-, or 500-MHz Bruker Avance NMR spectrometers (cryogenic triple-resonance gradient probes). The NMR spectra were processed with TOPSPIN 3.5 (Bruker) or NMRPipe93 and analyzed using NMRFAM-Sparky.94

Chemical shift assignment
Protein backbone assignments were obtained from standard HNCA, HNCACB, CBCA(CO)NH, HNHA backbone experiments. Specifically, for KH domains, the ¹H–¹⁵N HSQC spectrum of KH1–4 was first assigned, then corresponding assignments were transferred to the spectra of the individual and tandem KH domains. Further side-chain resonances were assigned using CC(CO)NH, HCC(CO)NH, hCCH-TOCSY and HcCH-TOCSY experiments.
The distance restraints for structure calculations were obtained from 3D ¹⁵N- and ¹³C-edited NOESY–HSQC experiments.117,118 Secondary structure propensities were derived from the difference of Cα and Cβ chemical shifts to the random coil shifts.119–121

Relaxation experiments
¹⁵N-relaxation experiments were recorded on an 800 MHz Bruker Avance NMR spectrometer at 25 °C. ¹⁵N T1 and T2 relaxation times were acquired from pseudo-3D HSQC experiments in an interleaved manner with eight relaxation delays for T1 (20, 60, 100, 200, 400, 600, 800, 1200 ms) and nine relaxation delays for T2 (16.96, 33.92, 67.84, 101.76, 135.68, 169.6, 254.4, 305.28, 339.2 ms).122 Residual relaxation rates were obtained by fitting the data to an exponential function using NMRFAM-Sparky.94

Titrations
For NMR titrations, ¹H–¹⁵N HSQC spectra were measured after each addition of titrant and the changes were visualized by calculating the chemical shift perturbation (CSP).123 The KD values were calculated from NMR titrations by plotting the CSP of eight selected peaks against the ligand concentration and fitting the data as previously described. Standard deviations of the mean were calculated from the KD values of the eight selected peaks.124

Structure calculation
To stabilize the U2AF2 and FUBP1 interaction, a chimeric construct of U2AF2RRM2 and FUBP1N-box was introduced for the subsequent structure determination (Table S6). Overall structural integrity of the chimeric construct and recapitulation of the interaction was confirmed by comparing ¹H–¹⁵N HSQC spectra of the chimeric construct to that of the intermolecular complex U2AF2RRM2–FUBP1N-box (Figures S3I and S3J).
CYANA3 (3.98.15) was used for automated NOE assignments and initial structure calculations.95 To overcome partial signal broadenings for the resonances at the interface of the two domains, possibly due to the weaker affinity, additional unambiguous intramolecular distance restraints from ¹³C-NOESY–HMQC and methyl-NOESY spectra were manually assigned and included in the structure calculation.125 A minimal number of typical hydrogen bonds, which were confirmed by ¹⁵N-edited NOESY and secondary structure propensity, was implemented to assist the initial folding during the structure calculation. Dihedral angle restraints were derived from SSP and ¹³C secondary chemical shifts using TALOS+, including resonances of Cα, Cβ, C, H, and N.96,126 For water refinement, distance restraints from CYANA3 considering an error of ± 0.5 Å were used. Water refinement127 of the 20 lowest-energy structures (500 initial structures) was performed with ARIA2.397 and CNS.128 The quality of the 10 final structures was evaluated by ProcheckNMR98 and PSVS.99 Ensemble structure root mean square (r.m.s.) deviations were calculated using MolMol100 and the ribbon representations were prepared in PyMOL (The PyMOL Molecular Graphics System, version 1.8.6.0, Schrödinger, LLC). Structural statistics are shown in Table 1.

Scaffold-independent analysis
For the initial screening, the 16 DNA pools of 5-mer DNA (Table S5, #63, IDT), used instead of RNA due to their similarity in binding, were generated by introducing a specific nucleotide at a designated position while randomizing the other four positions.
Titrations of 100 μM FUBP1 KH domain samples with the different DNA pools (0.5, 1.0, 2.0, and 4.0 molar equivalents of titrant to analyte) were performed at 25 °C in NMR buffer (20 mM sodium phosphate, pH 6.5, 50 mM NaCl, 2 mM DTT) containing 10% (v/v) D2O by recording SOFAST HMQC spectra on a 600 MHz Bruker Avance NMR spectrometer (cryogenic triple-resonance gradient probe). To compare and identify position-specific nucleotide preferences, we focused on a subset of 12 representative peaks that show clearly visible changes in chemical shift (fast-exchange regime) and are therefore involved in binding. CSPs of these peaks were calculated (see above) and the average CSPs of all peaks for each pool were normalized against the largest CSP calculated in the four pools to obtain a score for nucleotide preference at a specific position. The final optimized motifs were verified by comparing the chemical shift changes upon adding either DNA or RNA for all KH domains (Table S5, #67–72).129

In vitro binding assays

In vitro transcription
All RNA samples were in vitro transcribed using T7 RNA polymerase, precipitated by ethanol and purified by denaturing PAGE (12% polyacrylamide gel containing 8 M urea). The DNA templates for in vitro transcription are shown in Table S5 (Oligos #59–62). The gel slices were electro-eluted at 250 V in 0.5× TBE. To promote proper folding, the RNA samples were heated to 95 °C for 2 min and subsequently snap-cooled on ice before use.

Fluorescent EMSA
In vitro-transcribed RNA was fluorescently labeled by ligation of pCp-Cy5 to the 3′ end of the RNA with T4 RNA ligase 2. Subsequently, the reaction was purified using a spin column kit (Norgen Biotek Corp.). For binding studies, 100 nM labeled RNA in 20 mM sodium phosphate, pH 6.5, 50 mM NaCl and glycerol (15% final concentration) was incubated with increasing concentrations of FUBP1N-box+KH1-4 (amino acids 1–457) for 15 min.
Mixtures were loaded onto a 0.7% agarose gel. Gel electrophoresis was performed in 1× TBE buffer at 40 V for 4 h. Detection was performed using a Typhoon 9200 (GE Healthcare Life Sciences) at 649 nm. Data analysis was performed in ImageJ 2.1.0.102 Experiments were repeated to estimate the standard deviation of the mean.

Isothermal titration calorimetry
ITC experiments were performed on a MicroCal PEAQ-ITC instrument (Malvern Panalytical) in NMR buffer at 25 °C, using either non-isotopically labeled proteins as both analyte and titrant, or non-isotopically labeled protein as analyte and DNA oligonucleotides as titrant. U2AF2 constructs (concentration 15–30 μM) were titrated with FUBP1 N-terminal constructs (concentration 1.5–3.0 mM); FUBP1 double-KH domain constructs (concentration 20–30 μM) were titrated with DNA oligonucleotides (concentration 200–350 μM, Table S5, #64–66); in vitro-transcribed ssRNA (VPS13D, 15 μM) was titrated with FUBP1KH (150 μM). Binding affinity analysis was performed using MicroCal PEAQ-ITC Analysis Software (Malvern Panalytical). The standard deviations of the KD values were estimated based on the differences in triplicate measurements.

BRET

BRET plasmid construction
The donor and acceptor vectors pcDNA3.1-cmyc-NL-GW (Addgene plasmid ID #113446), pcDNA3.1-GW-NL-cmyc (Addgene plasmid ID #113447), pcDNA3.1-GW-mCit, pcDNA3.1-mCit-GW, as well as controls pcDNA3.1-NL-cmyc (Addgene plasmid ID #113442), pcDNA3.1-PA-mCit (Addgene plasmid ID #113443), and pcDNA3.1-PA-mCit-NL-cmyc (Addgene plasmid ID #113444) were kindly provided by the Wanker group (Max-Delbrück-Centrum für Molekulare Medizin, Germany). The GATEWAY entry vectors pDON221 and pDON223 were provided by the Vidal group (Dana Farber Cancer Institute, Boston, MA). All vectors were amplified and full-length sequenced using the primers given in Table S5.
Full-length wild-type ORFs to be cloned into GATEWAY entry vectors were amplified from a human ORFeome collection.¹³⁰ The ORFs were full-length sequenced using the primers shown in Table S5. ORFs of FUBP1, SNRNP70, and TCERG1 (Table S6) were PCR-amplified with primers #9–10, #27–28, and #33–34, respectively (Table S5), and shuttled into pDON223 using a BP clonase II mix kit (Invitrogen). The Q5 site-directed mutagenesis kit (Invitrogen) was used to produce the following mutants: pDON223-FUBP1_A38D, pDON223-FUPB1_W586R_W615R, and pDON223-FUBP1_1-530aa (Table S6). For BRET experiments, all cDNAs were shuttled from the entry vectors into the BRET destination vectors using LR clonase technology (Invitrogen) according to the manufacturer's protocol. After the LR cloning step, the inserts were partially sequence-confirmed. All primers used are given in Table S5 and all constructs are listed in Table S6.

Transfection

Human embryonic kidney 293 cells were transfected using Lipofectamine 2000 (Invitrogen) transfection reagent in Opti-MEM medium (Thermo Fisher) using the reverse transfection method according to the manufacturer's instructions. For BRET transfections, cells were seeded at a density of 4.0 × 10⁴ cells per well on a white 96-well microtiter plate (Greiner) in phenol-red-free, high-glucose DMEM medium (Thermo Fisher) supplemented with 5% FBS (Thermo Fisher). Transfections were performed with a total of 200 ng of DNA per well. If the amount of expression plasmid in a well was less than 200 ng, pcDNA3.1 (+) was used as carrier DNA to reach the total of 200 ng.

Molecular Cell 83, 2653–2672.e1–e15, August 3, 2023

Experiments

Cells were transfected with plasmids encoding the acceptor (50 ng DNA) and donor (1 ng DNA). The plate was incubated for 2 days at 37 °C, 5% CO₂, and 85% relative humidity prior to measurement. All measurements were performed on an Infinite M200 Pro microplate reader (Tecan).
First, 100 µl of the medium was aspirated from each well. The mCitrine fluorescence was measured in intact cells (excitation/emission 513/548 nm). Then, coelenterazine h (PJK Biotech GmbH) was added at a final concentration of 5 µM. The cells were briefly shaken and incubated for 15 min inside the plate reader. After incubation, total luminescence was measured first, followed by short-wavelength and long-wavelength luminescence measurements using the BLUE1 (370–480 nm) and GREEN1 (520–570 nm) filters at 1,000 ms integration time. Corrected BRET (cBRET) ratios were calculated as previously described.⁵⁸ In brief, for every transfected protein pair NL-A and mCit-B, the following two control pairs were measured: NL-Stop with mCit-B and NL-A with mCit-Stop. The larger BRET ratio of the two control pairs was subtracted from that of the test pair to correct for donor bleed-through, nonspecific binding to the tags, and background signal.

Saturation assay

For donor saturation experiments, 1 ng of donor DNA encoding NL-fused proteins was co-transfected with increasing amounts of acceptor DNA encoding mCitrine-fused proteins (10, 25, 50, 100, 200, 400 ng). Fluorescence, total luminescence, and BRET were measured as described above. BRET measurements were corrected for bleed-through using NL-Stop transfections. Fluorescence and total luminescence measurements were used to estimate the amounts of expressed proteins and to plot acceptor/donor ratios on the x-axis.

QUANTIFICATION AND STATISTICAL ANALYSIS

Preprocessing of RNA-seq data

Prior to genomic mapping, remaining adapter sequences were trimmed from the RNA-seq data of FUBP1 KO, FUBP1-Nboxmut, and WT control RPE1 cells using Cutadapt v2.4.¹⁰⁸ A minimal overlap of 1 nt between read and adapter was required, and only reads with a length of at least 50 nt after trimming were retained for further analysis (parameters: -O 1 -m 50).
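As an illustration of the corrected BRET calculation described in the BRET section above, the subtraction of the larger of the two control ratios can be sketched as follows. This is a minimal sketch, not the authors' analysis code: the function and argument names are hypothetical, and the three BRET ratios are assumed to have been computed beforehand from the luminescence readings.

```python
def corrected_bret(test_pair, nl_stop_with_mcit_b, nl_a_with_mcit_stop):
    """Return cBRET for one protein pair NL-A / mCit-B.

    test_pair            -- BRET ratio of NL-A co-expressed with mCit-B
    nl_stop_with_mcit_b  -- BRET ratio of the control NL-Stop + mCit-B
    nl_a_with_mcit_stop  -- BRET ratio of the control NL-A + mCit-Stop

    The larger of the two control ratios is subtracted, as stated in the text.
    """
    return test_pair - max(nl_stop_with_mcit_b, nl_a_with_mcit_stop)


# Hypothetical readings, for illustration only.
cbret = corrected_bret(0.62, 0.50, 0.47)
print(round(cbret, 2))
```

Subtracting the larger control is the more conservative choice: a pair only scores above zero if it exceeds both the donor-only and the acceptor-only background.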
Reads were mapped using STAR v2.6.1b,¹⁰⁷ allowing up to 4% of the mapped bases to be mismatched (--outFilterMismatchNoverLmax 0.04 --outFilterMismatchNmax 999) and using a splice junction overhang (--sjdbOverhang) of 83 nt for HeLa WT samples and of 158 nt for FUBP1 KO, FUBP1-Nboxmut, and WT control RPE1 cells. Genome assembly and annotation of GENCODE¹³¹ release 31 were used during mapping. Subsequently, secondary hits were removed using Samtools v1.9.¹⁰⁹ Exonic read counts per gene were extracted using featureCounts from the Subread tool suite v1.6.2¹¹⁰ with the non-default parameters --donotsort -s2.

Preprocessing of in vivo iCLIP data

Basic quality controls were conducted in FastQC v0.11.8 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc), and reads were filtered based on sequencing qualities (Phred score) in the sample barcode and UMI regions using the FASTX-Toolkit v0.0.14 (http://hannonlab.cshl.edu/fastx_toolkit/) and seqtk v1.3 (https://github.com/lh3/seqtk/). All reads with a Phred score below 10 in the sample barcode or UMI regions were discarded. Reads were de-multiplexed based on the sample barcode, which is found at positions 6–11 of the reads (for 6-nt sample barcodes) or at positions 4–7 (for a 4-nt sample barcode), using Flexbar v3.4.0.¹¹¹ Subsequently, barcode regions and adapter sequences were trimmed from read ends using Flexbar, requiring a minimal overlap of 1 nt between read and adapter and adding UMIs to the read identifiers. Reads shorter than 15 nt were discarded. All empty space and slash characters were removed from the read identifiers in the FASTQ files to prevent the information following them from being lost during mapping. The downstream analysis was done as described in Chapters 3.4, 4.1, and 4.2 of Ref. 132. Genome assembly and annotation of GENCODE¹³¹ release 31 were used during mapping with STAR v2.6.1b.¹⁰⁷ The number of crosslinking events and peaks is given in Table S1.
To assess the genomic distribution of iCLIP crosslink nucleotides, we used the following hierarchy: ncRNA > 3′ UTR > 5′ UTR > coding sequence (CDS) > 3′ splice site > 5′ splice site > intron > intergenic (Figure 1B). 3′ and 5′ splice site regions refer to 100 nt upstream/downstream. All other "deep-intronic" regions are called intronic regions.

Metaprofiles for in vivo iCLIP data

Four RNA-seq replicates from HeLa cells (imb_koenig_2018_18) served as the source for the identification of spliced introns. Mapping to the genome was performed in STAR v2.6.1b¹⁰⁷ (Table S1). Coordinates and the number of unique supporting junction reads ("ureads") of spliced introns were extracted from the SJ file output by STAR, which contains high-confidence splice junctions. In the following, introns from the SJ file are called "SJ introns". SJ introns had to meet a reproducibility criterion (detected in at least 3 out of 4 replicates). In addition, all overlapping SJ introns were removed. Finally, introns were overlaid with the GENCODE release 31 annotation and filtered for level < 3, transcript support level < 4, and gene_type and transcript_type equal to "protein coding". This resulted in 88,375 SJ introns. Branch point (BP) predictions were taken from LaBranchoR.¹³³ Because LaBranchoR is based on hg19, coordinates were converted to hg38 with the UCSC liftOver tool.¹³⁴ The median distance of BPs to 3′ splice sites was 25 nt. 88,008 out of 88,375 SJ introns had an annotated BP. Introns were further filtered for a minimum length of 100 nt and a maximum length of 17,000 nt. Metaprofiles were aligned at the BP. In vivo iCLIP replicates for each RBP were summed up, and a signal threshold of 10 in the metaprofile region (−200 nt to +50 nt with respect to the BP) was imposed. Crosslinking signals per intron were normalized by "ureads" and averaged per nucleotide over all introns.
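The thresholding, normalization, and averaging steps of the metaprofile can be sketched as follows. This is an illustrative reimplementation in Python rather than the R code used for the analyses; the input layout (one tuple of summed per-nucleotide signal and "ureads" per intron) and the reading of the threshold as "total window signal of at least 10" are assumptions.

```python
def metaprofile(introns, min_signal=10):
    """Average ureads-normalized crosslink signal per nucleotide.

    introns    -- list of (signal, ureads) tuples; `signal` holds the
                  replicate-summed crosslink counts for each position of the
                  BP-aligned window (all windows must have equal length)
    min_signal -- minimal total signal in the window for an intron to be kept
                  (assumed interpretation of the threshold of 10)
    """
    kept = [(signal, ureads) for signal, ureads in introns
            if sum(signal) >= min_signal and ureads > 0]
    if not kept:
        return []
    width = len(kept[0][0])
    profile = [0.0] * width
    for signal, ureads in kept:
        for i, value in enumerate(signal):
            profile[i] += value / ureads   # normalize by junction read support
    return [total / len(kept) for total in profile]


# Toy example with a 3-nt window; the second intron falls below the threshold.
toy = [([4, 6, 0], 2), ([1, 0, 0], 5), ([0, 10, 10], 10)]
print(metaprofile(toy))
```

Dividing by "ureads" makes introns from strongly and weakly expressed transcripts comparable before they are averaged.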
For display, the normalized signal was smoothed with a Gaussian window function with window size 10. Binding enrichment for RNA maps stratified by intron length and splice site features was calculated as the log₂ ratio of the area under the curve (AUC) of each feature bin to the AUC of the shortest intron class or the class with the weakest splice site feature. The following regions were used for AUC quantification, always with respect to the BP: [−100, −25] for FUBP1, [+5, +25] for U2AF2, [−10, +10] for SF1, and [−30, −10] for SF3B1. The minimum signal in each region served as a background proxy and was taken as the lower horizontal boundary above which the AUC was calculated. For RNA maps stratified by GC content, the average GC content in the exon was contrasted with the average GC content in the first 100 nt of the downstream intron. Signal values for RNA maps aligned at 5′ splice sites were not smoothed but normalized by the average signal in the first 100 nt of the intron.

RNA maps conditioned on exon rank: annotation of exons and downstream introns was extracted from GENCODE release 31. BPs were annotated as described above. SJ introns were matched to introns. Duplicated matches were resolved such that the intron with the shortest upstream exon was taken. Five exon rank classes were extracted: 1st exon, exon ranks in [2,5), [5,12), [12, 144], and second-to-last exon. In contrast to all other RNA maps, crosslinking signals per intron were normalized to the total crosslinking signal in the last 100 nt upstream of the 3′ splice site, because "ureads" correlates with exon rank and was thus not suitable as a normalization factor.

RNA maps conditioned on exon GC content: upstream exons were identified as for the exon rank RNA maps. Total exon GC content over exon length was extracted. Bins were as follows: [0.07, 0.41], (0.41, 0.46], (0.46, 0.53], (0.53, 0.6], (0.6, 0.91].
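Assigning an exon to one of the GC content bins listed above can be sketched as follows. The bin edges are those given in the text; the function names and the use of Python (instead of R) are illustrative assumptions.

```python
from bisect import bisect_left

# Upper (inclusive) edges of the exon GC content bins from the text;
# the first bin starts at 0.07 (inclusive).
EXON_GC_LOWER = 0.07
EXON_GC_UPPER_EDGES = [0.41, 0.46, 0.53, 0.60, 0.91]


def gc_content(sequence):
    """Fraction of G and C nucleotides over the full sequence length."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def gc_bin(gc, lower=EXON_GC_LOWER, upper_edges=EXON_GC_UPPER_EDGES):
    """Return the 0-based bin index, or None if gc lies outside all bins.

    Bins are closed on the right, e.g. (0.41, 0.46], so a value equal to an
    upper edge belongs to the bin that ends there.
    """
    if gc < lower or gc > upper_edges[-1]:
        return None
    return bisect_left(upper_edges, gc)


print(gc_bin(gc_content("GGCCAATT")))  # GC content 0.5 lies in (0.46, 0.53]
```

The same helper applies to the intron GC content bins if the corresponding edges are passed instead.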
RNA maps conditioned on intron GC content: total intron GC content over intron length was extracted. Bins were as follows: [0.14, 0.36], (0.36, 0.4], (0.4, 0.46], (0.46, 0.55], (0.55, 0.9].

RNA maps for fixed intron length/differential GC content architecture followed by subsequent conditioning on differential GC content/intron length: here, RNA binding profiles were first stratified on one class of intron length/differential GC content architecture, followed by stratification on all levels of the other factor. Binding for all RNA maps was quantified based on AUC as described above. Analyses were performed in R v4.1.1.¹⁰⁴

iCLIP binding site definition (peak calling)

Binding site definition for in vivo iCLIP was done with PureCLIP v1.3.1 on merged replicates.¹¹² PureCLIP was run with the options -iv 'chr1;chr2;chr3;' -ld -nt 4. The crosslink sites identified by PureCLIP were post-processed as previously described.¹³² In detail, individual crosslink sites within a distance of 5 nt were clustered into binding regions. The binding regions were then resized to obtain binding sites of uniform width. To compare binding sites of different RBPs, we opted for 5-nt binding sites (i.e., 2 nt on either side of the position with the maximum signal) for all of the RBPs investigated (FUBP1, U2AF2, SF3B1, SF1, PTBP1). Isolated crosslink sites and binding regions of 2 nt were removed. Binding regions ≤ 5 nt were centered on the position with the maximum crosslink signal and extended by 2 nt on either side. Binding regions > 5 nt were divided into regions of 5 nt by iteratively screening for the maximum signal and extending by 2 nt on either side, excluding overlaps between binding regions. Finally, at least three positions with crosslink events were required in order to keep only binding sites with sufficient support. To ensure sufficient support of binding sites in the individual replicates of the experiment, a reproducibility filter was applied.
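The clustering and resizing steps above can be sketched as follows. This is a simplified illustration, not the published post-processing pipeline: a single dictionary of crosslink counts stands in for both the PureCLIP sites and the underlying crosslink events, "within a distance of 5 nt" is read as neighboring sites at most 5 nt apart, and ties in signal are broken toward the upstream position. All names are hypothetical.

```python
def define_binding_sites(counts, max_gap=5, flank=2, min_positions=3):
    """Return sorted (start, end) tuples of 5-nt binding sites (inclusive).

    counts -- dict mapping genomic position to crosslink count
    """
    positions = sorted(p for p, c in counts.items() if c > 0)

    # 1) Cluster crosslink sites lying within max_gap nt of each other.
    regions = []
    for p in positions:
        if regions and p - regions[-1][-1] <= max_gap:
            regions[-1].append(p)
        else:
            regions.append([p])

    sites = []
    for region in regions:
        # 2) Drop isolated crosslink sites and regions spanning only 2 nt.
        if region[-1] - region[0] + 1 <= 2:
            continue
        # 3) Iteratively center a site on the strongest remaining position.
        candidates = set(region)
        while candidates:
            center = max(candidates, key=lambda p: (counts[p], -p))
            start, end = center - flank, center + flank
            # 4) Require enough positions with crosslink events in the site.
            supported = sum(1 for q in range(start, end + 1)
                            if counts.get(q, 0) > 0)
            if supported >= min_positions:
                sites.append((start, end))
            # Remaining centers must not produce overlapping sites.
            candidates = {q for q in candidates if abs(q - center) > 2 * flank}
    return sorted(sites)


toy_counts = {100: 1, 101: 5, 102: 2, 103: 1, 110: 3, 200: 4}
print(define_binding_sites(toy_counts))
```

Because every site is centered on a local signal maximum and has the same width, binding sites of different RBPs become directly comparable.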
To account for the varying number and size of replicates in each experiment, we filtered for those binding sites with a total number of crosslink events higher than the 10th percentile of the distribution of crosslink counts in the single replicate. In addition, a minimum of two crosslink events was required if the 10th percentile in the replicate was below this threshold. This was required in at least two out of three, three out of four, or three out of five replicates, depending on the number of replicates available for the respective experiment. The numbers of called binding sites per protein are given in Table S1.

Saturation analysis

Spliced introns were identified from four RNA-seq replicates in HeLa cells (imb_koenig_2018_18) as described above. Introns were retained if they were longer than 200 nt and if the 5′ splice site windows (the last 50 nt of the exon plus the first 75 nt of the intron) and 3′ splice site windows (the last 200 nt of the intron plus the first 20 nt of the exon) did not overlap. 3′ splice sites overlapping with noncoding and long noncoding RNAs were excluded, resulting in 98,328 3′ splice sites. These splice sites were binned into percentiles based on "ureads" (splice site usage) averaged over replicates. RBP binding sites were assigned to the curated 3′ splice sites (the last 200 nt of the intron), requiring full overlap. For each bin, the percentage of 3′ splice sites with at least one binding site for the specific RBP was calculated (Figure 1E).

Motif enrichment for in vivo iCLIP

Introns were defined based on the GENCODE annotation (release 31). The annotation was filtered for level < 3, transcript support level < 4, and gene_type and transcript_type equal to "protein coding", resulting in 202,623 introns. BP annotation was done as specified above. 200,199 out of 202,623 introns had an annotated BP. Introns were further filtered for overlaps and for having a length of at least 250 nt upstream of the defined BP.
The length requirement was set to ensure that the main position of FUBP1 binding was not confounded with the 5′ splice site signal. FUBP1 binding sites (n = 854,404) were filtered for positioning within a 150-nt window upstream of the BP, resulting in 167,408 binding sites. Binding sites were ranked by their normalized signal, that is, the signal in the extended binding site (5 nt ± 5 nt) over total intron signal over intron length. Disjunct 4-mer frequencies were counted in the top and bottom 20% of binding sites based on normalized signal to account for overall crosslinking preferences. Additionally, non-bound intronic regions in introns hosting the top 20% of FUBP1 binding sites were considered as an alternative background set. Here, disjunct 4-mer frequencies were calculated for all non-bound intronic regions, excluding a 20-nt region downstream of the 5′ splice site and a 150-nt region upstream of the BP. Enrichment was defined as the distance from each data point to the diagonal in a scatterplot comparing the top 20% versus the bottom 20% of binding sites and, alternatively, non-bound intronic sequences. Analyses were performed in R v4.1.1.¹⁰⁴

Motif enrichment upstream of branch points

Introns were extracted and BPs annotated as above (200,199 introns left). Introns were further filtered for a minimum length of 500 nt and for disjunct 200-nt windows upstream of the BP, resulting in 151,836 introns. Disjunct 4-mer frequencies were calculated in a position-wise manner in a 200-nt window upstream of the BP. Average background motif frequencies were calculated in a 100-nt window located 100 nt downstream of the 5′ splice site. Enrichment was defined as the distance from each data point to the diagonal in the scatterplot of position-wise frequencies versus average background frequencies.
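The diagonal-distance enrichment used in both motif analyses can be sketched as follows. This is an illustrative Python version under stated assumptions: "disjunct" is taken to mean non-overlapping occurrences of each 4-mer, frequencies are counts divided by the total count over all 4-mers, and the distance to the diagonal is reported with a sign (positive for 4-mers enriched in the foreground). The function names are hypothetical.

```python
from itertools import product
from math import sqrt

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def disjunct_kmer_frequencies(sequences):
    """Relative frequency of each 4-mer, counting non-overlapping occurrences.

    str.count() counts non-overlapping matches, so "TTTTTTTT" contributes
    two occurrences of "TTTT".
    """
    sequences = [s.upper() for s in sequences]
    counts = {k: sum(s.count(k) for s in sequences) for k in KMERS}
    total = sum(counts.values())
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


def diagonal_enrichment(foreground, background):
    """Signed perpendicular distance of each point (background, foreground)
    to the diagonal y = x of the scatterplot."""
    return {k: (foreground[k] - background[k]) / sqrt(2) for k in foreground}


# Toy sequences, for illustration only.
fg = disjunct_kmer_frequencies(["TTTTTTTT"])
bg = disjunct_kmer_frequencies(["ACGTACGT"])
enrichment = diagonal_enrichment(fg, bg)
print(max(enrichment, key=enrichment.get))
```

Points on the diagonal are equally frequent in both sets, so the distance measures how far a 4-mer departs from that expectation.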
Abundance of FUBP1 motif at 3′ splice sites

Disjunct motif occurrences were counted in a 75-nt window located 25 nt upstream of the BP. The background distribution was derived from the occurrences of nine randomly drawn motifs of length 4, repeated 100 times.

Analysis of in vitro iCLIP data

All samples were merged for binding site definition (peak calling) across replicates and conditions. Each in vitro transcript was divided into 9-nt windows, each shifted by one nucleotide. Windows were sorted by total signal, and a candidate peak list was generated while excluding overlapping peaks. A negative binomial distribution was fit (maximum likelihood fit) to the signals on the candidate peak list. All peaks with a total signal exceeding the 90% quantile of the theoretical distribution were retained for final processing (109 peaks, see GEO record GSE220183). The background ranges were the in vitro transcript regions minus the extended peaks (9 nt ± 5 nt). For quantifying the binding differences between conditions, replicates were averaged. Peak signals were normalized against background signals. RNA maps were based on the 21 3′ splice sites present in the in vitro transcripts. To correct for differences in expression, nucleotide-wise signals were normalized by total in vitro transcript signals. Subsequently, signals were summarized per nucleotide by the 75% quantile. Replicates were averaged and subjected to Gaussian window smoothing with window size 10 before display. All analyses were performed in R v4.1.1.¹⁰⁴

Analysis of oligo in vitro iCLIP

All data were normalized to the total signal of all available spike-ins. Values were then extracted either per nucleotide or per binding site. Binding site positions were taken from overlays with in vivo U2AF2 binding sites in the intronic part of the oligonucleotide. 1,831 oligonucleotides harbored a U2AF2 binding site in the intronic part (see GEO record GSE220183).
If multiple binding sites were present, the one with the highest average signal in the U2AF2 samples was taken as representative. For quantifying the effect of FUBP1 addition on U2AF2 binding sites, only those binding sites with a signal greater than the 25% quantile in one of the three replicates were considered, resulting in 1,504 binding sites. The absolute number of disjunct occurrences of the FUBP1 motif set ("TTTT" and all combinations of "TTT" and either one "A" or one "G") was counted in a 75-nt region located 25 nt upstream of the BP. All analyses were performed in R v4.1.1.¹⁰⁴

Intron length analyses of RNA-seq data

Splicing changes in FUBP1 KO and FUBP1-Nboxmut cells were analyzed with MAJIQ v2.2¹³⁵,¹³⁶ with default parameter settings. MAJIQ outputs local splice variations (LSVs), which were filtered as follows: for each LSV, the top two junctions in terms of absolute difference in junction usage (delta percent selected index, |ΔPSI|) were taken as representative of the LSV. At least one of these two junctions needed to have an absolute ΔPSI > 0.1 and a detection probability > 0.9 (skipped for control events). Subsequently, events were filtered for exon-skipping events. Each cassette exon was then annotated with its upstream and downstream intron: the genomic coordinates of the upstream/downstream intron were directly defined in "source"/"target" events, and the genomic coordinates of the respective other intron were extracted from the annotation (GENCODE release 31). Overlapping cassette exons were resolved such that the event with the largest |ΔPSI| was retained (Table S3). A two-tailed Wilcoxon rank-sum test was used to assess statistical significance.

ENCODE data analysis

We retrieved raw RNA-seq data derived from an shRNA-knockdown experiment for FUBP1 in the cell line K562 from the ENCODE data portal (https://www.encodeproject.org/), using the accession numbers ENCSR608IXR (FUBP1 KD) and ENCSR260BQC (control).
Alignment was performed in STAR (version 2.7.8a)¹⁰⁷ with standard ENCODE options. We applied MAJIQ v2.3¹³⁵,¹³⁶ to identify and quantify cassette exons in the RNA-seq data. First, a splice graph was built on the BAM files and the GENCODE gene annotation (v38, human genome version hg38). Then, the difference in junction usage between knockdown and control samples was calculated (as ΔPSI). Next, alternative splicing events such as cassette exons (CEs) were categorized and quantified in the splice graph using MAJIQ Modulizer. Probabilities were calculated for each junction, testing for |ΔPSI| > 0.05 (probability changing [Ps]) and |ΔPSI| < 0.02 (probability non-changing [Pn]). The MAJIQ Modulizer output was then processed in R, filtering for significantly regulated CEs and a control group of unregulated CEs. A CE is defined as significantly regulated if |ΔPSI| ≥ 0.055 for all junctions, Ps ≥ 0.9 for at least one junction pair (inclusion junction + skipping junction), the sign within both junction pairs is inverse, and within the junction pairs the lower |ΔPSI| is at least 50% of the higher |ΔPSI|. A CE is considered to be unregulated if Pn ≥ 0.5 and |ΔPSI| ≤ 0.02 for all junctions. Overall, this resulted in a total of 173 significantly regulated CEs and a control group of 1,910 unregulated CEs for further analysis. To categorize CEs into more included and less included, a representative ΔPSI was chosen for each CE based on the maximum |ΔPSI| of both inclusion junctions. Based on this, there were 30 more-included and 143 less-included exons.
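The classification rules for cassette exons can be sketched as follows. This is an illustrative Python version of the filtering that the text says was done in R; the data layout (two junction pairs per CE, each an inclusion junction plus a skipping junction), the class and function names, and the reading of "Ps ≥ 0.9 for at least one junction pair" as applying to both junctions of that pair are assumptions. The regulated-ΔPSI threshold is passed explicitly and is set to the value printed in the text.

```python
from dataclasses import dataclass


@dataclass
class Junction:
    dpsi: float            # difference in junction usage (knockdown - control)
    p_changing: float      # Ps
    p_nonchanging: float   # Pn


def classify_ce(pairs, min_abs_dpsi, min_ps=0.9, min_pn=0.5,
                max_abs_dpsi_unregulated=0.02, min_ratio=0.5):
    """pairs: two (inclusion_junction, skipping_junction) tuples of one CE."""
    junctions = [j for pair in pairs for j in pair]
    if all(j.p_nonchanging >= min_pn
           and abs(j.dpsi) <= max_abs_dpsi_unregulated for j in junctions):
        return "unregulated"
    regulated = (
        all(abs(j.dpsi) >= min_abs_dpsi for j in junctions)
        and any(all(j.p_changing >= min_ps for j in pair) for pair in pairs)
        # inclusion and skipping junction must change in opposite directions
        and all(inc.dpsi * skip.dpsi < 0 for inc, skip in pairs)
        # the weaker change must reach at least half of the stronger one
        and all(min(abs(inc.dpsi), abs(skip.dpsi))
                >= min_ratio * max(abs(inc.dpsi), abs(skip.dpsi))
                for inc, skip in pairs)
    )
    return "regulated" if regulated else "unclassified"


def direction(pairs):
    """More or less included, from the inclusion junction with maximum |dPSI|."""
    representative = max((inc.dpsi for inc, _ in pairs), key=abs)
    return "more included" if representative > 0 else "less included"


# Hypothetical cassette exon, for illustration only.
ce = [(Junction(-0.20, 0.95, 0.0), Junction(0.18, 0.93, 0.0)),
      (Junction(-0.15, 0.80, 0.0), Junction(0.18, 0.93, 0.0))]
print(classify_ce(ce, min_abs_dpsi=0.055), direction(ce))
```

Exons that satisfy neither rule set are left unclassified, which keeps the control group restricted to confidently unchanged events.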
Splicing changes upon FUBP1 LoF mutations

Significantly differentially spliced exon-skipping events upon (i) loss-of-function (LoF) mutations of FUBP1 in low-grade gliomas (37 events), (ii) FUBP1 siRNA knockdown in U87MG cells (109 events), and (iii) LoF mutations of other splicing factors (433 events) were extracted from Seiler et al.¹ Junction lengths comprise the upstream intron, the skipped exon, and the downstream intron. A two-tailed Wilcoxon rank-sum test was used to assess statistical significance.

Mutations in FUBP1 in cancer patients

We searched multiple databases to identify disease-related mutations within the FUBP1 gene. We focused on the minimal binding interface to U2AF2 (FUBP1 amino acids 25–56) to find mutations that potentially abolish the interaction with the U2AF2 RRM2 domain. The following databases were used: ICGC Data Portal,¹³⁷ cBioPortal,¹³⁸,¹³⁹ ExAC,¹⁴⁰ COSMIC,¹⁴¹ GDC Data Portal,¹⁴² gnomAD,¹⁴⁰ and ClinVar.¹⁴³ All cancer-related mutations in FUBP1 in the observed region and the underlying cancer types are listed in Figure S4B.

Scoring of splice site features

3′ and 5′ splice site strength was scored with MaxEntScan.¹⁴⁴ Py tract strength was determined as follows: a 39-nt region upstream of the AG dinucleotide at the 3′ splice site was screened with sliding windows of increasing length (width 5–30 nt) to identify the window with the highest Py tract strength. The Py tract strength of each window was calculated as the χ² test statistic with 1 degree of freedom, comparing the observed number of pyrimidines with the number expected under a uniform nucleotide distribution. In addition, candidate Py tracts were required to end within 10 nt upstream of the AG dinucleotide. Using this approach, the median length of the identified Py tracts was 16 nt. BP strength was assessed according to the U2 binding energy, that is, the number of hydrogen bonds between the candidate sequences and the BP binding sequences in the U2 snRNA.
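The sliding-window Py tract score described in this section can be sketched as follows. This is an illustrative Python version with hypothetical names. Two assumptions go beyond the text: the χ² statistic is computed over the two classes pyrimidine and purine (each expected in half of the positions), and only windows with more pyrimidines than expected are considered, since the statistic itself is symmetric.

```python
def best_py_tract(upstream, min_width=5, max_width=30, max_dist_to_ag=10):
    """Return (chi2, start, end) of the strongest Py tract, or None.

    upstream -- sequence directly upstream of the 3' splice site AG
                (39 nt in the text), written 5'->3'; `end` is exclusive
    """
    seq = upstream.upper().replace("U", "T")
    n = len(seq)
    best = None
    for width in range(min_width, min(max_width, n) + 1):
        for start in range(n - width + 1):
            end = start + width
            if n - end > max_dist_to_ag:
                continue                      # tract must end close to the AG
            window = seq[start:end]
            pyrimidines = window.count("C") + window.count("T")
            expected = width / 2              # uniform nucleotide distribution
            if pyrimidines <= expected:
                continue                      # assumption: Py-enriched only
            purines = width - pyrimidines
            chi2 = ((pyrimidines - expected) ** 2 / expected
                    + (purines - expected) ** 2 / expected)
            if best is None or chi2 > best[0]:
                best = (chi2, start, end)
    return best


# Toy 39-nt region: 23 purines followed by a 16-nt pyrimidine stretch.
toy = "GA" * 11 + "G" + "TTTTCTTTTCTTTTCT"
print(best_py_tract(toy))
```

Scanning all window widths lets the score pick the tract length itself, which is how a median tract length can be reported afterwards.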
Hydrogen bonds form between A:T (2 bonds), G:C (2 bonds), and G:U (1 bond; in fact also 2 bonds, but penalized for being a wobble base pair), with the BP nucleotide bulging out and being omitted from the pairings. The Vienna RNA package v2.4.17¹¹³ (RNAduplex) was used to determine the optimal hybridization structure between the U2 snRNA sequence (GUGUAGUA) and the motif (position −5 to +3, excluding the BP nucleotide). The predicted binding energy was determined as the sum of hydrogen bonds forming between complementary motif and U2 snRNA nucleotides.

Evolutionary analyses

We annotated the domain architecture of FUBP1 using the function annoFAS provided in the FAS package¹⁰⁶ (https://github.com/BIONF/FAS). The domain architecture-aware phylogenetic profile of FUBP1 across 174 mammals, 274 non-mammalian vertebrates, 277 invertebrates, 410 fungal species, 94 protozoa, and 145 plants was generated with the targeted ortholog search tool fDOG (https://github.com/BIONF/fDOG)¹⁴⁵ using human FUBP1 (UniProt: Q96AE4) as a seed. fDOG was run with the options --minDist class, --maxDist phylum, --checkCoorthologsRef, and --countercheck. Homo sapiens (GenBank: GCF000001405) served as the reference taxon. Intron length and GC content information was extracted from the respective gff and fasta files downloaded from NCBI RefSeq Genome. Intron length estimates and motif searches were performed in R v4.0.5. A/B box presence in the human proteome was determined as follows: in brief, we used the shell command grep to search for the regular expression "[ST][AK][QA]W..YY[RK]" in 19,519 human proteins encoded in the NCBI RefSeq Genome assembly GCF_000001405.39. The resulting three hits were NCBI: XP_011540693.1 (FUBP1, 2 motif instances), NCBI: NP_003925.1 (FUBP3, 1 motif instance), and NCBI: NP_001353228.1 (KHSRP, 3 motif instances).
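The A/B box search can be reproduced in spirit with Python's re module instead of grep. The sketch below is illustrative only: the regular expression is the one given in the text, whereas the function name and the toy sequences are hypothetical and are not real protein sequences.

```python
import re

# Regular expression for the A/B box, as given in the text.
AB_BOX = re.compile(r"[ST][AK][QA]W..YY[RK]")


def count_ab_boxes(proteins):
    """Map protein identifier to the number of non-overlapping motif
    instances, keeping only proteins with at least one hit.

    proteins -- dict of identifier -> amino acid sequence
    """
    hits = {}
    for name, sequence in proteins.items():
        n = len(AB_BOX.findall(sequence.upper()))
        if n:
            hits[name] = n
    return hits


# Made-up sequences, for illustration only.
toy_proteome = {
    "toy_one_box": "MSAQWAAYYRGG",
    "toy_two_boxes": "TKAWEEYYKGGGSAQWDDYYR",
    "toy_no_box": "MGGGWAAYYR",
}
print(count_ab_boxes(toy_proteome))
```

Unlike grep, which reports matching lines, this counts motif instances per protein, which corresponds to the instance counts listed for the three hits.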
For counting FUBP1 motif occurrences across species, intron definitions were extracted for all the species investigated, and motifs were counted in a 25-nt window located 25 nt upstream of the 3′ splice site.

Analysis of RBP crosslinking to snRNAs

In vivo iCLIP data from FUBP1, U2AF, SF1, SF3B1, and PTB were remapped to a custom database consisting of snRNAs, tRNAs, and rRNAs using STAR v2.7.3a.¹⁰⁷ Specifically, RNU1-1, RNU2-1, RNU4-1, RNU6-1, RNU5D-1, RNU7-1, RNU11, RNU12, RNU4ATAC, and RNU6ATAC were included. tRNA coordinates were retrieved from GtRNAdb (data release 19), from which "hg38-tRNAs.fasta", containing 429 high-confidence tRNA annotations, was downloaded. Because tRNAs carrying the same amino acid are highly similar, one representative tRNA was selected per amino acid (the tRNA with "1-1" in its name), resulting in 22 tRNAs. Finally, the following rRNAs were added: 12S_gi, 16S_gi, 18S_gi, 28S_gi, 5.8S_gi, and 5S_gi. Mapping was performed as follows: all sequences were extended by one additional base upstream of the sequence so that iCLIP coverage of reads starting directly at the 5′ end of the sequence could be displayed. tRNAs and snRNAs were extended by the actual base upstream of the sequence; rRNAs were extended by an "N". Reads were mapped per replicate with STAR v2.7.3a using the settings described above for in vivo iCLIP samples. The few reads that mapped to the minus strand were removed. Uniquely mapping reads were subjected to duplicate removal based on identical UMIs (--method unique) using UMI-tools v1.0.0.¹⁴⁶ Based on the remaining reads, iCLIP coverage profiles were exported, as well as count tables containing the number of reads overlapping the genomic ranges of the defined RNAs.
Subnuclear distribution of FUBP1-bound genes

The subnuclear spatial distribution of introns in HeLa cells was taken from Tammer et al.,⁶⁴ in which Chrom3D, a 3D genome-modeling tool that integrates 3D Hi-C data and ChIP-seq data, was used to assign distances from the nuclear center to topologically associated domains. The distance from the nuclear center is described by five concentric radial scopes, numbered 1 to 5 along the center–periphery axis. Our in vivo iCLIP data from SF3B1, FUBP1, and U2AF2 were then overlaid with the reported introns, and the percentage of bound introns was counted. Enrichment was calculated as the percentage of bound introns in each radial scope compared to the first.

Mathematical modeling

Topology of the exon definition model

Splicing reactions are catalyzed by the spliceosome, which recognizes splice site sequences and forms a catalytically active higher-order complex across introns. To model this process, we considered that human spliceosomes frequently operate by a so-called "exon definition" mechanism, in which the pioneering spliceosome subunits U1 and U2 cooperatively bind to the splice sites flanking an exon before the final cross-intron complex is formed during spliceosome maturation.⁸⁶ Because the initial binding of U1 and U2 plays a decisive role in splicing decisions,⁸⁶ we model only the initial exon definition step and assume that the corresponding binding patterns determine splicing outcomes, as described below. The pre-mRNA state in which none of the three exons is bound ("defined") by the spliceosome (white boxes) is denoted "P0_0_0" (Figure S7F), with "_" indicating the presence of an intron. In the model, the pre-mRNA (P0_0_0) is synthesized at a constant rate s. The spliceosome can bind reversibly to each of the exons with on-rates k1, k2, and k3.
For instance, from P0_0_0 we can obtain P1_0_0, P0_1_0, and P0_0_1 through binding to the first, second, and third exon, respectively. Subsequent binding is possible; for example, P1_0_1 can be generated from P1_0_0 with the rate constant k3. In total, there are eight spliceosomal binding states, including the fully bound state (P1_1_1), in which all exons are defined. All binding reactions are assumed to be reversible, i.e., k4, k5, and k6 are the dissociation rate constants and the reverse of k1, k2, and k3, respectively. For example, in state P1_1_0, spliceosome dissociation from exon 1 with the rate constant k4 yields the species P0_1_0.

Depending on the exon definition states, splicing decisions are made, and irreversible splicing reactions are possible. For a splicing event to occur, we consider that both exons flanking a future splice junction must be defined. For instance, skipping of exon 2 is possible from P1_0_1 and occurs with the rate constant i12. Likewise, splicing of the first intron occurs from the species P1_1_0 and P1_1_1 (rate constant i1), and splicing of the second intron from P0_1_1 and P1_1_1 (rate constant i2). The inclusion isoform is generated in two steps, that is, from the subsequent removal of introns 1 and 2 in random order: from the binding state P1_1_1, intron splicing generates two alternative intermediates in which either of the introns is already spliced (P1_11 or P11_1), and the retained intron can be further spliced in a subsequent reaction. Splicing of the partially defined species P1_1_0 and P0_1_1 yields the species P11_0 and P0_11; in these, the spliceosome can further reversibly bind exons 3 and 1, respectively, and undergo a second splicing reaction toward inclusion. In the model, all terminal splice products are subject to degradation (kincl, degradation rate constant of inclusion; kskip, skipping; kdr1, first intron retention; kdr2, second intron retention).
The degradation rate constant of the full intron retention isoform is the sum of kdr1 and kdr2, reflecting that either intron may contain a destabilizing premature stop codon. Model species that can be bound or spliced further (P0_0_0, P1_0_0, P0_1_0, P0_0_1, P1_1_0, P1_0_1, P0_1_1, P1_1_1, P0_11, P1_11, P11_0, P11_1) are not subject to degradation, but they can be exported from the nucleus with the rate constant kret. This reaction reflects that there is a limited time window for splicing to occur, the intermediates otherwise being terminally frozen in the corresponding intron retention state. The ordinary differential equations of the model are given in Table S4.

Topology of the intron definition model

Because a subset of human genes is spliced by an intron definition mechanism, we also considered this scenario in a modified version of our splicing model. In contrast to the exon definition model, the 5′ and 3′ splice sites of an exon can be bound independently of one another in the intron definition model. Furthermore, splicing of an intron is possible as soon as both splice sites flanking this intron are defined. Hence, definition of two splice sites is sufficient for splicing to occur, whereas in the exon definition model four splice sites need to be defined (3′ and 5′ splice sites of the two flanking exons). For the intron definition model, we use a notation for binding states similar to that for exon definition. For instance, for consistency, we assigned the state in which no spliceosome component is bound as P0_0_0. For spliceosome binding to exons 1 and 3, we again considered a single binding reaction, as only the splice sites flanking the intron of interest are relevant for splicing.
Hence, a transition from "0" to "1" in the first position (e.g., P0_0_0 to P1_0_0) represents a spliceosome binding state downstream of exon 1 (5' of the first intron), and "0" to "1" in the third position indicates binding upstream of exon 3 (3' of the second intron). For exon 2, we treat splice-site binding as two separate events. We use "0" to denote no binding, "a" for upstream binding (e.g., P0_a_0), "b" for downstream binding (e.g., P0_b_0), and "1" for both U2 and U1 being simultaneously bound (e.g., P0_1_0). Again, the presence or absence of "_" indicates whether or not the intron is removed. We adopted the same parameter notation, that is, k1/k4 and k3/k6 to describe binding/dissociation at exons 1 and 3, respectively. The new parameters k2a/k5a (upstream) and k2b/k5b (downstream) were introduced to represent spliceosome binding/dissociation around exon 2. There are a total of 16 spliceosomal binding states in the intron definition model, with the following additional states not part of the exon definition model: P0_a_0, P0_b_0, P1_a_0, P1_b_0, P0_a_1, P0_b_1, P1_a_1, and P1_b_1. If both splice sites flanking a future splice junction are defined, splicing decisions, implemented as irreversible splicing reactions in the model, can occur. Skipping of exon 2 is possible from P1_0_1 and occurs with the rate i12. Splicing of the first intron occurs from species P1_a_0, P1_1_0, P1_a_1, and P1_1_1 (rate i1), and splicing of the second intron occurs from P0_b_1, P0_1_1, P1_b_1, and P1_1_1 (rate i2). The inclusion isoform is generated in two steps: first, intron 1 or 2 is spliced from P1_1_1, generating P1_11 or P11_1, respectively. Second, the retained intron can be further spliced in a subsequent reaction. Splicing of the partially defined species P1_a_0, P1_1_0, P0_b_1, and P0_1_1 yields the species P1a_0, P11_0, P0_b1, and P0_11, respectively.
To these, the spliceosome can bind further reversibly with the association rate constants k1, k2a, k2b, and k3 (depending on the site of binding), and if the species P1_11 or P11_1 are formed, a second splicing reaction toward inclusion can occur. All terminal splice products are subject to degradation, for which we adopted the same assumptions and notation as for the exon definition model. Again, model species that can be bound or spliced further (P0_0_0, P1_0_0, P0_a_0, P0_b_0, P0_1_0, P0_0_1, P1_1_0, P1_a_0, P1_b_0, P1_0_1, P0_a_1, P0_b_1, P0_1_1, P1_a_1, P1_b_1, P1_1_1) can be exported from the nucleus with the rate constant kret. The ordinary differential equations of the model are given in Table S4.

Model simulation and analysis

The differential equations were implemented in Matlab 2020b and solved using ode15s. To analyze splicing outcomes, we assumed a steady state and performed numerical simulations over long time periods (t = 1,000,000 min) to ensure that the concentrations of the model species remained constant. Thus, we consider an RNA sequencing experiment in which gene expression was measured in a stationary cell population in the absence of any external perturbation. As a measure of splicing outcome, we used the steady-state concentrations of inclusion and skipping (see also below).

Genome-wide splicing modeling by parameter sampling

The exon definition model consists of 15 kinetic parameters, which belong to the following classes of reactions: spliceosome binding (k1, k2, k3), spliceosome dissociation (k4, k5, k6), splicing catalysis (i1, i2, i12), and others, which are rates of pre-mRNA synthesis (s), mRNA degradation (kincl, kskip, kdr1, kdr2), and terminal intron retention (kret). The values of these parameters were unknown and likely differ greatly between exons in the human genome.
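The state and parameter counts quoted for the two model variants can be cross-checked by direct enumeration. The following is a minimal sketch, not the authors' Matlab implementation: the function names and the list layout are hypothetical, and the splicing-competent states follow the listings in the text (skipping only from P1_0_1). The exon definition parameter list assumes the rate constants k1 to k6 introduced in the topology description.

```python
from itertools import product

def exon_definition_states():
    """Eight binding states: each of the three exons is undefined (0) or defined (1)."""
    return ["P{}_{}_{}".format(a, b, c) for a, b, c in product("01", repeat=3)]

def intron_definition_states():
    """Sixteen binding states: exon 2 is unbound (0), bound upstream (a),
    bound downstream (b), or bound on both sides (1)."""
    return ["P{}_{}_{}".format(a, b, c) for a, b, c in product("01", "0ab1", "01")]

def splicing_reactions(state):
    """Irreversible splicing reactions available from an unspliced binding state,
    following the state listings in the text (assumption: no other routes)."""
    e1, e2, e3 = state[1:].split("_")
    reactions = []
    if e1 == "1" and e2 in ("a", "1"):   # both sites flanking intron 1 are defined
        reactions.append("i1")
    if e2 in ("b", "1") and e3 == "1":   # both sites flanking intron 2 are defined
        reactions.append("i2")
    if (e1, e2, e3) == ("1", "0", "1"):  # exon 2 undefined, exons 1 and 3 defined
        reactions.append("i12")
    return reactions

# Parameter names as used in the text; the grouping into lists is illustrative.
SHARED_PARAMS = ["i1", "i2", "i12", "s", "kincl", "kskip", "kdr1", "kdr2", "kret"]
EXON_DEFINITION_PARAMS = ["k1", "k2", "k3", "k4", "k5", "k6"] + SHARED_PARAMS
INTRON_DEFINITION_PARAMS = (
    ["k1", "k2a", "k2b", "k3", "k4", "k5a", "k5b", "k6"] + SHARED_PARAMS
)

if __name__ == "__main__":
    print(len(exon_definition_states()), len(EXON_DEFINITION_PARAMS))
    print(len(intron_definition_states()), len(INTRON_DEFINITION_PARAMS))
    for state in intron_definition_states():
        print(state, splicing_reactions(state))
```

Because the same rule set covers both variants, the exon definition model appears here as the special case in which exon 2 never occupies the "a" or "b" state.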
To mimic the heterogeneity of exons in the human genome and to assess the robustness of our simulation results, we randomly sampled all kinetic parameters in our model 10,000 times. As a reference parameter set, all parameter values were set to 1, except for kret, kincl, and kskip, which were set to 0.01 to ensure low levels of intron retention that are typically observed in RNA sequencing datasets. We sampled each parameter in the model within a ±sevenfold range around this reference using Latin hypercube sampling (lhsdesign command in Matlab). We performed simulations for each parameter realization and calculated PSI = inclusion / (inclusion + skipping) as a measure of alternative splicing. We obtained a PSI distribution between 0 and 1 that closely resembled the experimentally measured genome-wide PSI in control cells. The same procedure was applied for intron definition, with the only difference being the number of parameters involved (17 in this case). These kinetic parameters belong to the following classes of reactions: spliceosome binding (k1, k2a, k2b, k3) and spliceosome dissociation (k4, k5a, k5b, k6); the remainder are identical to those used for exon definition.

Modeling FUBP1 knockout effects

To reproduce the FUBP1 KO data, we implemented two distinct assumptions about the mechanism of action of FUBP1: that FUBP1 affects late spliceosomal catalysis (i.e., the rate constants i1, i2 and/or i12), or that FUBP1 affects early spliceosomal binding (i.e., the rate constants k1–k6). For both mechanistic assumptions, we considered that FUBP1 predominantly binds long introns (Figure 6A). When simulating the effect FUBP1 KO has on splicing catalysis (model 2 in Figure 6A), we assumed that the splicing of short introns is unaffected, but that KO selectively reduces the splicing rate for the excision of long introns 3.5-fold compared to control.
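The parameter sampling and PSI calculation described earlier in this section can be sketched in a few lines. This is an illustration under stated assumptions rather than a reproduction of the Matlab lhsdesign call: the helper names are hypothetical, and the mapping of each unit-interval sample onto the sevenfold range is assumed to be log-uniform, which the text does not specify.

```python
import random

# Reference parameter set from the text: all values 1, except kret, kincl, kskip = 0.01.
REFERENCE = {name: 1.0 for name in
             ["k1", "k2", "k3", "k4", "k5", "k6", "i1", "i2", "i12", "s", "kdr1", "kdr2"]}
REFERENCE.update({"kret": 0.01, "kincl": 0.01, "kskip": 0.01})

def latin_hypercube(n_samples, n_params, rng):
    """Return n_samples points in [0, 1)^n_params such that, in every dimension,
    each of the n_samples equal-width strata contains exactly one point."""
    columns = []
    for _ in range(n_params):
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)  # decouple the strata across dimensions
        columns.append(column)
    return [list(row) for row in zip(*columns)]

def sample_parameters(reference, n_samples, fold=7.0, seed=0):
    """Draw parameter sets within [reference / fold, reference * fold).
    Assumption: the range is covered log-uniformly."""
    rng = random.Random(seed)
    names = list(reference)
    unit_samples = latin_hypercube(n_samples, len(names), rng)
    return [
        {name: reference[name] * fold ** (2.0 * u - 1.0) for name, u in zip(names, row)}
        for row in unit_samples
    ]

def psi(inclusion, skipping):
    """Percent spliced-in as defined in the text: inclusion / (inclusion + skipping)."""
    return inclusion / (inclusion + skipping)

if __name__ == "__main__":
    samples = sample_parameters(REFERENCE, n_samples=10)
    print(samples[0])
    print(psi(3.0, 1.0))
```

Each sampled dictionary would then be passed to the ODE solver, and the steady-state inclusion and skipping concentrations fed into `psi`.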
To reflect different combinations of long and short introns, we considered three scenarios in the FUBP1 KO simulations: (i) for the simulation of cassette exons flanked by two long introns, we assumed that the FUBP1 KO slows all three splicing reactions in the model, that is, the excision of intron 1, excision of intron 2 and exon skipping (i1, i2, and i12 are changed). (ii) For exons flanked by one short and one long intron, it was assumed that the splicing rate of the short intron is unaffected by FUBP1 KO, whereas splicing rates of the long intron and skipping are reduced. The long intron was either considered to be located upstream of the alternative exon (ii.a: i1 and i12 are changed) or downstream (ii.b: i2 and i12 are changed). In either case, the skipping reaction was considered as an FUBP1-dependent, long-range splicing event and was therefore perturbed in the FUBP1 KO simulation (i12 is changed). (iii) The third hypothetical scenario, in which an alternative exon is flanked by two short introns, was not explicitly considered in our simulations, as the model would predict no PSI change upon FUBP1 KO in this case. For each parameter sample (hypothetical exon), the KO scenarios i, ii.a, and ii.b were implemented separately, resulting in three sets of 10,000 KO simulations. For each of these, the PSI changes upon FUBP1 KO were calculated [ΔPSI = PSI(KO) – PSI(control)], and the corresponding ΔPSI distribution (Figure 6B) agrees well with the experimental observation in RNA sequencing experiments. In the alternative FUBP1 KO implementation (model 1 in Figure 6A), we assumed that FUBP1 promotes initial U2 binding to the 3' splice site. Because the 3' splice site marks the downstream end of an intron, we assume that the FUBP1 KO reduces spliceosome binding to exons located downstream of long introns. In our model, a long intron 1, therefore, results in a reduced exon 2 definition rate upon FUBP1 KO (k2 changed 1.7-fold compared to control).
Likewise, a long intron 2 diminishes exon 3 definition (k3 changed 1.7-fold upon FUBP1 KO). These perturbations were implemented alone (one long and one short intron), or in combination (two long introns), and the corresponding ΔPSI distributions across all 10,000 parameter realizations are shown in Figure 6B. The perturbation in binding parameters (k2, k3) was chosen to be smaller (1.7-fold) than the effect on splicing parameters (3.5-fold, model described above) to adjust for similar-sized effects on splicing in both implementations. In contrast to the FUBP1 KO RNA sequencing data, these spliceosome binding simulations predict opposite PSI changes for short introns being located upstream or downstream of the alternative exon. Hence, a model in which FUBP1 enhances the catalytic excision of long introns explains the FUBP1 KO data better when compared to a model in which FUBP1 primarily helps to recruit the pioneering U2 subunit to the 3' splice site. The same FUBP1 KO simulations were also implemented in the intron definition scenario. Here, the effect of FUBP1 on spliceosome binding (model 1 in Figure 6A) was assumed to affect the k2a parameter for a long upstream intron and k3 for long downstream introns. If both introns are long, FUBP1 influences both k2a and k3. The effect of FUBP1 on splicing catalysis (model 2 in Figure 6A) in the intron definition model was implemented in the same way as described above for the exon definition model. For FUBP1-based mechanisms of action, that is, binding and catalysis effects, very similar results were observed for the intron and exon definition scenarios (Figure S7G). Hence, the model's prediction that FUBP1 affects splicing catalysis is robust and does not depend on the mechanism of splicing decision making.
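The knockout scenarios described above amount to dividing a scenario-specific subset of rate constants by a fixed fold change and comparing PSI before and after. The sketch below encodes only that bookkeeping for the exon definition model. The dictionary keys for the binding scenarios and all function names are hypothetical labels, and the PSI values themselves would come from steady-state simulations that are not reproduced here.

```python
CATALYSIS_FOLD = 3.5  # reduction of splicing rates of long introns (model 2)
BINDING_FOLD = 1.7    # reduction of exon definition rates (model 1)

# Parameters perturbed per scenario, as listed in the text.
CATALYSIS_SCENARIOS = {
    "i": ("i1", "i2", "i12"),  # two long introns
    "ii.a": ("i1", "i12"),     # long intron upstream of the alternative exon
    "ii.b": ("i2", "i12"),     # long intron downstream of the alternative exon
}
# Scenario names below are illustrative; the text describes them in words only.
BINDING_SCENARIOS = {
    "long_intron_1": ("k2",),
    "long_intron_2": ("k3",),
    "both_long": ("k2", "k3"),
}

def apply_ko(params, affected, fold):
    """Return a copy of params in which every affected rate constant is reduced fold-fold."""
    ko_params = dict(params)
    for name in affected:
        ko_params[name] = params[name] / fold
    return ko_params

def delta_psi(psi_control, psi_ko):
    """PSI change upon knockout: PSI(KO) - PSI(control)."""
    return psi_ko - psi_control

if __name__ == "__main__":
    control = {"k2": 1.0, "k3": 1.0, "i1": 1.0, "i2": 1.0, "i12": 1.0}
    for label, affected in CATALYSIS_SCENARIOS.items():
        print(label, apply_ko(control, affected, CATALYSIS_FOLD))
    for label, affected in BINDING_SCENARIOS.items():
        print(label, apply_ko(control, affected, BINDING_FOLD))
```

For the intron definition variant, the text states that k2a takes the place of k2 in the binding scenarios, so only the `BINDING_SCENARIOS` table would change.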
Molecular Cell 83, 2653–2672.e1–e15, August 3, 2023

2.3.1 Supplementary material

Molecular Cell, Volume 83
Supplemental information
FUBP1 is a general splicing factor facilitating 3' splice site recognition and splicing of long introns
Stefanie Ebersberger, Clara Hipp, Miriam M. Mulorz, Andreas Buchbender, Dalmira Hubrich, Hyun-Seo Kang, Santiago Martínez-Lumbreras, Panajot Kristofori, F.X. Reymond Sutandy, Lidia Llacsahuanga Allcca, Jonas Schönfeld, Cem Bakisoglu, Anke Busch, Heike Hänel, Kerstin Tretow, Mareen Welzel, Antonella Di Liddo, Martin M. Möckel, Kathi Zarnack, Ingo Ebersberger, Stefan Legewie, Katja Luck, Michael Sattler, and Julian König

[Figure S1 image, panels A–E. Recoverable labels: (A) distribution of FUBP1, U2AF2, SF1, SF3B1, and PTBP1 iCLIP binding sites over 3' ss, 5' ss, intron, CDS, 5' UTR, 3' UTR, ncRNA, and intergenic regions; (B) "Saturation curve of downsampled reads", splice site usage [log2] (junction read bins); (C) EMSA of FUBP1N-box+KH [nM] with VPS13Dshort and VPS13Dshort mutated RNA (100 nM); (D) HSQC overlays of KH1–4 with single KH domains, ω - 1H (ppm); (E) secondary structure elements and relaxation data versus residue number.]

Figure S1.
FUBP1 binding at 3' splice sites and RNA binding of KH domains (related to Figure 1B, 1E and 2B-D)
(A) Distribution of binding sites across transcript regions for FUBP1 (n = 854,404), U2AF2 (n = 914,221), SF1 (n = 99,305), SF3B1 (n = 1,694,991), and PTBP1 (n = 127,450) iCLIP in HeLa cells (normalized for total transcript length). 3' and 5' splice site (ss) refer to 100 nt upstream and downstream of exons, respectively. CDS, coding sequence; UTR, untranslated region.
(B) Saturation analysis on downsampled iCLIP data (FUBP1, ~57,000,000 crosslink events; SF3B1, ~68,000,000; U2AF2, 54,000,000; SF1, 58,000,000; PTBP1, 49,000,000), where the iCLIP data for each splicing factor have approximately the same sequencing depth.
(C) Fluorescent electrophoretic mobility shift assay (EMSA) experiment on recombinant FUBP1N-box+KH (aa 1–457, 50 nM–12.8 μM) binding to a shortened 36-nt RNA fragment from VPS13D (VPS13Dshort, 100 nM) (left). Agarose gel image (top) and quantification (bottom) with fitted curve show FUBP1–RNA binding in the nanomolar range (dissociation constant [KD] = 0.28 ± 0.06 μM). Agarose gel of a fluorescent EMSA experiment on recombinant FUBP1N-box+KH (aa 1–457, 50 nM–12.8 μM) binding to VPS13Dshort mutated (100 nM) with U-to-C mutations in U-rich stretches, showing greatly reduced binding (right).
(D) Overlays of the 1H–15N heteronuclear single quantum coherence (HSQC) spectra of FUBP1 KH1–4 (black) with single KH domains (KH1, red; KH2, yellow; KH3, green; KH4, blue).
Nuclear magnetic resonance (NMR) experiments of FUBP1KH (KH1–4) show excellent spectral quality, despite the high molecular weight (~40 kDa), allowing most of the backbone chemical shifts to be assigned (310 out of 371 residues).
(E) 13Cα and 13Cβ secondary chemical shifts and 15N relaxation experiments: {1H}-15N heteronuclear nuclear Overhauser effect (NOE), T1, T2, of the four KH domains of FUBP1. Folded KH domains exhibit more rigidity (NOE ~ 0.9, T1 ~ 1 s, T2 ~ 60 ms) whereas linker regions are more flexible (lower NOE, lower T1, higher T2).

[Figure S2 image, panels A–N. Recoverable labels: (A) SIA (scaffold-independent analysis) workflow with nucleotide DNA pools and NMR preference per position; (B) CSP [ppm] versus residue number for KH1–KH4 with DNA and RNA motifs; (C–F) HSQC titrations (ω - 1H (ppm) vs ω - 15N (ppm)) and Δδobserved versus molar ratio for FUBP1KH1 + TTTTG, FUBP1KH2 + TTTGT, FUBP1KH3 + TCTGT, and FUBP1KH4 + TTTTG; (G–I) NMR titrations and ITC (DP [µW] vs time [min], ΔH [kJ/mol] vs molar ratio) for FUBP1KH12, FUBP1KH23, and FUBP1KH34; (J) relative motif frequency in top 20% FUBP1 binding sites versus non-bound regions; (K) normalized positional motif frequency versus position relative to branch point (nt); (L) fraction of introns (%) versus number of motifs; (M) cBRET versus Acc/Don expression ratio for FUBP1/U2AF2/SF1; (N) U2AF2RRM2 + FUBP1N-box titration.]

Figure S2. Scaffold-independent analysis and titration curves for the final optimal binding motifs for each KH domain (related to Figure 2E-I, 3C, F)
(A) Schematic workflow of the NMR-based scaffold-independent analysis (SIA) [S1]. SIA reports on the nucleic acid binding specificity of a given RNA-binding protein (RBP) at each position of a nucleic acid target. Sixteen 5-mer DNA pools with one specific nucleotide fixed at one position, otherwise randomized, are individually titrated to each KH domain. The observed changes in chemical shift of the selected peaks are averaged and normalized for each DNA pool to obtain a score for the nucleotide position and type preference.
(B) Comparisons of chemical shift perturbations (CSPs) of all four FUBP1 KH domains upon addition of the optimal nucleotide motifs as either DNA or RNA (1:1 molar ratio of protein to RNA). This shows that DNA and RNA binding are very similar for all four KH domains of FUBP1.
(C) NMR titration and dissociation constant (KD) calculation for the binding of FUBP1 KH1 (100 µM) with TTTTG up to a protein/DNA molar ratio of 1:8. The indicated residues are used for the calculation of KD. As expected for interactions of KH domains to nucleic acids, the changes in chemical shift are mostly mapped to the α1 and α2 helices and the GXXG loop.
(D) NMR titration and KD calculation for the binding of FUBP1 KH2 (100 µM) with TTTGT up to a protein/DNA molar ratio of 1:16. The marked residues are used for the calculation of KD. As expected, the changes in chemical shift are mostly mapped to the α1 and α2 helices and the GXXG loop.
(E) NMR titration and KD calculation for the binding of FUBP1 KH3 (100 µM) with TCTGT up to a protein/DNA molar ratio of 1:12. The marked residues are used for the calculation of KD.
As expected, the changes in chemical shift are mostly mapped to the α1 and α2 helices and the GXXG loop.
(F) NMR titration and KD calculation for the binding of FUBP1 KH4 (100 µM) with TTTTG up to a protein/DNA molar ratio of 1:12. The marked residues are used for the calculation of KD. As expected, the changes in chemical shift are mostly mapped to the α1 and α2 helices and the GXXG loop.
(G) NMR titration and ITC of FUBP1 KH1–2 binding to a DNA oligonucleotide containing an optimal DNA motif for each KH domain derived by SIA linked by an A4 linker (TTTGTAAAATTTTG). Consistent with the KD values from ITC, the NMR titration indicates binding in an intermediate exchange regime.
(H) NMR titration and ITC of FUBP1 KH2–3 binding to a DNA oligonucleotide containing an optimal DNA motif for each KH domain derived by SIA linked by an A4 linker (TCTGTAAAATTTGT). Consistent with the KD values from ITC, the NMR titration indicates binding in an intermediate exchange regime.
(I) NMR titration and ITC of FUBP1 KH3–4 binding to a DNA oligonucleotide containing an optimal DNA motif for each KH domain derived by SIA linked by an A4 linker (TTTTGAAAATCTGT). Consistent with the KD values from ITC, the NMR titrations indicate binding in an intermediate exchange regime.
(J) Motif enrichment in the in vivo FUBP1 iCLIP data. Disjunct 4-mer frequencies were calculated in extended windows (5-nt binding site ± 5 nt) around the top 20% of binding sites based on expression-normalized iCLIP signal and in non-bound regions in the same introns excluding a 20-nt region downstream of the 5' ss and a 150-nt region upstream of the branch point (BP). Enrichment for each motif is defined as the distance for each data point to the diagonal in the scatterplot of relative motif frequencies of the top 20% vs bottom 20% of binding sites.
(K) Positional enrichment of FUBP1 binding motifs and control motifs relative to the BP.
UUU+A/G/C, that is, 4-mers containing UUU interspersed at any position with A/G/C. Control motif sets are mononucleotide tracts interspersed by one other nucleotide. 4-mer frequencies were calculated position-wise upstream of the BP and compared to the average 4-mer frequencies in an intronic control region (a 100-nt-long region 100 nt downstream of the 5' splice site). Shaded regions correspond to the main binding regions of FUBP1 (red) and SF3B1 (blue).
(L) Abundance of FUBP1 binding motifs (UUU+A/G) at 3' ss of human introns. Abundance for other mononucleotide motifs (AAA + C/G & AAAA, CCC + G/C & CCCC, GGG + A/C & GGGG) is given for the purpose of comparison.
(M) Total luminescence and fluorescence measurements were used to estimate the amounts of FUBP1 or the mutants FUBP1A38D, FUBP1ΔC, and FUBP1W586,615R paired with wild-type U2AF2 and SF1 (orange), BCL2L1-BAD as a positive control pair (green) and pairs that are not known to interact with each other as negative controls (gray) in bioluminescence resonance energy transfer (BRET)-based assay. Acceptor/donor ratios are similar for all pairs, making the cBRET values more comparable to each other.
(N) NMR titration of U2AF2RRM2 with FUBP1N-box up to sixfold molar excess (left). Significantly shifted peaks are enlarged. The peaks with a chemical shift perturbation (CSP) ≥ 0.1 are shown in red along with corresponding residues on the structure of U2AF2RRM2 (right) (PDB ID: 8P25).
[Figure S3 image, panels A–K. Recoverable labels: (A) U2AF2RRM12 spectra with FUBP1, FUBP1N74, and FUBP1ΔN; (B) FUBP1 domain scheme (N-box, KH1–4, P-rich, A, B; residues 1–644) with tested FUBP1N-box mutations in the sequence 25-GGVNDAFKDALQRARQIAAKIGGDAGTSLNSN-56; (C) CSP [ppm] versus FUBP1 residue number; (D) FUBP1N-box + U2AF2RRM2 titration, ω - 1H (ppm) vs ω - 15N (ppm); (E) ΔδCα - ΔδCβ versus FUBP1 residue number, with and without U2AF2; (F, G) chemical shift differences [ppm] versus residue number and versus molar ratio [U2AF2RRM2]/[FUBP1N-box]; (H) KD [µM]; (I, J) chimera of U2AF2linker-RRM2 and FUBP1N-box; (K) chimera structure with GS-linker (GSGGSGSSGSGGSG).]

Figure S3. Determination of the minimal interaction interface between FUBP1N-box and U2AF2RRM2 (related to Figure 3F-H)
(A) Comparison of a selected region of the one-point NMR titrations (0.5 molar ratio) of U2AF2RRM2 (red) with full-length FUBP1 (cyan), FUBP1N74 (blue), and FUBP1ΔN (black) showing significant chemical shift changes.
(B) Overview of the FUBP1 construct used for NMR and BRET experiments. Red color marks mutations that are tested for effect on binding of FUBP1 with U2AF2.
(C) Comparison of CSPs upon the titration of a shortened FUBP1 N-terminal construct (FUBP1N-box, aa 25–56; green) and FUBP1N74 (blue) with U2AF2RRM12.
(D) NMR titration of FUBP1N-box with U2AF2RRM2 up to sixfold molar excess (left). Significantly shifted peaks are boxed and enlarged. The peaks with a CSP ≥ 0.1 are highlighted in red along with corresponding residues on the structure of FUBP1N-box (right) (PDB ID: 8P25).
(E) Comparison of the Cα and Cβ chemical shift-derived secondary structure of free FUBP1N-box (blue) and FUBP1N-box bound to U2AF2RRM2 (orange). The fractional helical conformation for residues 30–45 in the absence of U2AF2 is further increased upon binding to U2AF2.
(F) Comparison of the CSP of FUBP1N-box titrations with U2AF2 constructs of various lengths (U2AF2RRM12, black; U2AF2linker-RRM2, orange; U2AF2RRM2, light blue).
(G) Calculation of KD for the FUBP1N-box and U2AF2RRM2 interaction derived by NMR titration. The changes in chemical shift of selected residues in the titration of FUBP1N-box with U2AF2RRM2 (shown in panel D) are plotted against the molar ratio of ligand to titrant.
(H) Comparison of KD values for the interaction of FUBP1N-box (black) and FUBP1N74 (blue) with U2AF2RRM12 and FUBP1N-box with U2AF2RRM12 (black), U2AF2linker-RRM2 (orange), and U2AF2RRM2 (light blue), determined by ITC (Table S2). The measurements were performed in triplicate and data are represented as mean ± SD.
(I) Overlay of the 1H–15N HSQC spectra of the chimeric construct U2AF2linker-RRM2/FUBP1N-box (cyan) and FUBP1N-box titrated with U2AF2linker-RRM2 (molar ratio of 1:6).
(J) Overlay of the 1H–15N HSQC spectra of the chimeric construct U2AF2linker-RRM2/FUBP1N-box (cyan) and U2AF2linker-RRM2 titrated with FUBP1N-box (molar ratio of 1:6).
(K) NMR ensemble (10 lowest energy structures) of the chimeric construct U2AF2linker-RRM2 (green)/FUBP1N-box (brown). The end of the flexible linker between RRM1-RRM2 (231–245) is not shown; the artificial GS-linker between the C terminus of U2AF2 RRM2 and the N-terminal N-box region of FUBP1 is indicated by gray dashed lines (PDB ID: 8P25).

[Figure S4 image, panels A–I. Recoverable labels: (A) alignment of PUF60 RRM2 and U2AF2 RRM2 with structure overlay; (B) FUBP1 domain scheme with N-box mutations found in oligodendroglioma, chronic lymphocytic leukemia, uterine endometrioid carcinoma, thyroid carcinoma, neuroendocrine tumor, lung adenocarcinoma, colon adenocarcinoma, melanoma, and unknown; (C) structure with mutated residues; (D) NMR titration with U2AF2RRM2 for WT, I45F, and G47C, ω2 - 1H (ppm); (E, F) HSQC overlays of the chimera of U2AF2linker-RRM2 and FUBP1N-box, ω - 1H (ppm) vs ω - 15N (ppm); (G) T2 versus residue number; (H, I) BRET titration and Acc/Don ratios for FUBP1 + SF1, with negative control.]

Figure S4. The effects of cancer-related mutations in the FUBP1 N-box on the interaction with U2AF2 RRM2 (related to Figure 3C, 3I-J)
(A) Sequence alignment (Clustal Omega [S4]) of the RRM2 domains of human PUF60 and U2AF2, mapping conserved residues (red) and similar residues (orange). Overlay of the structure of PUF60 RRM2 (black) and U2AF2 RRM2 (green) (adapted from [S2] and [S3]; PDB IDs: 2KXH, 6TR0).
(B) FUBP1 N-box mutations identified in different cancer types.
Databases (see STAR Methods) were screened for the occurrence of cancer-related mutations within the region of FUBP1 encoding the N-box, yielding one insertion, five frameshifts (fs) leading to a premature termination codon (*) and 20 missense variants.
(C) Cancer-related mutations (labeled and side chains shown on the calculated structure of a chimeric construct of U2AF2RRM2 and FUBP1N-box, PDB ID: 8P25) within the helical binding region of FUBP1N-box and located at the interfaces with U2AF2RRM2 were selected for further NMR study.
(D) Comparison of the changes in chemical shift of residue A34 for the titration of FUBP1N-box wild-type and mutants (L35V, A38D, A43E, K44R, I45F, G47C) upon adding U2AF2RRM2.
(E) Overlay of the 1H–15N HSQC spectra of the A38D mutant of FUBP1N-box (red) with the chimeric construct U2AF2linker-RRM2/FUBP1N-box with A38D mutation (cyan).
(F) Overlay of the 1H–15N HSQC spectra of the chimeric construct U2AF2linker-RRM2/FUBP1N-box with A38D mutation (cyan) with the titration of FUBP1N-box with U2AF2linker-RRM2 shows that the mutant spectrum resembles those of the unbound individual components.
(G) Comparison of the 15N T2 relaxation rates of the chimeric constructs U2AF2linker-RRM2/FUBP1N-box wild-type and A38D mutant. Increased T2 relaxation rates in the N-box helix of the A38D mutant chimera compared to the wild-type is consistent with much weaker binding of the mutant to the U2AF2 RRM2.
(H) BRET titration curves shown for FUBP1 and FUBP1A38D versus SF1. As expected, mutation of the FUBP1 N-box does not result in significant loss of binding to SF1. Two biological replicates are shown, each done in technical triplicates. Error bars represent the standard deviation.
(I) Total luminescence (Don) and fluorescence (Acc) ratios were determined for FUBP1 and FUBP1A38D versus SF1.
Acceptor/donor ratios are similar for all pairs, making the cBRET values more comparable to each other.

[Figure S5 image, panels A–J. Recoverable labels: (A) "in vitro iCLIP oligo signal correlation" and (B) "in vitro iCLIP peak signal correlation" for U2AF2 50 nM alone and with FUBP1 at 50 nM or 300 nM (Rep1–Rep3); (C) U2AF2 peak signal (fold change over U2AF2 alone [log2]) for U2AF2 + 50 nM FUBP1 and U2AF2 + 300 nM FUBP1; (D) "in vitro iCLIP per nucleotide correlation" and (E) "in vitro iCLIP peak signal correlation" for U2AF2 alone and with FUBP1 FL, ΔN, and N74; (F) U2AF2 peak signal (fold change over U2AF2 alone [log2]) for U2AF2 + FUBP1N74, U2AF2 + FUBP1ΔN, and U2AF2 + FUBP1FL; (G) FUBP1 N-box sequences in RPE1 cell lines from residue 15: WT, AGGGGGGGGGGGVNDAFKDALQRARQIAAKIGGDAGTSLNSN...; FUBP1-Nboxmut (allele 1+2), AGGGGGGGGGGGVNDAAESRKLT---IAAKIGGDAGTSLNSN...; FUBP1 KO allele 1, AGGGGGGGGGGGVNDAD---CSKNWR*; FUBP1 KO allele 2, AGGGGGGGGGGGVNDAGPADCSKNWR*; (H) fold change [log2] KO/WT versus mean expression [log2], Up: 904, Down: 410; (I) FUBP1 KO and (J) FUBP1-Nboxmut, minimum intron length (nt) for exon groups.]

Figure S5. Reproducibility between replicates and changes in U2AF2RRM12 binding from in vitro iCLIP experiments and expression and splicing changes upon FUBP1 KO (related to Figure 4A-F, 5A-B)
(A) Reproducibility of in vitro iCLIP data with oligonucleotide-derived transcript library. The correlation matrix shows pairwise Pearson correlation of U2AF2RRM12 crosslink events per oligonucleotide (n = 1,998) between samples. Experiments were performed with U2AF2RRM12 alone (50 nM) and with the addition of full-length FUBP1 at 50 or 300 nM.
(B) Reproducibility of in vitro iCLIP data with oligonucleotide-derived transcript library.
The RRM12 correlation matrix shows pairwise Pearson correlation of total U2AF2 crosslink events inside U2AF2 binding sites between samples (1,831 oligonucleotides harbor a U2AF2 binding sites according to U2AF2 in vivo iCLIP). Experiments as in panel A. RRM12(C) Comparative boxplot of normalized U2AF2 crosslink events per binding site between conditions (n = 1,504). Experiments as in panel A. (D) Reproducibility of in vitro iCLIP data with eight long in vitro transcripts [S5]. The correlation RRM12 matrix shows pairwise Pearson correlation of U2AF2 crosslink events per nucleotide over all in vitro RRM12 transcripts between samples. Experiments were performed with U2AF2 alone (50 nM) FL N74 ΔN and with the addition of full-length FUBP1 , FUBP1 , and FUBP1 (all 50 nM). (E) Reproducibility of in vitro iCLIP data with eight long in vitro transcripts [S5]. Correlation matrix shows pairwise Pearson correlation of total binding signals (n = 109) between samples. Experiments as in panel D. RRM12 (F) Comparative boxplot of normalized U2AF2 crosslink events between conditions (n = 109). Experiments as in panel D. N-box (G) Zoom-in of the FUBP1 sequence, which when targeted with CRISPR/Cas9 results in a mut knockout cell line (FUBP1 KO) and a mutant cell line (FUBP1-Nbox ), in which FUBP1 lacks the U2AF2 interaction surface. (H) Log2 fold change versus mean expression for genes upon FUBP1 KO in RPE1 cells. (I) Minimum adjacent intron length for cassette exons that are more or less included and for mut constitutive exons less included in FUBP1-Nbox RPE1 cells (n = 123/249/27) compared to unchanged control exons (n = 4,584) and unchanged constitutive control exons (n = 5,717). mut (J) Minimum adjacent intron length for cassette exons that are more or less included in FUBP1-Nbox RPE1 cells (n = 36/45) compared to unchanged control exons (n = 10,678). 
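The replicate correlation matrices described in panels A, B, D and E of Figure S5 can be illustrated with a minimal sketch. It assumes that each sample is available as a vector of crosslink-event counts in a shared oligonucleotide order; the function name `replicate_correlation` and the toy counts are hypothetical and are not taken from the study's pipeline.

```python
import numpy as np


def replicate_correlation(counts_by_sample):
    """Return sample names and the pairwise Pearson correlation matrix.

    counts_by_sample maps a sample name to crosslink-event counts, one value
    per oligonucleotide (same oligonucleotide order in every sample).
    """
    names = list(counts_by_sample)
    data = np.vstack([np.asarray(counts_by_sample[n], dtype=float) for n in names])
    # np.corrcoef treats each row as one variable (here: one sample).
    return names, np.corrcoef(data)


# Toy counts, invented for illustration only.
toy_counts = {
    "U2AF2_rep1": [12, 0, 7, 30, 5, 18],
    "U2AF2_rep2": [24, 0, 14, 60, 10, 36],  # exactly twice rep1
    "U2AF2_FUBP1_rep1": [10, 3, 9, 22, 8, 15],
}

names, corr = replicate_correlation(toy_counts)
for name, row in zip(names, corr):
    print(name, np.round(row, 2))
```

Pearson correlation is insensitive to a constant scaling of one sample, which is why the two toy replicates that differ only by a factor of two correlate perfectly.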
Figure S6
[Figure S6, panels A–N: plot residue omitted. Recoverable labels: FUBP1 domain scheme (N, NLS, Pro, A, B; length 644); in vivo iCLIP track of FUBP1 on MPDZ; minigene constructs MPDZ, MPDZΔBS, MPDZΔintron and MPDZΔintron+ΔBS; cell lines WT and N-box mut; KD; BP.]

Figure S6. FUBP1 effects on long introns (related to Figure 5C-H)
(A) Position and identity of FUBP1 loss-of-function (LoF) mutations in glioma patients with 1p/19q deletion-positive background [S6].
(B) Genome browser view of the region included in the MPDZ minigene displaying the in vivo iCLIP data (crosslink events per nucleotide) of FUBP1 (orange). Deletions of introns with/without FUBP1 binding sites are indicated below with red bars.
(C) EMSA experiment to demonstrate binding of recombinant FUBP1 N-box+KH (aa 1–457, 25–3200 nM) to a fluorescently labeled 132-nt RNA fragment from MPDZ (100 nM). Agarose gel image (bottom) and quantification (top) with fitted curve show FUBP1–RNA binding in a nanomolar range (KD = 0.23 ± 0.03 μM).
(D) Capillary electrophoresis of exon inclusion levels upon intron shortening in the MPDZ minigene.
(E) Metaprofile showing the number of crosslink events of FUBP1 relative to the branch point in dependency on 3' splice site strength. iCLIP signals are normalized for expression and then averaged per nucleotide over all introns (left). Binding enrichment quantification: Area under the curve (AUC) in each intron class compared to the AUC in introns with very low 3' splice site strength (right).
(F) Metaprofile showing the number of crosslink events of FUBP1 relative to the branch point in dependency on 5' splice site strength. iCLIP signals are normalized for expression and then averaged per nucleotide over all introns (left). Binding enrichment quantification: AUC in each intron class compared to the AUC in introns with very low 5' splice site strength (right). (G) Metaprofile showing the number of crosslink events of FUBP1 relative to the branch point in dependency on Py tract strength. iCLIP signals are normalized for expression and then averaged per nucleotide over all introns (left). Binding enrichment quantification: AUC in each intron class compared to the AUC in introns with very low Py tract strength (right). (H) Metaprofile showing the number of crosslink events of FUBP1 relative to the branch point in dependency on BP strength. iCLIP signals are normalized for expression and then averaged per nucleotide over all introns (left). Binding enrichment quantification: AUC in each intron class compared to the AUC in introns with very weak BP strength (right). (I) Fraction of introns with 0, 1, 2, 3, or > 3 motif sets of size 9 of random 4-mers in dependency on intron length. Random sets were drawn 100 times and the resulting fractions were then averaged. (J) Cumulative distribution of splice site features conditioned on intron length. (K) Number of FUBP1-binding motifs upstream of the BP ([−100 nt; −26 nt]) in dependency on differential GC content. Differential GC content is the GC content of the exon minus that of the first 100 nt of the downstream intron. (L) Enrichment of FUBP1 binding upstream of the branch point in dependency on exon/intron GC content and exon rank. In the underlying metaprofiles, iCLIP signals are normalized for expression and then averaged per nucleotide over all introns. (M) Fraction of introns with 0, 1, 2, 3 or > 3 motif sets of size 9 of random 4-mers in dependency on differential GC content. 
Random sets are drawn 100 times and resulting fractions are then averaged.
(N) Percent of introns bound through different scopes of Euclidean distances where 1 means the nuclear center and 5 is the periphery. Enrichment is shown compared to the first scope. Based on data from [S7].

Figure S7
[Figure S7, panels A–L: plot residue omitted. Recoverable labels: normalized iCLIP signal versus position relative to branch point (nt) for FUBP1, SF1, SF3B1, U2AF65 and PTBP1; GC architecture (leveled, differential); GC content of upstream exon and intron; binding enrichment [log2]; intron length classes [100, 400], (400, 1000], (1000, 2000], (2000, 4000], (4000, 17000]; median intron length for mammalian vertebrates, invertebrates, fungi, plants and protozoa; model scheme with synthesis, binding, unbinding, inclusion, skipping, intron retention and degradation; model prediction (Model 1, Model 2) of splicing change upon FUBP1 KO [ΔPSI]; BRET for U1 proteins + FUBP1, cBRET versus Acc/Don expression ratio (tested interaction, positive ctrl, negative ctrl); HSQC spectra with axes ω1 - 15N (ppm) and ω2 - 1H (ppm) at molar ratios 1:0, 1:0.25 and 1:0.5; western blot (α-FUBP1) and inclusion/skipping bands for GFP, GFP-FUBP1 FL, GFP-FUBP1 A38D, GFP-FUBP1 ΔN, GFP-FUBP1 W586,615R and GFP-FUBP1 ΔC in RPE1 WT and FUBP1 KO cells.]

Figure S7. Characterization and modeling of FUBP1 binding behavior (related to Figure 5E, 5I-J, 6A-B, 6G-I)
(A) Metaprofile showing the number of crosslink events of FUBP1 relative to the BP in dependency on differential GC content. iCLIP signals are normalized for expression and then averaged per nucleotide over all introns.
(B) Binding enrichment quantification: AUC in each intron class compared to the AUC in introns with leveled GC content.
(C) Comparison of exon and intron GC content for exons with increasing differential GC content architecture.
(D) Enrichment of FUBP1 binding upstream of the branch point in dependency on intron length and differential GC content. Exons were classified into intron length groups and then split by GC content architecture (left panel) and vice versa (right panel).
(E) Intron length distribution between kingdoms. Analyses were performed for 174 mammals, 274 non-mammalian vertebrates, 277 invertebrates, 410 fungal species, 94 protozoa, and 145 plants.
(F) Detailed scheme of the mathematical model describing exon definition and splicing for a cassette exon flanked by two constitutive exons. After pre-mRNA synthesis (s), the three exons (indicated by boxes) can be cooperatively and reversibly bound by the pioneering spliceosome subunits U1 and U2 (these are not explicitly displayed in the scheme). Colorless (0) and colored (1) squares represent unbound ("undefined") and bound ("defined") exons, respectively.
Red, green, and blue arrows represent binding to and dissociation from exons 1, 2, and 3, respectively, where k1–k3 are the corresponding rate constants of binding and k4–k6 the rate constants of dissociation. Based on the exon definition patterns (highlighted by red ellipse), splicing decisions towards multiple splice isoforms (inclusion, skipping, first intron retention, second intron retention, full intron retention) are made, and it is assumed that an intron can be excised if the two neighboring exons are defined. For instance, skipping of exon 2 is possible from the state 1_0_1 and occurs with the rate i12. Likewise, splicing of the first intron occurs from the species P1_1_0 and P1_1_1 (rate i1), and splicing of the second intron from P0_1_1 and P1_1_1 (rate i2). The inclusion isoform is generated in two steps, i.e., from the subsequent removal of intron 1 and intron 2 in random order. All terminal splice products are subject to degradation (kincl: degradation rate constant of inclusion, kskip: skipping, kdr1: first intron retention, kdr2: second intron retention, kdr1+kdr2: full intron retention).
(G) The intron and exon definition models show similar splicing changes upon FUBP1 knockout (KO). We simulated the splicing changes upon FUBP1 KO based on the assumption that FUBP1 affects the rate of spliceosome binding to the 3' splice site of long introns (left panel) or the rate of splicing catalysis across long introns (right panel) as described in detail in the STAR Methods. To account for the heterogeneity of exons in the human genome, we randomly sampled the kinetic parameters of the model 10,000 times to generate an ensemble of 10,000 in silico exons. We then simulated FUBP1 KO for each in silico exon, assuming that FUBP1 selectively enhances the rate of splicing for long introns, and considered three scenarios reflecting different length configurations of upstream and downstream introns (see STAR Methods for details).
The boxplots show the distributions of ΔPSI = PSI(KO) – PSI(control) values for exon (red) and intron definition (blue) across all exons.
(H) Total luminescence and fluorescence measurements were used to estimate the amount of FUBP1 paired with the components of U1 complex (orange), BCL2L1–BAD as a positive control pair (green) and pairs that are not known to interact with each other as negative controls (gray). Acceptor/donor ratios are similar for all pairs making the cBRET values more comparable to each other.
(I) 1H–15N HSQC spectra of the titration of FUBP1 A/B with SNRPA up to a molar ratio of 1:1.
(J) 1H–15N HSQC spectra of the titration of PRPF40B N-term-WW12 with FUBP1 P-rich+A/B up to a molar ratio of 1:0.5.
(K) Western blot to verify FUBP1 construct expression after transfection of RPE1 WT and FUBP1 KO cells.
(L) Capillary electrophoresis of exon inclusion levels of the MPDZ minigene after transfection of RPE1 WT and FUBP1 KO cells with different FUBP1 constructs.

Supplementary Tables

Table S2. Binding affinities and stoichiometries determined by ITC experiments (related to Figures 2D, 2F, S2G–I, and S3H). Experiments were performed for different FUBP1 N-terminal constructs (FUBP1 N-box, FUBP1 N74) with U2AF2 constructs (U2AF2 RRM12, U2AF2 linker-RRM2 and U2AF2 RRM2) and various FUBP1 KH domain constructs (FUBP1 KH12, FUBP1 KH23, FUBP1 KH34 and FUBP1 KH) with DNA or RNA.

Analyte              Titrant           N sites        KD [μM]          Repeats
U2AF2 RRM12          FUBP1 N-box       0.98 ± 0.17    75.93 ± 2.70     3
U2AF2 RRM12          FUBP1 N74         0.85 ± 0.07    70.57 ± 2.41     3
U2AF2 linker-RRM2    FUBP1 N-box       0.91 ± 0.16    78.97 ± 2.16     3
U2AF2 RRM2           FUBP1 N-box       0.87 ± 0.16    82.03 ± 2.98     3
FUBP1 KH12           TTTGTAAAATTTTG    0.78 ± 0.07    4.71 ± 1.45      3
FUBP1 KH23           TCTGTAAAATTTGT    0.76 ± 0.09    1.15 ± 0.48      3
FUBP1 KH34           TTTTGAAAATCTGT    0.74 ± 0.04    0.87 ± 0.10      3
VPS13D RNA           FUBP1 KH          1.38 ± 0.04    0.428 ± 0.062    3

Supplementary References

[S1] Beuth, Barbara, María Flor García-Mayoral, Ian A. Taylor, and Andres Ramos. 2007. “Scaffold-Independent Analysis of RNA-Protein Interactions: The Nova-1 KH3-RNA Complex.” Journal of the American Chemical Society 129 (33): 10205–10.
[S2] Kang, Hyun-Seo, Carolina Sánchez-Rico, Stefanie Ebersberger, F. X. Reymond Sutandy, Anke Busch, Thomas Welte, Ralf Stehle, et al. 2020. “An Autoinhibitory Intramolecular Interaction Proof-Reads RNA Recognition by the Essential Splicing Factor U2AF2.” Proceedings of the National Academy of Sciences of the United States of America 117 (13): 7140–49.
[S3] Cukier, Cyprian D., David Hollingworth, Stephen R. Martin, Geoff Kelly, Irene Díaz-Moreno, and Andres Ramos. 2010. “Molecular Basis of FIR-Mediated c-Myc Transcriptional Control.” Nature Structural & Molecular Biology 17 (9): 1058–64.
[S4] Madeira, Fábio, Young Mi Park, Joon Lee, Nicola Buso, Tamer Gur, Nandana Madhusoodanan, Prasad Basutkar, et al. 2019. “The EMBL-EBI Search and Sequence Analysis Tools APIs in 2019.” Nucleic Acids Research 47 (W1): W636–41.
[S5] Sutandy, F. X. Reymond, Stefanie Ebersberger, Lu Huang, Anke Busch, Maximilian Bach, Hyun-Seo Kang, Jörg Fallmann, et al. 2018. “In Vitro iCLIP-Based Modeling Uncovers How the Splicing Factor U2AF2 Relies on Regulation by Cofactors.” Genome Research 28 (5): 699–713.
[S6] Seiler, Michael, Shouyong Peng, Anant A. Agrawal, James Palacino, Teng Teng, Ping Zhu, Peter G. Smith, Cancer Genome Atlas Research Network, Silvia Buonamici, and Lihua Yu. 2018. “Somatic Mutational Landscape of Splicing Factor Genes and Their Functional Consequences across 33 Cancer Types.” Cell Reports 23 (1): 282–96.e4.
[S7] Tammer, Luna, Ofir Hameiri, Ifat Keydar, Vanessa Rachel Roy, Asaf Ashkenazy-Titelman, Noélia Custódio, Itay Sason, et al. 2022. “Gene Architecture Directs Splicing Outcome in Separate Nuclear Spatial Regions.” Molecular Cell. Elsevier.

However, I performed the cloning of the ORFs, site-directed mutagenesis and generated constructs in the low-throughput (eppi tube format).
Next, I adapted these steps to medium throughput (plate format). The protocol is described in Appendix 5.1. The final pipeline, followed by the BRET assay, was tested in the second collaborative project (see Article II).

2.4 Article II: Systematic discovery of protein interaction interfaces using AlphaFold and experimental validation

Summary

This project focused on benchmarking AlphaFold-Multimer (AF-MM) and its metrics, and on applying the tool to identify novel protein interaction interfaces followed by experimental validation.

AF-MM is a machine learning-based tool to predict structures of protein interactions and complexes. While this tool has been tested on different PPI interface types, a comprehensive assessment of the sensitivity, specificity and potential biases of the tool and its metrics is generally lacking. Benchmarking AF-MM is essential for its application to the prediction of PPI interfaces. Therefore, we systematically benchmarked the tool's ability to predict domain-domain and domain-motif interfaces (DDIs and DMIs). We compared the predicted models to the solved structures and found that 35% of the benchmark DMIs were predicted correctly including the positions of the side chains, whereas a further 32% had a correct backbone prediction only. We also evaluated the metrics of AF-MM for their ability to distinguish known DMIs from random DMI pairs. Finally, we examined the effect of sequence length on the tool's performance and found that long fragments or full-length proteins can worsen the predictions.

These findings motivated us to develop a fragmentation approach, in which overlapping fragments are used to predict novel DMIs in human PPIs. We applied this strategy to 62 PPIs from the HuRI dataset that connect disease-associated proteins. This strategy improved the sensitivity but decreased the specificity of AF-MM. We then manually inspected high-scoring models and selected a subset for experimental validation.
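The idea of cutting a protein into overlapping fragments can be sketched as a simple sliding window. This is only an illustration under stated assumptions: the function name `overlapping_fragments`, the window length and the step size are hypothetical, and the fragmentation strategy used in Article II may define fragment boundaries differently.

```python
def overlapping_fragments(sequence, length, step):
    """Split a protein sequence into overlapping fragments.

    Returns a list of (start, end, fragment) tuples with 1-based, inclusive
    residue positions. A final fragment is added if needed so that the
    C-terminus is always covered. Sliding-window scheme chosen for
    illustration only.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    if len(sequence) <= length:
        return [(1, len(sequence), sequence)]
    last_start = len(sequence) - length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s + 1, s + length, sequence[s:s + length]) for s in starts]


# Toy 20-residue sequence; window of 8 residues moved in steps of 5.
toy_sequence = "ACDEFGHIKLMNPQRSTVWY"
fragments = overlapping_fragments(toy_sequence, length=8, step=5)
for start, end, fragment in fragments:
    print(start, end, fragment)
```

With a step smaller than the window length, neighboring fragments share residues, so a short motif that straddles one window boundary is still fully contained in an adjacent fragment as long as the motif is no longer than the overlap.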
Using a plate-based bioluminescence resonance energy transfer (BRET) assay, known for its sensitivity in detecting point mutation effects and motif-mediated protein-protein interactions (PPIs), we tested 28 of the 62 PPIs; BRET signals were significant for 11 of these 28 PPIs. Using the predicted structures, we selected key interacting residues that are also conserved, and designed mutations with the potential to disrupt the predicted interface as well as deletions of the predicted motif. We then experimentally tested seven predicted interfaces. Moreover, we discovered a novel interface between PEX3 and PEX16 and proposed a model for their interaction with PEX19. However, our experimental data also revealed inaccuracies and limitations of AF predictions, particularly for the FBXO28-STX1B, STX1B-VAMP2, ESRRG-PSMC5 and TRIM37-PNKP interfaces, which require further study for interface elucidation.

In summary, this project provided a thorough assessment of AF-MM and its metrics as well as a protein fragmentation strategy for predicting novel PPI interfaces, which was successfully applied to proteins likely associated with neurodevelopmental disorders. Our predictions, experimentally validated for 6/7 novel interfaces, offer molecular insights while also highlighting the potential limitations of AF-MM and the need for further advancements to increase prediction accuracy. So far, this is the largest effort in using AF-MM for PPI interface prediction coupled with experimental validation of the predicted interfaces.
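The comparison of predicted models against solved structures mentioned in this summary relies on superimposing both structures on the domain and then measuring the deviation of the motif. A minimal sketch of that idea follows, assuming matched atom coordinates are already available as NumPy arrays. The function names are hypothetical, the Kabsch algorithm is used here as one standard superposition method, and the coordinates are synthetic; none of this is taken from the published analysis code.

```python
import numpy as np


def kabsch(mobile, target):
    """Rotation R and translation t that best map mobile onto target
    (least squares), so that mobile @ R.T + t approximates target."""
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    covariance = (mobile - mobile_center).T @ (target - target_center)
    u, _, vt = np.linalg.svd(covariance)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = target_center - rotation @ mobile_center
    return rotation, translation


def motif_rmsd(model_domain, model_motif, native_domain, native_motif):
    """Superimpose the model on the native structure using domain atoms only,
    then return the RMSD over the (matched) motif atoms."""
    rotation, translation = kabsch(model_domain, native_domain)
    moved_motif = model_motif @ rotation.T + translation
    squared = np.sum((moved_motif - native_motif) ** 2, axis=1)
    return float(np.sqrt(squared.mean()))


# Synthetic coordinates for illustration only.
rng = np.random.default_rng(0)
native_domain = rng.normal(size=(20, 3)) * 10.0
native_motif = rng.normal(size=(5, 3)) * 3.0

angle = 0.7  # arbitrary rigid-body rotation about the z axis
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle), np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
shift = np.array([5.0, -2.0, 1.0])

# Model 1: the native complex moved as a rigid body.
model_domain = native_domain @ rot_z.T + shift
rmsd_identical = motif_rmsd(model_domain, native_motif @ rot_z.T + shift,
                            native_domain, native_motif)

# Model 2: same domain, but the motif is displaced by 3 units along x.
displaced = (native_motif + np.array([3.0, 0.0, 0.0])) @ rot_z.T + shift
rmsd_shifted = motif_rmsd(model_domain, displaced, native_domain, native_motif)
print(round(rmsd_identical, 3), round(rmsd_shifted, 3))
```

Fitting on the domain only means that a motif placed in the wrong position relative to the domain yields a large RMSD even when the motif's internal conformation is correct, which is the property needed to separate correct from incorrect pocket placement.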
Article

Systematic discovery of protein interaction interfaces using AlphaFold and experimental validation

Chop Yan Lee 1,5, Dalmira Hubrich 1,5, Julia K Varga 2,5, Christian Schäfer 1, Mareen Welzel 1, Eric Schumbera 1,4, Milena Djokic 1, Joelle M Strom 1, Jonas Schönfeld 1, Johanna L Geist 1, Feyza Polat 1, Toby J Gibson 3, Claudia Isabelle Keller Valsecchi 1, Manjeet Kumar 3, Ora Schueler-Furman 2✉ & Katja Luck 1✉

Abstract

Structural resolution of protein interactions enables mechanistic and functional studies as well as interpretation of disease variants. However, structural data is still missing for most protein interactions because we lack computational and experimental tools at scale. This is particularly true for interactions mediated by short linear motifs occurring in disordered regions of proteins. We find that AlphaFold-Multimer predicts with high sensitivity but limited specificity structures of domain-motif interactions when using small protein fragments as input. Sensitivity decreased substantially when using long protein fragments or full length proteins. We delineated a protein fragmentation strategy particularly suited for the prediction of domain-motif interfaces and applied it to interactions between human proteins associated with neurodevelopmental disorders. This enabled the prediction of highly confident and likely disease-related novel interfaces, which we further experimentally corroborated for FBXO23-STX1B, STX1B-VAMP2, ESRRG-PSMC5, PEX3-PEX19, PEX3-PEX16, and SNRPB-GIGYF1 providing novel molecular insights for diverse biological processes. Our work highlights exciting perspectives, but also reveals clear limitations and the need for future developments to maximize the power of Alphafold-Multimer for interface predictions.

Keywords AlphaFold; Protein Interaction Interface Prediction; Linear Motifs; Benchmarking; Experimental Validation
Subject Categories Computational Biology; Structural Biology
https://doi.org/10.1038/s44320-023-00005-6
Received 3 August 2023; Revised 4 December 2023; Accepted 5 December 2023
Published online: 15 January 2024

Introduction

Protein-protein interactions (PPIs) are essential for the proper functioning of essentially all cellular processes. The last decade has seen tremendous progress in the systematic mapping of human protein interactions enabling gene function prediction and the study of genotype-to-phenotype relationships (Luck et al, 2020; Drew et al, 2017; Huttlin et al, 2021). However, to understand the molecular function of individual PPIs, co-existence or mutual exclusivity of partner proteins in protein complexes, and the effect of mutations on protein function, structural information on how these proteins interact with each other is required. Unfortunately, a structure at atomic resolution is only available for ~4% of known human PPIs (Luck et al, 2020). Modular proteins interact with each other using a variety of different functional elements such as stably folded domains, intrinsically disordered polypeptide regions, short linear motifs (hereafter referred to as motifs), or coiled-coil helices forming domain-domain, domain-motif, disorder-disorder, or coiled-coil interfaces for example. Resources such as 3did (Mosca et al, 2014) or the ELM database (ELM DB) (Kumar et al, 2022) collect observed contacts between domain types and between domains and motifs, respectively. Such interface type collections can be used to predict occurrences of known interface types in protein interactions (Weatheritt et al, 2012; Mosca et al, 2013). However, it is reasonable to expect that many more protein interface types remain to be discovered. This is likely particularly true for motif-mediated PPIs, which are anticipated to number in the hundreds of thousands or millions (Tompa et al, 2014). Motifs are short stretches of amino acids in disordered regions of proteins that usually adopt a more rigid structure upon binding to folded domains in interaction partners (Davey et al, 2012). Motif-mediated interactions are of moderate binding affinity and thus, are particularly suited to mediate dynamic cell regulatory and signaling events (Van Roey et al, 2012). However, due to the transient nature of their interactions and the disorderliness of motif-containing proteins, this mode of binding is also expected to be highly understudied. Systematically generated human protein interactome maps (Luck et al, 2020; Huttlin et al, 2021) are likely a treasure trove for the discovery of novel interface types, yet no good experimental or computational methods exist to systematically map or predict protein interaction interfaces at scale.

1 Institute of Molecular Biology (IMB) gGmbH, 55128 Mainz, Germany. 2 Department of Microbiology and Molecular Genetics, Institute for Biomedical Research Israel-Canada, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel. 3 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg 69117, Germany. 4 Present address: Computational Biology and Data Mining Group Biozentrum I, 55128 Mainz, Germany. 5 These authors contributed equally: Chop Yan Lee, Dalmira Hubrich, Julia K Varga. ✉E-mail: ora.furman-schueler@mail.huji.ac.il; k.luck@imb-mainz.de

© The Author(s) Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75–97
The release of the neural network-based software AlphaFold (AF) was not only a breakthrough for the prediction of monomeric structures of proteins (Jumper et al, 2021) but multiple studies published shortly thereafter also suggested the ability of AF to predict structures of pairwise protein interactions and complexes. Sensitivities of around 70% were reported using benchmark datasets of structurally resolved protein interactions originally developed to evaluate docking methods (Akdel et al, 2022; Bryant et al, 2022; Johansson-Åkhe et al, 2021; preprint:Evans et al, 2021). Other studies focused on structures of domain-motif interfaces to specifically evaluate AF’s ability to predict structures for this mode of binding, reporting similar success rates (Akdel et al, 2022; Johansson-Åkhe et al, 2021; Tsaban et al, 2022). Only a few studies have also evaluated AF’s specificity for the prediction of interface structures using controls such as random protein pairs or mutation of motifs to poly-alanine stretches (Akdel et al, 2022; Johansson-Åkhe et al, 2021; Tsaban et al, 2022). Different benchmarking studies used different versions of AF and reported on different metrics for their ability to distinguish good from bad structural models (Bryant et al, 2022; O’Reilly et al, 2023; Tsaban et al, 2022; preprint:Evans et al, 2021; Teufel et al, 2023). We generally lack a comprehensive assessment of the latest AF releases and metrics across different types of PPI interfaces for their sensitivity, specificity, and potential biases for the prediction of complex structures.

In a landmark study, researchers applied AF onto 65,000 human PPIs derived from a yeast two-hybrid-based interactome map (hereafter referred to as HuRI) and highly confident co-complex associations to structurally annotate the human interactome with AF-derived models. High confidence models were obtained for about 3000 PPIs (Burke et al, 2023). The authors noted a smaller fraction of highly confident structural models obtained for PPIs from the HuRI dataset compared to the co-complex dataset and reported that proteins in HuRI contain more intrinsic disorder and are less conserved compared to proteins from co-complex datasets. AF model confidence scores also increased for PPIs with proteins that are less disordered and more conserved, indicating that AF predictions work less well for PPIs mediated by interfaces involving disordered regions such as domain-motif interfaces, which likely dominate the human interactome (Tompa et al, 2014). However, AF benchmarking studies all reported similarly high success rates for domain-motif interfaces compared to general docking benchmark datasets (Tsaban et al, 2022; Akdel et al, 2022). These discrepancies in sensitivities could be a result of two possible factors. First, they might point to differences in AF performance if small interacting fragments are used for interface prediction, as done in the benchmark studies, versus full length sequences used for structure prediction in (Burke et al, 2023). Second, these discrepancies could also point to difficulties of AF to predict structures of interface types involving disordered regions that have not been solved before, of which there are likely many in HuRI. It remains to be addressed to what extent these two possible factors contribute to the challenges encountered specifically for domain-motif interface modeling.

Determination of accuracies of novel predicted interface structures by AF ultimately requires experimentation. AF interface predictions for individual PPIs have occasionally been experimentally corroborated (Mishra et al, 2023; Bronkhorst et al, 2023). A more systematic experimental confirmation of AF interface models has been conducted using crosslinking mass spectrometry (XL-MS) (Burke et al, 2023; O’Reilly et al, 2023). While in-cell XL-MS is a very elegant approach to obtain experimental information on PPI interfaces in unperturbed settings, it is still a method that is only accessible to few experts in the field. Other experimental approaches are needed, which can, ideally at high throughput, confirm predicted interfaces for PPIs. In this study, we thoroughly benchmarked the two most recent versions of AlphaFold-Multimer (hereafter referred to as AF) for their ability to predict domain-domain and domain-motif interfaces (DDIs and DMIs). We found that prediction accuracies drop when using longer protein fragments or full length proteins for interface predictions and developed a strategy particularly suited for the prediction of novel domain-motif interfaces in human PPIs. We applied this strategy to 62 PPIs from HuRI that connect disease-associated proteins and experimentally assessed the obtained interface predictions for seven PPIs using a plate-based bioluminescence resonance energy transfer (BRET) assay (Trepte et al, 2018) combined with site-directed mutagenesis. We identify novel interface types and report on important limitations and sources of errors in AF-derived structural models, which pave the way for future improvements in the field.

Results

Evaluating AlphaFold’s accuracy for predicting domain-motif interfaces

To thoroughly assess the ability of AF to predict structures of binary protein complexes that are formed by a DMI, we extracted information on annotated DMI structures from the ELM DB (Kumar et al, 2022). We selected one representative structure per motif class (136 structures in total), manually defined the minimal domain and motif boundaries, and submitted the corresponding protein sequence fragments for interface prediction to AF (Fig. 1A; Dataset EV1). The domain sequences from this benchmark dataset mostly shared 20–30% sequence identity (Appendix Fig. S1A). To evaluate the accuracy of the predicted structural models, we superimposed the actual structure and predicted model on their domains and based on this superimposition, we computed the all atom RMSD between the motif of the predicted model and the actual structure (Fig. 1A). We found that 35% of the structural models were so accurately predicted that even the side chains of the motif were correctly positioned while for another 32% the backbone but not the side chains of the motif were accurately predicted. For 26% of the structures the motif was modeled into the correct pocket, but in a wrong conformation, while, for the remainder of the structures, AF failed to identify the right pocket (Fig. 1A; Dataset EV1). A similar performance was obtained when using the DockQ metric (Appendix Fig. S1B,C; Dataset EV1). This performance is unaltered when using or switching off AF’s template function (Fig. S1D,E). The use of DMI structures annotated by the ELM DB enables us to explore potential differences in AF’s performance regarding motif properties. We find no significant differences in average model accuracy between different categories of motif classes (two-sided Mann–Whitney test on all pairwise combinations, n: DEG = 10, DOC = 21, LIG = 94, TRG = 9, MOD = 2, α = 0.05, test statistics of all pairwise combinations between 15 and 852, Appendix Fig. S1F), although the variance in model accuracy appears to differ between the motif classes. Similarly, we found no significant difference in prediction accuracy when

[Figure 1, panels A–G: plot residue omitted. Recoverable labels: (A) workflow from 136 solved DMI, annotation of minimal interacting regions, AF prediction, superimposition on domain and RMSD calculation on motif, with example motif classes DEG_APCC_KENBOX_2 and DOC_USP7_UBL2_3; outcome proportions: correct sidechain 48 (35%), correct backbone 43 (32%), correct pocket 35 (26%), wrong pocket 10 (7%). Further panels: positive reference, 1 mutation in motif, 2 mutations in motif, random reference (randomly paired DMI), exclusion of motifs with unsolved residue or PTM; metrics: model confidence, domain chain interface pLDDT, motif chain interface pLDDT, average interface pLDDT, pDockQ, iPAE, residue-residue contact, atom-atom contact; examples MOD_SUMO_rev_2: UBE2I & PPIL4 and CLV_C14_Caspase3-7: CASP3 & ARHGDIB; PDB IDs 1KPS and 5IAN; peptide sequences EEIKAEKEAKTQAILLEM and EDDDDELDSKLNYKP; residue labels L257, S225, Y306, L138.]

Figure 1. Benchmarking and application of AF for DMI interface prediction using minimal interacting fragments.
(A) Schematic illustrating the assembly of the DMI positive reference dataset and evaluation of AF prediction accuracies by superimposition of the solved and modeled structures. Blue and cyan indicate the domain and motif in the native structure, respectively. Orange and yellow indicate the domain and motif in the modeled structure, respectively. Proportion of structures of DMIs predicted by AF to different levels of accuracy is shown on the right.
(B) Area under the Receiver Operating Characteristics Curve (AUROC) for different metrics using the DMI benchmark dataset as positive reference and the following different random reference sets: Left, 1 mutation introduced in conserved motif position; middle, 2 mutations introduced in conserved motif positions; right, random reshuffling of domain-motif pairs. Gray horizontal line indicates the AUROC of a random predictor. (C) Superimposition of AF structural model for motif class MOD_SUMO_rev_2 (orange) with homologous solved structure (PDB:1KPS) from motif class MOD_SUMO_for_1 (blue). The motif sequence used for prediction is indicated at the bottom, colored by pLDDT (dark blue = highest pLDDT). (D) Superimposition of AF structural model for motif class CLV_C14_Caspase3-7 (orange) with homologous structure (PDB:5IAN) solved with a peptide-like inhibitor (blue). The motif sequence used for prediction is indicated at the bottom, colored by pLDDT (dark blue = highest pLDDT). (E) AF prediction of a LIG_HCF-1_HBM_1 motif in CREBZF (orange) binding to the beta-propeller Kelch domain of HCFC1 (gray). Mutated domain residues for experimental testing are colored in green. (F) Close up on the interface shown between CREBZF and HCFC1 from (E). Coloring is the same as in (E). Key conserved motif residues are drawn as sticks. Mutated residues in the domain and motif for experimental testing are labeled. (G) BRET titration curves are shown for wildtype interactions and mutant constructs for CREBZF-HCFC1 pairs for two biological replicates, each with three technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively.

stratifying by the secondary structure elements adopted by the motifs (two-sided Mann–Whitney test on all pairwise combinations, n: helix = 42, strand = 7, loop = 87, α = 0.05, test statistics of all pairwise combinations between 184 and 2029, Appendix Fig. S1G), nor by how hydrophobic, symmetric, or degenerate the motif sequence is (Pearson r < abs(0.08), α = 0.05, Appendix Fig. S1H–J). AF models display significantly more differences to structures solved by other methods, i.e., NMR, than X-ray crystallography (two-sided Mann–Whitney test, n: X-ray = 115, Others = 21, p < 0.01, test statistics = 811, Appendix Fig. S1K) possibly because NMR structures better represent structural dynamics that AF cannot capture, since it was trained to predict the crystallized forms of proteins.

The all-atom motif RMSD significantly anti-correlates with various AF-derived metrics (Pearson r = −0.55, p-value < 0.05, Appendix Fig. S1L,M; Dataset EV1) suggesting that these metrics are indicative of good versus bad structural models and can be used for de novo interface predictions. To evaluate AF’s ability to identify high-confidence structural models of DMIs, we generated three different random DMI datasets. First, we randomly paired domain and motif sequences from the positive reference dataset taking into account that no motif sequence was paired with a domain sequence from the domain type that the motif is known to interact with. Second and third, we mutated one and two key motif residues, respectively, to residues of opposite chemico-physical properties. Based on the conservation of these key motif residues, we assume that the mutations would be disruptive to binding, at least when experimentally tested using minimal interacting protein fragments. Receiver operating characteristic (ROC) and precision-recall (PR) curves using the positive and random datasets (Fig. 1B; Appendix Fig. S2A,B; Dataset EV2) show that the domain interface residue pLDDT (for all metric definitions, see Methods) or the number of atoms or residues predicted to be in contact with each other, discriminated poorly between all reference datasets (AUC around 0.64). Furthermore, we observed that all tested metrics failed to discriminate interacting from non-interacting interfaces when mutating one motif residue (max AUC 0.66). However, the AF-derived metrics model confidence (preprint:Evans et al, 2021), average interface residue pLDDT, average motif interface residue pLDDT, pDockQ (Bryant et al, 2022), and iPAE (Teufel et al, 2023) discriminated well between both reference datasets when randomizing domain-motif pairs or introducing two motif mutations (max AUC 0.86, ROC statistics and ideal cutoffs can be found in Dataset EV2). We also evaluated whether the top 5 reported models by AF tend to be more similar to each other when corresponding to a correct structural model (Pozzati et al, 2022) and found that this feature has moderate predictive power (Appendix Fig. S2C).

Application of AlphaFold for providing structural models for motif classes without available structural data

After evaluating the accuracy of AF to predict DMIs using minimal interacting regions, we aimed to use this setup for the prediction of structural models for motif classes in the ELM DB for which no structure of a complex has been solved yet. We identified 125 such motif classes based on ELM DB annotations. Of those, we selected all domain-motif instances where both the motif and the domain were derived from human or mouse proteins and submitted the corresponding domain and motif sequences for structure prediction to AF (Dataset EV3). Using a motif chain pLDDT cutoff of > 70, we obtained confident structural models for 21 motif classes. We manually inspected the structural models and noticed that even though these ELM classes have no annotations with structures, solved structures for an exact ELM instance or a very likely new instance for the ELM class are available for 11 out of the 21 cases. For most others, a close homolog structure had been solved, i.e., for LIG_MYND_3 and LIG_MYND_1, a structure solved by NMR for a LIG_MYND_2 interaction is available (Appendix Fig. S2D,E). For MOD_SUMO_rev_2, a structure of a reversed motif is available (and annotated as such in the MOD_SUMO_for_1 class). Here it is interesting to see how very dissimilar binding modes (flexible for MOD_SUMO_for_1, helical for MOD_SUMO_rev_2) are still able to place the important binding residues in the same pockets (Fig. 1C). For CLV_C14_Caspase3-7, the structure of the caspase bound to peptide-like inhibitors has been solved (e.g. PDB:1F1J, PDB:5IAN, PDB:6KMZ), and structures of more distant caspases bound to a cleaved peptide substrate are also available. For proteases, one great advantage of AF is the ability to model both the catalytically active enzyme and an uncleaved substrate, which is practically impossible to solve experimentally (Fig. 1D).

Finally, for LIG_HCF-1_HBM_1 we were not able to identify a homologous structure in the PDB, hence, our AF-derived structural models for this motif class are likely novel. Motifs of this class are bound by the N-terminal beta-propeller Kelch domain of HCFC1 consisting of six Kelch repeats. Kelch domains have been shown to bind to motifs at a number of different sites, and thus, without prior knowledge, it is difficult to determine where the HCFC1-binding motif (HBM) would bind. HCFC1 is a transcription factor that associates with other transcription factors (Lu et al, 1997), splice factors (Ajuh et al, 2002), and cell cycle regulators (Freiman and Herr, 1997; Machida et al, 2009). We generated AF models of high confidence for the HCFC1 Kelch domain interacting with multiple motif instances that are annotated in the ELM DB. All complexes show the tyrosine of the motif docked into a deep pocket at the bottom/top of the Kelch domain (Fig. 1E,F; Appendix Fig. S2F–H), with slight variations in how the tyrosine is exactly positioned in the pocket (Fig. S2F–H). Based on clone availability we selected the structural model between HCFC1 and CREBZF for experimental validation. For this purpose, we used a BRET protein interaction assay that is based on transient overexpression of two proteins in HEK293 cells (Trepte et al, 2018). Both proteins are expressed as fusion constructs either to the Nanoluc luciferase (the donor) or mCitrine (the acceptor). Interaction of both proteins results in a BRET from the oxidized substrate of the donor to the acceptor molecule, if both are close enough to each other for the BRET to occur (see Methods for details). We observed significant binding and BRET saturation when assaying wildtype CREBZF and HCFC1 proteins (Fig. 1G; Appendix Fig. S2I,J). Mutation of the [DE]H.Y motif tyrosine to alanine (Y306A) or mutation of two residues in the Kelch domain pocket (L257F, L138F), which are modeled to be in contact with the motif tyrosine or histidine residue (Fig. 1F), strongly reduced BRET signals indicating weakening or loss of binding (Fig. 1G; Appendix Fig. S2I,J). A pathogenic mutation (S225N, source ClinVar (Henrie et al, 2018)) close to the pocket slightly reduced expression levels of HCFC1 but did not result in loss of binding (Fig. 1F,G; Appendix Fig. S2I,J). Our experiments suggest that a potential pathogenic mechanism of this mutation is not mediated via perturbed binding of partners to the Kelch repeat domain pocket of HCFC1 that we identified in this study. Unfortunately, no assertion criteria for the annotation of this mutation to be pathogenic are provided by ClinVar meaning that the mutation is either not pathogenic after all or its pathogenicity is mediated via another perturbed function not tested in this study. Collectively, these experimental results support the structural models of the HCFC1 Kelch domain pocket - motif interaction and overall provide highly confident structural models for multiple motif classes of the ELM DB without available structural information (Dataset EV4).

Figure 2. Effect of protein fragment extensions on the accuracy of AF predictions. (A) Workflow established to assess changes in AF performance upon protein fragment extension. Blue and cyan indicate the domain and motif in the native structure, respectively. Orange and yellow indicate the domain and motif in the modeled structure, respectively. (B) Heatmap showing the fold change in motif RMSD before and after extension where positive values indicate improved predictions from extension and negative values indicate worse prediction outcomes upon extension. (C) Heatmap of the average model confidence for combinations of different motif and domain sequence extensions. (D) Optimal cutoffs derived for different metrics from ROC analysis benchmarking AF with different motif and domain extensions from the reference dataset used in A and random pairings of domain and motif sequences. pLDDT-related metrics were divided by 100 for visualization purposes. (E, F) Superimposition of the structural model of the minimal (left, orange) or extended (right, yellow) motif sequence with the solved structure (motif in blue) for two different motif classes as indicated on the top of each panel.
The motif sequence from the solved structure is indicated at the bottom. Motif residues are underlined, motif residues not resolved in the structure have a gray background. Sticks indicate the motif residues, domain surfaces are shown in gray based on experimental structures. (G) Superimposition of the structural model of the minimal (orange) and extended (yellow) motif sequence with the solved structure (motif in blue) for a motif instance from the motif class LIG_BIR_III. Motif sequence indicated as in (E). (H) Area under the Receiver Operating Characteristics Curve (AUROC) for different metrics using the DDI benchmark dataset as positive reference and randomly shuffled domain-domain pairs as random reference. Gray horizontal line indicates the AUROC of a random predictor.

Evaluation of AlphaFold’s ability to predict interfaces in full length proteins

Most PPIs known to date have been identified using full length protein sequences in systematic interactome mapping efforts. For the vast majority of these PPIs, no fragment or interface information is available. Thus, the question emerges how AF would perform on DMI predictions when longer protein sequences or full length proteins are submitted. To answer this question we selected 31 DMI structures from the positive reference dataset used above and generated random domain-motif pairs of those as negative control. The selected structures were sampled from different prediction accuracy categories (Fig. 1A; Dataset EV5). We then gradually extended the motif and domain sequences by first adding flanking disordered regions, then neighboring folded domains before using the full length sequences (Fig. 2A). Comparison of the motif RMSD computed for extended versus minimal domain-motif pairs from the positive reference dataset revealed that the addition of flanking disordered regions on the motif or domain side sometimes slightly improved prediction accuracies while the addition of neighboring structured domains or the use of full length sequences led to a significant worsening of model accuracies (Fig. 2B; Dataset EV5). Interestingly, despite the fact that, for smaller extensions, model accuracies remained the same or slightly improved as determined by motif RMSD, AF-derived metrics such as the model confidence or average motif interface residue pLDDT gradually dropped with increasing fragment length (Fig. 2C; Appendix Fig. S3A-C). ROC plots of predictions for a benchmark consisting of the positive and random domain-motif pairs revealed that, upon extension, the optimal cutoff of model confidence and iPAE considerably changed as well (Fig. 2D; Appendix Figs. S3D,E, S4A; Dataset EV6). This means that different model confidence or iPAE cutoffs are to be used depending on the length of the submitted protein sequences, which is rather impractical and thus disfavors both metrics for DMI predictions. The average motif interface residue pLDDT metric appeared to be more robust with respect to fragment length. Based on these results we chose this as the main metric and a cutoff of 70 to discriminate good from bad AF-generated DMI models regardless of fragment length.

Extending motif sequences for interface prediction with AlphaFold reveals important motif sequence context

Various studies have highlighted that flanking sequences of motifs can influence binding affinities and specificities (Luck et al, 2012; Bugge et al, 2020). Motif annotations in the ELM DB usually refer to the core sequence of the motif, often because information on putative roles of flanking sequences is missing. In the previous section, we observed that some motif extensions notably improved AF prediction accuracies. In the hope that these cases would point to motifs with important sequence context, we manually inspected eight predictions for which the motif RMSD decreased by more than 1 Å when extending the minimal motif sequence once to the left and right by the length of the motif (extension step 1 in Fig. 2A; Appendix Fig. S4B).

By doing so interesting patterns emerged: The most prevalent contribution to increased prediction accuracies is the stabilization of the secondary structure of the motif contributed by both sidechain and backbone atoms in the flanking regions, as shown for the interaction involving the motif LIG_CAP-Gly_2 (Fig. 2E; Appendix Fig. S4C). For the LIG_NBox_RRM_1 motif, AF placed a part of the domain into the binding pocket rather than the motif, although the motif had the correct helical conformation. Elongation of the motif extended this helix, thereby increasing the interaction surface and eventually pushing out the domain’s tail from the pocket (Fig. 2F). This fits with other reports where AF has been shown to predict preferential binding of competing motifs (Chang and Perez, 2023). For the LIG_HOMEOBOX class prediction, the motif is positioned in the wrong pocket unless flanking regions are included (Appendix Fig. S4C). For DOC_MAPK_JIP1_4, motif extension results in an extended motif conformation and consequently in a structural model with lower overall RMSD (Appendix Fig. S4C). For the LIG_GYF class, most models converge into an inverse orientation of the backbone except for one of the extended motifs, which lies in the binding pocket in the correct orientation (Appendix Fig. S4C). In summary, these analyses point to motif classes whose sequence boundaries could be refined.

Interestingly, for a motif instance from the LIG_BIR_III_2 class, slight motif extensions actually led to a substantial decrease in prediction accuracy. In this case, the motif is located at a neo-N-terminus that is only revealed after cleavage of the protein by a caspase (Fig. 2G). When the motif is extended in the context of the full length protein, the residues now upstream of the previous neo-N-terminus likely impede binding of the motif into the pocket due to steric clashes. AF predicts the extended motif to bind in reversed orientation and it is mostly pushed out of the pocket. This highlights the importance of not only incorporating sequence context but also knowledge about the biological context, wherever possible, into AF modeling and model interpretation.

Evaluating AlphaFold’s performance for the prediction of domain-domain interfaces

Folded domains can not only interact with motifs but also with other folded domains forming so-called domain-domain interfaces (DDIs). To enable simultaneous prediction of DDIs and DMIs in a given protein interaction, we set out to evaluate AlphaFold’s performance on DDI predictions using a reference dataset of 48 DDI structures that we manually curated out of random selections of domain-domain contact pairs extracted from 3did (Mosca et al, 2014). As a negative dataset, we randomized the pairing of these domains. Using ROC and PR statistics we found that AlphaFold performed slightly worse on this DDI benchmark dataset compared to its performance on DMIs (max AUC 0.73 vs. 0.86) (Fig. 2H; Appendix Fig. S4D–F; Dataset EV7) but still showed significant discriminative power. Interestingly, the best performing metric for DDI predictions was the average interface pLDDT score with an optimal cutoff of 75, which ranked fourth for DMI predictions.

Comparison of AlphaFold v2.2 with v2.3

During the course of our work, AF multimer version 2.3 was released. To determine whether the new release improved DMI and DDI prediction accuracies, we repeated all benchmarking with AF v2.3 and found that motif RMSDs and other AF-derived metrics on average improved compared to AF v2.2 when using minimal interacting fragments (Appendix Fig. S5A–D; Dataset EV1, two-sided Wilcoxon signed-rank test on motif all atom RMSD: n = 136, W = 2413, p < 0.0001). AF v2.3 still showed a decrease in prediction accuracy when using extended protein fragments but this decrease was less pronounced compared to the corresponding decrease for v2.2 (Appendix Fig. S5E,F; Dataset EV5). Despite these improvements on the sensitivity side of AF, when benchmarked against random datasets, overall prediction accuracies only slightly improved compared to v2.2 (Appendix Fig. S5G,H; Appendix Fig. S6A–C; Dataset EV2, EV6, EV7, EV8).

Application of AlphaFold for the discovery of novel interfaces in protein interactions without any a priori interface information

Since the use of larger or full length protein sequences leads to a poor sensitivity for DMI predictions by AF, we devised the following strategy for the use of AF for interface predictions for known protein interactions: Using AF models of the full length monomeric structures of both interacting proteins, we decided on boundaries between structured domains and disordered regions based on manual inspection (see Methods). We then fragmented the disordered regions by designing overlapping fragments varying in length from ten residues up to the length of the respective disordered region (Fig. 3A). We then paired disordered with ordered, and ordered with ordered fragments for interface prediction by AF (Fig. 3A). To assess to which extent this
[Figure 3 panel labels. (B) Fragments versus longest extensions. (C, E) Prediction outcome categories: correct, likely correct, questionable, likely wrong, wrong, no result obtained, no prediction performed. (D) Network legend: BRET detection (not tested; tested, no interaction; tested, interaction); interface (not mutated; mutated); highest scoring and repeatedly identified interface. (E) x-axis: number of PPIs; pDockQ score from Burke et al: no score, <0.23, 0.23−0.5, >0.5.]

Figure 3. AF prediction and experiments on PPIs connecting NDD proteins. (A) Schematic of the fragmentation approach applied on a pair of interacting proteins, A and B. Proteins are fragmented into folded and disordered regions based on manual inspection. Disordered regions are further fragmented. All disordered and folded fragments of one protein are paired with the folded regions of the other protein and vice versa for AF prediction.
(B) Accuracy measured in motif RMSD compared to native structures for models obtained from fragmenting proteins from 20 DMIs from the positive reference dataset and comparison to model accuracy obtained when using (near) full length proteins for structure prediction (red crosses). Only models that meet the cutoff for identifying high-confidence models are shown. Six DMIs did not result in any such model. The gray horizontal line indicates the RMSD cutoff used to identify accurate models (see methods for details). (C) AF prediction outcome on 62 HuRI PPIs connecting NDD proteins. (D) PPI networks illustrating AF prediction outcomes and experimental retesting of PPIs in BRET assay. (E) Number of PPIs connecting NDD proteins with structural models at indicated pDockQ cutoffs from (Burke et al, 2023) grouped based on AF prediction outcomes using the fragmentation approach as shown in (C). (F) cBRET, total luminescence, and fluorescence for 28 PPIs connecting NDD proteins that were tested in the BRET assay. Luminescence and fluorescence measurements indicate expression levels of NL and mCit fusion proteins, respectively. Black horizontal lines indicate expression level and PPI detection cutoffs. The gray vertical line separates the detected (left) from undetected PPIs. Protein pairs in bold indicate those selected for interface validation via site-directed mutagenesis. Error bars indicate STD of three technical replicates. Source data are available online for this figure.

fragmentation approach would lead to an increase in sensitivity but also in false model predictions, we selected 20 out of the 31 DMI structures that were previously used to investigate the effect of fragment extension on prediction accuracies. We attempted model prediction with the full length sequences of these 20 DMI pairs and obtained a model for two of which only one met the motif interface pLDDT cutoff and corresponded to an accurate prediction (TRG_AP2beta_CARGO_1 in Fig. 3B; Dataset EV9, see methods for details). We then switched to using fragment extension step 5 for motifs and/or 2 for domains (Fig. 2A) and obtained accurate models for an additional 5 of the 20 DMI pairs. Applying the full fragmentation approach onto all 20 DMI pairs resulted in accurate model prediction for an additional 6 DMI pairs (Fig. 3B) representing an increase in sensitivity for full length vs fragments from 5 to 60%. We then shuffled the 20 DMI pairs to generate 20 random DMI pairs for which we performed the fragmentation approach. As expected from an earlier estimated 20% false positive rate (FPR) (Appendix Fig. S4A), 19 of the 20 random protein pairs had at least one fragment pair that produced a model above the motif interface pLDDT cutoff (Appendix Fig. S6D; Dataset EV9) indicating that predictions done using this fragmentation approach can substantially increase sensitivity while also producing a considerable number of false models using the established scoring metrics. This needs to be taken into account when modeling new interactions with this fragmentation strategy, as covered in the following section.

We selected PPIs from HuRI that connect proteins associated with neurodevelopmental disorders (NDDs) and subjected these to our AF fragmentation pipeline to predict putative DMIs and DDIs. For 51 out of 62 PPIs we obtained at least one structural model of significant confidence (Fig. 3C,D). In retrospect, manual inspection of the predictions obtained for these PPIs revealed that, for 9 PPIs, a solved structure of the interface was already available. Reassuringly, six out of these were accurately predicted by AF. For the remainder of the PPIs, 12, 16, and 14 resulted in a likely correct, questionable, or likely wrong prediction, respectively, based on manual inspection of the models (Fig. 3C,D; Dataset EV10). Likely wrong predictions were scored as such based on docking of the protein partner into nucleic acid or metal ion binding or catalytically active sites. We also considered structural models as likely wrong, if different protein fragments of the partner were predicted with similarly high scores to bind to the same pocket on the domain. More detailed information can be found in Methods and Appendix Text S1. Of note, for 8 of the 12 PPIs with a likely correct prediction, AF predictions performed using the full length proteins (Burke et al, 2023) did not result in a high confidence prediction (Fig. 3E). 28 of the 62 PPIs were in our hands amenable to experimental testing using the BRET assay introduced earlier (see Methods for details). Significant BRET signals were observed for 11 of these 28 PPIs (Fig. 3F). Of those, 7 PPIs were selected for validating the predicted interfaces (Fig. 3D,F). The remaining four PPIs were not further considered because for three of them a structure already exists (CSNK2B-CSNK2A1, PNKP-XRCC4, UBA5-GABARAPL2) and for the fourth interaction (KCTD7-CUL3) we classified the predicted interface as likely wrong. Next, we will first describe failures in validating predicted interfaces followed by the successes.

For the interaction between PNKP and TRIM37, we obtained high-confidence structural models involving two different interfaces. AF predicted the PNKP FHA domain to bind to several disordered stretches in TRIM37 (Fig. 4A) that are overall negatively charged. These short regions were predicted to bind to a pocket on the FHA domain that is known to bind phosphorylated threonines (Durocher et al, 2000), which led us to conclude that these predictions were likely wrong. AF also predicted the MATH domain of TRIM37 to bind to two separate disordered putative motifs located between the FHA domain and phosphatase domain in PNKP (Fig. 4A–C). However, none of the mutants aimed at disrupting the predicted interfaces (Fig. 4B) involving the MATH domain showed a decrease in BRET signal compared to wildtype (Fig. 4D; Appendix Fig. S7A) indicating that TRIM37 and PNKP do not interact with each other via this interface.

AF predicted with high confidence binding of PSMC5 to the hormone receptor domain of ESRRG via two distinct motifs (Fig.

Fig. S7B,C) indicating that PSMC5 might bind to ESRRG via this pocket but not with the predicted motifs.

AF predicted a coiled-coil interface between STX1B and VAMP2 of moderate confidence (Fig. 5A,B). STX1B is a close homolog to STX1A, which binds in a 4-helix bundle to VAMP2 together with SNAP25 in a 1:1:2 stoichiometry, respectively, as observed by crystallography (PDB:1N7S (Ernst and Brunger, 2003)). This structure together with our predictions suggests that STX1B might bind VAMP2 in a similar way. Indeed, removal of the single helical SNARE domain in STX1B led to complete loss of binding to VAMP2 (Fig. 5C; Appendix Fig. S8A,B). Interestingly, FBXO28 was predicted by AF to bind to STX1B via a similar coiled-coil interface involving an extended helix in FBXO28 and the SNARE domain in STX1B (Fig. 5A,D). Here, deletion of the SNARE domain in STX1B or of the extended helix in FBXO28 reproducibly reduced, but did not abolish the interaction between STX1B and FBXO28 (Fig. 5E; Appendix Fig. S8C,D). We identified three pathogenic or likely pathogenic mutations in the SNARE domain of STX1B in ClinVar of which V216E and G226R are associated with generalized epilepsy with febrile seizures plus, type 9. Testing all three mutations in the BRET assay we observed a drastic decrease in binding for STX1B V216E to FBXO28 (Fig. 5F; Appendix Fig. S8C,D). However, the measured effects of the mutations on the FBXO28-STX1B interaction do not correlate with their location at the predicted interface. V216E, for example, is not predicted to be in contact with residues of FBXO28 (Fig. 5D). This indicates that the actual predicted orientation of the two extended helices with respect to each other is likely incorrect.

The fact that the deletion of the extended helix in FBXO28 or the SNARE domain in STX1B reduced but did not abrogate binding of both proteins to each other (Fig. 5E) suggests that a secondary interface might exist. Indeed, AF predicted additional interfaces between FBXO28 and STX1B involving folded and disordered regions in both proteins (interfaces i and ii in Fig. 5A). Mutations designed to disrupt these interfaces partially confirmed the involvement of some of these regions in binding as assayed with BRET (Appendix Fig. S8E–H). In addition, the pathogenic mutation R348L in FBXO28 predicted to be at interface ii seemed to increase binding to STX1B (Appendix Fig. S8I–L). In summary, our experimental data indicate that multiple regions of FBXO28 and STX1B may be involved in the binding but the exact structural details of this interaction remain to be elucidated. In the following two sections, we will describe in more detail successful interface validations for interactions involving PEX3, PEX19, and PEX16 as well as SNRPB and GIGYF1.

PEX3, PEX19, and PEX16

The interaction interface between PEX19 and PEX3 has been structurally resolved before and consists of an interaction between an N-terminal motif in PEX19 that binds to the cytosolic alpha-helical domain of PEX3 (PDB:3MK4, (Schmidt et al, 2010)). Using corresponding protein fragments, AF predicted a structural model that is highly similar to the solved structure (Fig. 5G; Appendix Fig.
4E–G) with similarity to LxxLL motifs known to bind this type S9A,B). We introduced mutations in the PEX19 motif and PEX3 of domain (LIG_NRBOX in ELM DB). We reproducibly found that pocket (Appendix Fig. S9A) and found that F29K in the motif none of the motif mutations in PSMC5 decreased binding to weakened but clearly maintained BRET binding signals indicating ESRRG compared to wildtype while both domain pocket mutations the existence of a secondary binding site between both proteins led to a remarkable reduction in BRET signal (Fig. 4H; Appendix (Fig. 5H; Appendix Fig. S9C,D). Indeed, AF predictions with other 82 Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 © The Author(s) 110 Downloaded from https://www.embopress.org on August 16, 2024 from IP 2a02:3102:4122:c:49b4:6f2f:b500:6ddf. Chop Yan Lee et al Molecular Systems Biology A 7 109 146 330 366 521 D PNKP FHA Phosphatase Kinase 81 i 80 ii 82 83 83 TRIM37 RING BB Helix MATH 80 93131 254 276 404 525-555 930-940 B C i N376 ii F328 S114 P112 RTPESQP TPLVSQDEKRDAELPKKRM E 223-234 F G 127 195 235 458 iii iv M453 M453 ESRRG ZnF Hormone receptor I280 I280 80 iii 91 iv 79 PSMC5 CC OB AAA domain + lid I401 L134 20 69 127 147 391 M138 132-141 399-406 DPLVSLMMVE MSIKKLWK H Motif iii mutated Motif iv mutated Figure 4. Verification of interface predictions for TRIM37-PNKP and ESRRG-PSMC5. (A) Schematic of the domain architecture of PNKP and TRIM37 with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT for the respective interface. Roman numbering refers to structural models in (B) and (C). (B) Structural model of interface i shown in (A) with labeled residues that were mutated. (C) Structural model of interface ii shown in (A). (D) BRET titration curves are shown for wildtype interaction and mutants for two biological replicates, each with three technical replicates. 
Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. The BRET trajectory could not be fitted because of an unusual saturation behavior (see methods for details). (E) Schematic of the domain architecture of ESRRG and PSMC5 with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT for the respective interface. Roman numbering refers to structural models in (F) and (G). (F) Structural model of interface iii shown in (E) with labeled residues that were mutated. (G) Structural model of interface iv shown in (E). (H) BRET titration curves are shown for wildtype interaction and mutants of ESRRG-PSMC5 pairs for two biological replicates, each with three technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. In panels (B), (C), (F), and (G) motif sequences are indicated at the bottom. Gray letters indicate residues not predicted to bind. Source data are available online for this figure. © The Author(s) Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 83 111 Downloaded from https://www.embopress.org on August 16, 2024 from IP 2a02:3102:4122:c:49b4:6f2f:b500:6ddf. 
disordered fragments of PEX19 paired with the PEX3 domain resulted in highly confident models for interfaces involving a binding pocket on PEX3 that is distal to the pocket where the N-terminal PEX19 motif is known to bind. When using a protein fragment that spans the full disordered N-terminal region of PEX19 (1–170), AF predicts the known PEX3-binding motif and helix 4 and 5 to dock into the primary and secondary pocket, respectively (Fig. 5G,I), supporting simultaneous interaction via both interfaces.

While the interaction between PEX3 and PEX16 has been described before, little is known about how both proteins interact with each other. The monomeric AF model of PEX16 shows a helical fold, which could in its entirety be transmembrane (TM). Between the putative TM helix 4 and 5 there is a large loop (132–214), which was predicted by AF with very high confidence to bind to a third pocket on the PEX3 domain, opposite to both binding sites mentioned earlier for PEX19 (Fig. 5G,I,J). Of note, different fragments of this loop as well as the entire PEX16 were repeatedly predicted to bind in similar modes to PEX3, further increasing the confidence in this prediction. Encouraged by these results, we submitted all three full length PEX sequences for complex prediction to AF and obtained a model that supports simultaneous binding of PEX16 and PEX19 to PEX3 (Appendix Fig. S9E). We individually mutated two residues in the PEX16 loop, deleted the loop in its entirety (del162-192), and mutated two residues on PEX3 (highlighted in Fig. 5J). Unfortunately, higher expression levels of PEX16 seem to trigger degradation of PEX3 (Appendix Fig. S9F), which we did not observe for the same constructs when co-expressed with PEX19 (Appendix Fig. S9G). As a consequence, we could not obtain titration curves and BRET50 estimates but obtained reliable BRET signals for lower PEX3-PEX16 DNA transfection ratios showing that the deletion as well as both PEX3 mutants significantly decreased binding to PEX16 (Fig. 5K; Appendix Fig. S9H). Of note, these PEX3 mutants (R54S and E272R) did not alter binding to PEX19, showing that the overall structural integrity of PEX3 was not perturbed by these mutations (Fig. 5H; Appendix Fig. S9D).

[Figure 5, panels A–L: domain architectures of STX1B (syntaxin domain, SNARE domain), FBXO28 (Fbox, helix bundle, extended helix), VAMP2 (synaptobrevin), PEX3 (TM, Helical domain), PEX19 (PMP-binding), and PEX16 (TM domain); structural models of interfaces iii, iv, vi, and vii with labeled residues (V216, G226, S239; W189, R54, K169, E272); BRET plots.]

Figure 5. Verification of interface predictions for STX1B-FBXO28, STX1B-VAMP2, PEX3-PEX19, and PEX3-PEX16. (A) Schematic of the domain architecture of STX1B, FBXO28, and VAMP2 with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT (for order-disorder fragment pairs) or average interface pLDDT (for ordered-ordered fragment pairs) for the respective interface. Roman numbering refers to structural models in (B), (D), Appendix Fig. S8E, and Appendix Fig. S8I. (B) Structural model of interface iv shown in (A). In panel (B) and (D), the chains are color-coded according to the colors of the domains in (A). (C) BRET titration curves are shown for wildtype interactions and deletion constructs for two biological replicates, each with three technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. (D) Structural model of interface iii shown in (A) with tested pathogenic mutations labeled and colored in green. (E, F) BRET titration curves are shown for wildtype interactions and deletion constructs for two biological replicates, each with three technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. (G) Schematic of the domain architecture of PEX3, PEX19, and PEX16 with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT for the respective interface. Roman numbering refers to structural models in (I), (J), and Appendix Fig. S9A. Region vi covers residues 1–170, which includes the previously reported N-terminal motif as well as three putative motifs suggested by the AF models. (H) BRET titration curves are shown for wildtype interaction and mutants of PEX3-PEX19 pairs for three technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. The left plot displays mutants aimed at disrupting binding between PEX3-PEX19 while the right plot displays mutants aimed at disrupting the PEX3-PEX16 PPI, which is why binding between PEX3-PEX19 should not be altered. (I) Superimposition of structural models of interface vi (PEX3-PEX19) and vii (PEX3-PEX16) on the PEX3 domain. Note that modeling smaller fragments of PEX19 generates alternative interactions with the binding sites. (J) Structural model of interface vii shown in (G). (K) BRET values with subtracted bleedthrough for PEX3-PEX16 wildtype and various mutated constructs. Three technical replicates are shown. (L) Proposed model for how the trimeric complex of PEX3, PEX19, and PEX16 might assemble at the peroxisomal membrane. Source data are available online for this figure.

PEX3 and PEX19 are peroxin proteins that regulate peroxisome homeostasis. PEX16 is believed to serve as an integral membrane-bound receptor for PEX3 (Matsuzaki and Fujiki, 2008) while PEX3 is thought to serve as a docking site for PEX19 (Fujiki et al, 2006). PEX19 in turn is a cytosolic carrier for peroxisomal membrane proteins to the peroxisome (Fujiki et al, 2006). Combining results from previously published functional studies with the structural and experimental results obtained in this study, a model for a trimeric complex between PEX3, PEX19, and PEX16 emerges (Fig. 5L) where PEX16 fully inserts into the peroxisome membrane via a fold that consists of seven helices (residues 19-286) with its N-terminal end being cytosolic and its C-terminal end protruding into the peroxisome. The extended loop between TM helix 4 and 5 reaches into the cytosol and docks onto PEX3, which is further anchored into the peroxisomal membrane via its N-terminal TM helix (residues 13–45). PEX19 docks onto PEX3, opposite to where PEX16 is bound, via two interaction surfaces: one corresponding to the known PEX3-binding motif in PEX19 and a second one corresponding to a novel motif (residues 99–146) docking at a hitherto unknown second binding site on PEX3 for PEX19. This model explains how PEX3 is anchored to the peroxisomal membrane via PEX16 and how PEX3 can bind PEX19 very tightly, which can then deliver PMPs to the peroxisome. Mutations in any of the three PEX proteins are associated with severe developmental phenotypes referred to as peroxisome biogenesis disorders (Fujiki et al, 2022). The vast majority of the around 150 mutations annotated for the three proteins are uncharacterized (Henrie et al, 2018), dozens of which fall into the predicted interfaces. The structural models obtained from this work can inform future studies aimed at characterizing the effects of these mutations.

SNRPB and GIGYF1

AF predicted two different types of interfaces with high confidence for the interaction between SNRPB and GIGYF1. The first interface involves the LSM domain of SNRPB, which was predicted to bind to various fragments in the long disordered regions of GIGYF1 (Fig. 6A). These regions do not display any common sequence pattern. The structure of SNRPB has been resolved as part of the Sm ring complex that binds small nuclear RNA (PDB:4WZJ, (Leung et al, 2011)), showing that the surface on the LSM domain predicted to bind to disordered fragments of GIGYF1 is actually engaged in binding LSM domains of other Sm proteins within the complex (Fig. 6B). We thus conclude that these predictions are likely wrong. The second type of interface predicted by AF involves the GYF domain in GIGYF1 and multiple short disordered fragments in the C-terminal region of SNRPB, which repeatedly carry the sequence PPPGM(R) (Fig. 6A,C). We designed various deletion constructs of SNRPB that would gradually remove more and more of the repeated proline-rich motif. We observed, using the BRET assay, that these deletion constructs gradually decreased binding to GIGYF1 (Fig. 6D; Appendix Fig. S10A,B). We also mutated the GYF domain pocket and found that W498E but not L508F would decrease binding to SNRPB (Fig. 6D,E; Appendix Fig. S10A–D). To further corroborate these findings we performed a co-immunoprecipitation experiment, where endogenous GIGYF1 interacted with HA-tagged full length SNRPB (Fig. 6F). This

[Figure 6, panels A–G: domain architectures of SNRPB (LSM) and GIGYF1 (GYF); structural models of interfaces i and ii with labeled residues W498 and L508; comparison with PDB:4WZJ and PDB:7RUQ; motif sequences PPPGMRPPRP and RPPPGLTN; immunoblot of 5% input and HA-IP probed for HA (SNRPB), GIGYF1, and GAPDH (control).]

Figure 6.
Verification of interface predictions for SNRPB-GIGYF1. (A) Schematic of the domain architecture of SNRPB and GIGYF1 with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT for the respective interface. Roman numbering refers to structural models in (B) and (C). (B) Structural model of interface ii shown in (A) (left) and in comparison a solved structure (PDB:4WZJ) of the Sm ring complex (right) bound to RNA (orange). The LSM domain of SNRPB is shown in cyan. The position of the predicted motif (left) or neighboring LSM domain of SNRPD3 (right) are indicated in gold. Black circles indicate the predicted interface in the model and corresponding interface in the complex on the LSM domain of SNRPB. (C) Structural model of interface i shown in (A) with tested domain mutations labeled and colored green. The motif sequence is indicated at the bottom. (D, E) BRET titration curves are shown for wildtype interactions, deletion constructs of SNRPB, and single point mutants in GIGYF1 for two biological replicates, each with three technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. (F) Cropped immunoblot of input (5%) and HA antibody immunoprecipitation (IP) performed in parental HEK cells (empty, untagged negative control), Snrpb(full-length, 1-231)-2xHA-mNeonGreen, Snrpb(1-190)-2xHA-mNeonGreen expressed from a single locus in Flp-In™ T-REx™ 293 Cell Lines. The HA antibody was used for detecting the immunoprecipitated Snrpb-proteins, endogenous GIGYF1 was detected with GIGYF1 antibody, GAPDH serves as a loading and negative-IP control. The experiment was performed twice with equivalent outcome, one representative experiment is shown. (G) Solved structure (PDB:7RUQ) of the GYF domain of GIGYF1 bound to a proline-rich motif in TNRC6C. The sequence of the motif in TNRC6C is indicated. 
Source data are available online for this figure.

interaction appeared less pronounced upon truncation of the C-terminal proline-containing region of SNRPB (Fig. 6F). This further suggests that both proteins interact with each other in cells and that this interaction is stabilized by the predicted interface. During the course of these studies, a structure was published (PDB:7RUQ, Sobti et al, 2023) showing binding of the GYF domain of GIGYF1 to a motif of sequence PPPGL of the protein TNRC6C, confirming the binding mode predicted by AF where a hydrophobic residue (M or L) inserts into a hydrophobic pocket and where the proline residues contact the surrounding domain surface (Fig. 6C,G). Interestingly, this hydrophobic pocket does not exist in the previously solved structure of the GYF domain of CD2BP2 binding to a proline-rich peptide that is flanked by positively charged residues establishing important contacts with the domain (PDB:1L2Z, (Freund et al, 2002)). This structure formed the basis for the definition of the LIG_GYF motif class in the ELM DB. The recently resolved structure of the GYF domain of GIGYF1 together with our structural models and experimental validations argue for an extension of the existing motif definition or definition of a new motif subclass.

Discussion

AF has revolutionized the field of structural bioinformatics and has sparked much excitement about its potential to predict structures of interacting proteins and bringing us closer to a structurally resolved protein interactome. However, from existing studies it largely remained unclear whether AF’s performance depends on the type of interfaces and the length of submitted protein chains for interface prediction, which metrics perform best in identifying likely correct structural models of interfaces, how specific AF predictions are, and to which extent highly confident structural models can be experimentally corroborated. In this study, we showed that AF performs similarly well for interfaces between folded domains and interfaces formed between a folded domain and a short linear motif. Using minimal interacting regions for interface prediction we reached sensitivities of up to 80%, similar to previously published work (Tsaban et al, 2022; Johansson-Åkhe et al, 2021). We thoroughly investigated AF’s FPR using random domain-motif pairs and found it to be around 20%. However, asking AF to discriminate binders from non-binders when motif sequences carried one disruptive mutation, we found that prediction accuracies were close to random. This points to an important limitation in AF’s ability to predict binding specificities and is in line with previous reports on AF’s inability to predict the effect of mutations (Buel and Walters, 2022). Comparison of different metrics to discriminate good from bad structural models using either minimal interacting fragments or extensions revealed the average interface pLDDT for DDI models and the motif interface pLDDT for DMI models to be the most robust and best performing metrics. However, when manually inspecting AF predictions we found it useful to also consider AF’s model confidence, suggesting that in the future a combination of different metrics might be even more powerful to discriminate good from bad structural models. The alignment depth has been previously reported to somewhat influence model accuracy (Bryant et al, 2022). While this feature was not investigated here, it might serve as a pre-filter to identify PPIs of high conservation for which structural modeling will likely be more successful. Interestingly, the number of residues or atoms predicted to be in contact with each other was poorly predictive, in contrast to a previous report (Bryant et al, 2022), confirming our observations that the tested AF versions in this study will always put both chains in contact with each other to create atomic contacts, and from visual inspection alone it is very challenging to tell good from bad structural models. Of note, observed differences in AF performance across studies likely originate both from using different benchmark datasets and different AF versions. Our study is unique in that it assesses multiple metrics on two different classes of interfaces, DMIs and DDIs, using two different AF versions. More work is needed to develop benchmark datasets of coiled-coil and disorder-disorder interfaces to also evaluate AF’s performance for these modes of binding. Of note, our benchmark datasets almost exclusively consisted of structures that AF has seen in the training process. Interestingly, benchmark studies done with unseen structures reported similar sensitivities (preprint:Bret et al, 2023), indicating that AF is not strongly biased towards structures it has seen before.

We extensively explored the influence of protein fragment length on AF’s performance and found that slight extensions of minimal motif sequences can improve prediction accuracies. Inspection of individual cases revealed novel information on important motif sequence context that was so far missing in corresponding motif entries at the ELM DB. However, longer disordered fragments or fragments containing ordered and large disordered regions generally decrease AF prediction accuracies as also reported in a recent preprint (preprint:Bret et al, 2023). Furthermore, optimal cutoffs for various metrics such as the model confidence decreased when using longer protein fragments, making them less robust for interface prediction with AF. When evaluating performance differences for longer and shorter protein fragments we identified three DMI pairs involving the motif classes DEG_APCC_KENBOX_2, LIG_Pex14_3, and LIG_GYF, for which, during fragment extension, a second known motif occurrence was added to the fragment. This second motif was selected by AF during interface prediction, displacing the original motif and leading to a high RMSD score. We removed these instances from the dataset when evaluating AF’s performance on fragment extension but they point to biologically correct variability in AF prediction outcomes due to existing multivalency of many DMIs in protein interactions. Other work suggested that AF is able to select the stronger binder among two motif occurrences (Chang and Perez, 2023), which might at least in some cases guide AF motif selections. However, in other cases this motif preference might also hinder discovery of multivalency in PPIs. For example, the use of smaller protein fragments for the protein pair SNRPB and GIGYF1 enabled the discovery of a proline-rich repeat motif in SNRPB.

In comparison to predictions made using full length proteins (Burke et al, 2023) we found that protein fragmentation increased the probability of obtaining a high confidence interface prediction, especially for cases involving proteins with long disordered regions such as GIGYF1. For smaller and more globular proteins like the PEX proteins studied above, full length predictions can identify the right binding sites but these can be further substantiated by running additional predictions with smaller fragments. The fragmentation approach increases the number of prediction runs per protein pair from one to a couple hundred, depending on the length and modularity of both proteins. The vast majority of these fragment pairs should not interact. With an FPR of 20%, this means that more actual non-interacting than truly interacting fragment pairs will result in a high confidence prediction. A big challenge is thus to identify likely correct interface predictions among the many false ones. This is also illustrated by the prediction results that we obtained for the seven protein pairs that we followed up experimentally. Clearly, AF’s general limited specificity contributes to these false predictions. We observed that additional sources of error can arise from exposed intramolecular binding sites resulting from fragmentation, incorrectly designed boundaries of folded regions, and docking of protein fragments into enzymatic pockets of metabolic enzymes or sites for metal ion, DNA, or RNA binding. It seems that AF is overall well suited to find binding pockets on folded domains. However, our work also clearly demonstrates that AF is able to correctly dock the matching partner structure into these pockets without the need for a pre-existence of both partner structures in the bound conformation, contrary to other state-of-the-art docking algorithms. AF’s high sensitivity with respect to intramolecular binding sites and wrongly fragmented folded regions will make it particularly hard to fully automate the fragment design process. Despite these challenges we found that recurrent interface predictions from overlapping fragments can help gain confidence in predictions, as also highlighted in a recent study (Bronkhorst et al, 2023), since we rarely observed this recurrence for likely wrong predictions.

Given the reported uncertainties in AF predictions, even for high confidence cutoffs, experimental validation is essential. The BRET assay used here has been shown in previous studies to be sensitive enough to quantify weakening of binding introduced by point mutations and to detect motif-mediated PPIs (Ebersberger et al, 2023; Trepte et al, 2018; Mo et al, 2022). Using the BRET assay, we were able to detect 11 out of 28 PPIs from the HuRI dataset. This retest rate is actually higher compared to retest rates of gold standard PPI datasets used in the past to benchmark various binary PPI assays including this BRET assay, attesting to the overall detectability of PPIs from HuRI (Braun et al, 2009; Trepte et al, 2018; Choi et al, 2019). The NL and mCit fusions used in the BRET assay allowed us to monitor the expression levels of wildtype and mutant constructs, which is important to rule out loss of binding because of a destabilization of the protein. However, we cannot exclude the possibility that some expressed mutants might still be partially unfolded or mislocalized and thus, some loss of binding detected in our study could be unspecific and not the result of a specific perturbation of the predicted interface. Furthermore, preservation of binding observed for some other mutants at the predicted interface might result from the mutations not being disruptive enough and thus does not necessarily disprove the predicted interface.

Despite these limitations, we were able to assess the validity of seven interface predictions using experimentation. We discovered a likely novel DMI type that mediates binding between PEX3 and PEX16, and proposed a model for how PEX3, PEX16, and PEX19 form a trimeric complex at the peroxisomal membrane. We also validated a variation of the LIG_GYF motif class in SNRPB that mediates binding to GIGYF1, thereby potentially connecting mRNA splicing with posttranscriptional control mechanisms. These results confirm in principle that AF is able to predict novel interface types and that it can be used to extend existing interface type definitions. However, our experimental results also highlight clear limitations of AF predictions. Our data suggest that FBXO28 and STX1B as well as STX1B and VAMP2 interact via coiled-coil interfaces but likely at higher stoichiometries and different conformations than predicted. We confirmed the binding pocket in ESRRG but not the predicted interfaces in PSMC5 and we could not substantiate interface predictions for TRIM37 and PNKP. Highly confident interface predictions were obtained for seven additional PPIs that await experimental validation. In summary, we provided experimental evidence and structural information for PPIs whose disruption is likely associated with neurodevelopmental disorders. This information can be explored in future studies aimed at delineating potential molecular mechanisms causing disease. Our study furthermore laid out clear limitations, perspectives, and future needs in AI-based structure prediction to bring us closer to a fully structurally annotated human protein interactome.

Methods

Selection of structures for DMI benchmark dataset

To gather a list of ELM classes with structural evidence and annotate their minimal interacting fragments, we downloaded a dataset of solved structures of all ELM classes from ELM DB on 08.10.2021 (ELM class version 1.4) for instances that are annotated as true positives (Kumar et al, 2022). The structures were subject to a series of manual inspections to check their validity for further analysis. First, since AlphaFold can only model the 20 standard amino acids, we excluded any structures with post-translational modifications in the motif. Second, structures that do not resolve all of the residues in a motif as curated by ELM DB were excluded. Third, we restrict our studies to only binary interactions, so DMIs that require more than two proteins to form the binding interface were excluded. Likewise, DMIs with only intramolecular interaction evidence were excluded. We manually annotated the boundaries of the domains by visual inspection of the structures. After this filtering, we identified 136 structures from distinct ELM classes that formed our DMI benchmark dataset (Dataset EV2).

Sequence identity of the domains in the DMI benchmark dataset
We took all the binding domains in the DMI benchmark dataset and computed their pairwise sequence identity from a global alignment without gap penalties. Matching residues were given a score of 1, otherwise 0. The sum of these scores was divided by the length of the longer sequence to compute the sequence identity.

Selection of structures for the DDI benchmark dataset

We randomly selected 80 pairs of Pfam domain types that were described in the 3did resource (Mosca et al, 2014) to be in contact with each other in solved structures in the Protein Data Bank (PDB). We manually inspected all PDB entries listed to contain contacts between instances of a given Pfam domain pair until we found one that we considered a genuine domain-domain interaction. These decisions were primarily based on the number of atomic contacts observed and the validity that two folded domains were interacting with each other. Out of the 80 selected Pfam domain pairs, we identified 48 DDI types and 48 corresponding approved DDI structural instances that we selected for the DDI benchmark dataset. The sequences of the minimal interacting domain regions were manually annotated by visual inspection of the structures and used for prediction. A more detailed description of the curation procedure and information on the pairs will soon be published elsewhere (Geist et al, in preparation).

Generation of random reference sets with minimal interacting regions

Mutating motif sequences
Key conserved residues of the motifs in the DMI benchmark dataset were identified computationally using the regular expression of the corresponding ELM class in the ELM DB and SLiMSearch (Krystkowiak and Davey, 2017). The defined positions are any positions in the regular expression that are not wildcards. To mutate the key residues to the ones with opposite physico-chemical properties, we substituted one or two key residues with the ones that are of the largest Miyata distance (Miyata et al, 1979) (Dataset EV2).

Randomizing pairings of known domain-motif interfaces
To simulate non-binding domain-motif pairs, we randomized the pairings of known domain-motif interfaces. As some domain types can bind to motifs from distinct ELM classes, we manually checked that the randomized pairings did not coincide with actual domain-motif interface types (Dataset EV2).

Randomizing pairings of known domain-domain interfaces
The pairings between known domain-domain interfaces were randomized to form the random reference set for DDIs.

Generation of positive DMI reference set with fragment extensions

Among the 136 solved structures that we selected previously, we further filtered for structures that consist of only human proteins. To test the potential effect of extension on DMIs that were predicted with different accuracies in their minimal forms, we

This resulted in 563 and 540 predictions from the positive reference set extensions for AF v2.2 and v2.3, respectively.

Selection of reference datasets for comparison of AF v2.2 with v2.3

All predictions for the minimal DMIs and the random DMIs involving minimal fragments were successfully modeled by both versions of AF. Some extensions from the positive reference set were not successfully modeled by AF v2.2 and v2.3 due to failure from HHblits. To compare AF v2.2 with v2.3, we used only predictions that were successfully modeled by both versions of AF. This resulted in 616 predictions from the extensions of the positive reference set.
selected 12 DMI types from the correct sidechain category, 8 DMI types from the correct backbone category and 11 DMI types from Evaluation of AF sensitivity and specificity when using the correct pocket category as determined using the motif RMSD the fragmentation approach calculation. In total, 31 DMI types were selected for extension. Three additional DMI types were originally selected but later on Among the 34 DMIs selected for extension, we further selected 20 discarded because they contained secondary motif occurrences DMIs and retrieved the PPIs mediating these DMIs as the PRS and complicating data analysis. The extensions were done on the randomized their pairing to form random domain-motif protein canonical sequence of the proteins used to solve the structure. pairs as the RRS. The 20 PPIs from the PRS and the 20 protein pairs Motif extension 1 extended the motif sequence at both N and C from the RRS were subjected to the fragmentation approach, termini by n residues where n is the length of the known motif. generating 8943 fragment pairs and 11,045 fragment pairs for the Motif extension 2 further extended the motif sequence by another n PRS and RRS, respectively. All fragment pairs from the PRS and all residues at both termini. Motif extension 3 and 4 each extended the but one fragment pair from the RRS resulted in an AlphaFold motif sequence by 2n residues at both termini. Motif extension 5 model. Models were deemed highly confident, if the disordered extended the motif sequence by including neighboring domains fragment had a motif interface pLDDT of ≥70 or, in case of and motif extension 6 used the full-length protein sequence. On the ordered-ordered models, the average interface pLDDT scored ≥70. 
domain side, domain extension 1 extended the domain sequence to To evaluate the sensitivity of the fragmentation approach, we include the disordered regions N- and C-terminally of the binding considered all models that met the above mentioned cutoffs and domain until it reached neighboring domain(s) boundaries. which contained the motif and domain sequence. We super- Domain extension 2 included the sequence region of the imposed the models onto the corresponding native structures using neighboring domains and domain extension 3 used the full- the minimal domain and computed the RMSD between the length protein sequence. In cases where the known motif or binding minimal motif residues in the native and modeled structure. A domain is at the C terminus, we extended the motif or domain model was deemed accurate if the motif RMSD was ≤5 Å. At this sequence on only the N terminus and vice versa. There were some cutoff the backbone of the native and modeled motif are well cases where the last extension steps, motif extension 6 and domain aligned but not necessarily their side chains (see also RMSD extension 3, extended the protein minimally (<20 residues N or C subsection below). We repeated the same procedure for each DMI terminal to the previous extension step). These cases were excluded protein pair using full length sequences as input into AF for from the analysis. The dataset of extended DMIs is in Dataset EV5. modeling. In 18 cases AF did not return a model when using full In total, 709 fragment pairs were submitted to AlphaFold. From length sequences. Here, we used the largest protein fragments these, 632 and 616 were successfully modeled by AF v2.2 and v2.3, instead for which AF returned a model. Information on the protein respectively. pairs, prediction results, and statistics is available in Dataset EV9. 
Generation of random DMI reference set with AlphaFold versions and runs fragment extensions We used local installations of AlphaFold Multimer version 2.2.0 To generate a random reference set using the extensions, we and 2.3.0 (preprint:Evans et al, 2021) for all protein complex randomized the pairings of the 34 DMI types that we selected for predictions with the following parameters: extensions and paired their extensions for prediction. Motif --max_template_date=2020-05-14 extension 6 and domain extension 3 were excluded from the --db_preset=full_dbs pairing. The dataset of DMIs with random pairings and their --use_gpu_relax=False extensions can be found in Dataset EV6. In total, 612 predictions For every AlphaFold run, five models were predicted with single were generated, among which 566 and 522 predictions were seed per model by setting the following parameter: successfully modeled by AF v2.2 and v2.3, respectively. Since motif --num_multimer_predictions_per_model=1 extension 6 and domain extension 3 were excluded from the The databases queried during AlphaFold predictions were random reference set using the extensions, we also excluded them specified following the instructions from the github page of from the positive reference set extensions during ROC analysis. AlphaFold © The Author(s) Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 89 117 Downloaded from https://www.embopress.org on August 16, 2024 from IP 2a02:3102:4122:c:49b4:6f2f:b500:6ddf. Molecular Systems Biology Chop Yan Lee et al (https://github.com/deepmind/alphafold#running-alphafold): motif chain in AlphaFold models and the motif chain in the solved For running AlphaFold Multimer v2.2, the following databases structure. 
To ensure that the RMSD calculation was done based on all were queried: atom identifiers and without any outlier rejection refinement, the --bfd_database_path=bfd_metaclust_clu_complete_id30_c90_- arguments of the rms_cur command, matchmaker and cycles, were set final_seq.sorted_opt to 0. Prediction accuracy categories were defined based on motif RMSD --mgnify_database_path=alphafold_v220_databases/ cutoffs: RMSD ≤ 2 Å for correct sidechain, between 2 Å and 5 Å for mgy_clusters_2018_12.fa correct backbone, between 5Å and 15 Å for correct pocket and >15 Å --obsolete_pdbs_path=alphafold_v220_databases/pdb_mmcif/ for wrong pocket. obsolete.dat --pdb_seqres_database_path=alphafold_v220_databases/ DockQ pdb_seqres/pdb_seqres.txt The calculation of DockQ scores of AlphaFold models was done in --template_mmcif_dir=alphafold_v220_databases/pdb_mmcif/ reference to their solved structures using the code available on the mmcif_files github repository of DockQ (https://github.com/bjornwallner/ --uniprot_database_path=alphafold_v220_databases/uniprot/ DockQ, (Basu and Wallner, 2016). DockQ classification was done uniprot.fasta using the cutoffs provided by DockQ (DockQ: <0.23 for incorrect, --uniclust30_database_path=alphafold_v220_databases/uni- between 0.23 and 0.49 for acceptable, between 0.49 and 0.80 for clust30/uniclust30_2018_08/uniclust30_2018_08 medium and ≥0.80 for high). --uniref90_database_path=alphafold_v220_databases/uniref90/ uniref90.fasta pDockQ For running AlphaFold Multimer v2.3, the following databases The calculation of pDockQ of AlphaFold models was done by were queried: adapting the code available on the github repository from the --bfd_database_path=alphafold_v230_databases/bfd/ Elofsson lab (https://gitlab.com/ElofssonLab/FoldDock/-/blob/ bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt main/src/pdockq.py, (Bryant et al, 2022)). 
The pDockQ score is --mgnify_database_path=alphafold_v230_databases/mgnify/ created by fitting a sigmoidal curve to the DockQ scores of a series mgy_clusters_2022_05.fa of AlphaFold predicted models. The score takes into account the --obsolete_pdbs_path=alphafold_v230_databases/pdb_mmcif/ number of interface contacts as well as their pLDDT scores. Of obsolete.dat note, the calculation of pDockQ score takes Cβs (Cα for glycine) --pdb_seqres_database_path=alphafold_v230_databases/ from different chains within 8 Å from each other as interface pdb_seqres/pdb_seqres.txt contacts which is different from our interface definition (see the --template_mmcif_dir=alphafold_v230_databases/pdb_mmcif/ subsection below Domain chain and motif chain interface pLDDT mmcif_files and average interface pLDDT). --uniprot_database_path=alphafold_v230_databases/uniprot/ uniprot.fasta iPAE --uniref30_database_path=alphafold_v230_databases/uniref30/ The calculation of iPAE of AlphaFold models was done by adapting UniRef30_2021_03 code available on the github repository https://github.com/fteufel/ --uniref90_database_path=alphafold_v230_databases/uniref90/ alphafold-peptide-receptors/tree/main (Teufel et al, 2023). The iPAE is uniref90.fasta the median predicted aligned error at the interface. The authors To test the effect of template use on prediction accuracy, the consider residues in contact if their distance is below 0.35 nm (3.5 Å). following parameter setting was used to switch off the use of The iPAE score could not be calculated for models generated by templates during the prediction: AlphaFold Multimer version 2.3.0 due to JAX dependency of the pickle --max_template_date=1950-01-01 files generated by AlphaFold Multimer version 2.3.0. For the fragmentation approach, the multiple sequence align- ments (MSAs) of a given protein fragment can be reused in Model confidence subsequent runs where the same fragment is involved. 
The MSAs The model confidence of AlphaFold models was extracted from the were first moved to the prediction output folder and the following ranking_debug json file. The model confidence is a weighted parameter was added to enable the reuse of MSAs. combination of pTM and ipTM to account for both intra- and --use_precomputed_msas=True interchain confidence: For efficient computing, we segregated the MSA generation part by using only the CPUs and the model fitting part using the GPUs. model confidence ¼ 0:8 " ipTM þ 0:2 " pTM Calculation of metrics for structural models Domain chain and motif chain interface pLDDT and average Motif RMSD interface pLDDT We used the software PyMOL (TM) Molecular Graphics System, Since AlphaFold conveniently stores the pLDDT confidence Version 2.5.0. Copyright (c) Schrodinger, LLC., for the superimposition measure for each residue in the B-factor field of the output PDB of AlphaFold models with corresponding solved structures. First, we files, the pLDDT of residues at the interface was parsed from the used the align command to align the domain chain in AlphaFold output PDB files of AlphaFold. Residues at the interface are defined models with the domain chain in the solved structure. Then, we used as those that have at least one heavy atom that is less than 5 Å away the rms_cur command to calculate the all-atom RMSD between the from any heavy atom of the other chain (calculated using the 90 Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 © The Author(s) 118 Downloaded from https://www.embopress.org on August 16, 2024 from IP 2a02:3102:4122:c:49b4:6f2f:b500:6ddf. Chop Yan Lee et al Molecular Systems Biology PyMOL API). The pLDDT of the residues at the interface from the AF prediction. The ELM instances were extended at both N and C domain chain and motif chain was averaged to compute the termini by n residues where n is the length of the ELM instance, domain chain and motif chain interface pLDDT, respectively. 
The according to the benchmarking results. The minimal binding domains pLDDT of all the residues from both chains was averaged to of the ELM instances were detected in the interaction partner using compute the average interface pLDDT. Pfam HMMs (Mistry et al, 2021). As the domain boundaries detected by Pfam HMMs could be inaccurate, we also extended the domain Residue-residue and atom-atom contacts sequence at the N and C terminus by 20 residues to ensure that the Following the interface definition above, the number of unique whole folded region was covered. The predictions were performed using residue-residue and atom-atom contacts were also quantified as AF version 2.3.0. To select a subset of these motif classes, where we can measurements to assess AlphaFold models. do experimental testing, we also used the InParanoid resource (Persson & Sonnhammer, 2023) to map ELM instances where both proteins are Mean DockQ between predicted models from mouse to their human orthologs. To verify that they indeed do not The top five models generated by AF, determined based on their have structural homologues in the PDB, we both used the SIFTS model confidence, were considered for computing this metric. To mapping (Dana et al, 2019) between the Pfam domain in ELM and the quantify the similarity among the models, we computed DockQ PDB and also looked at the ELM classes that were listed as homologs on scores between all possible pairs of models by taking the higher the ELM website. ranked model as the “template” model and lower ranked model as the “predicted” model. The mean of these DockQ scores is taken as Evaluation of effect of fragment extensions on AF the similarity among the models in a given prediction. This prediction accuracies calculation was done for AF models of minimal DMIs and their randomizations for ROC analysis. The data were stored in Dataset We superimposed the AF models generated with DMI extensions EV2. 
onto the corresponding solved DMI structures to quantify AF prediction accuracy using motif RMSD calculations. To this end, Quantification of motif properties we aligned the two structures on their minimal binding domains and calculated the all-atom RMSD between the minimal motif in Motif hydropathy score and symmetry score the extension AF model and the minimal motif in the solved By referring to the Kyte-Doolittle hydrophobicity scale, (Kyte & structure. To determine potential differences in DMI prediction Doolittle, 1982) the hydropathy scores of the amino acids in a given accuracy when using minimal versus extended protein fragments, motif were summed and averaged to compute the average we computed the log2 fold change of the all-atom motif RMSD hydropathy of the motif. The average motif symmetry score was before and after extension. computed by taking the sum of the absolute difference of ! " all atom RMSD motif hydropathy scores between motif position n and n - motif length Fold change in prediction accuracy ¼ log minimal DMI2 all atom RMSD motif + extended DMI1 and division of this sum by half of the motif length: Pa jðH $H Þj Peptide symmetry score ¼ n¼1 n x$nþ1 a Fragment design and fragment pairing for fragmentation approach where x is the length of the motif and a is the floor division of x by 2. We first inspected the monomeric structural models from the AlphaFold database (Varadi et al, 2022; Jumper et al, 2021) of both Motif probability interacting proteins to determine the boundaries of their ordered The motif probability reflects the degeneracy of a given motif class and coiled-coil regions, which were also treated as “ordered”. All as quantified by its regular expression that is annotated in the ELM regions that were not annotated as ordered were annotated as DB. The motif probability was retrieved from the ELM DB disordered. In some cases, an extended loop with low pLDDT can version 1.4. be found within an ordered region. 
As they can also potentially carry a motif or mediate interactions in another way, these regions Secondary structure elements of motifs were also annotated as disordered in addition to their annotation as We extracted the secondary structure elements of motifs using the being part of a larger ordered region. The disordered regions of the PyMOL API. In cases where the motif adopts partial secondary proteins were fragmented into fragment sizes of 10, 20 and 30 structure, such as loop-helix-loop or loop-strand-loop, they are residues. To allow AF to sample continuous sequences, we also treated as helical or strand, respectively. generated another set of fragments of same sizes that overlap with the previous fragments by sliding the sequence by half the size of Selection of motif classes from ELM DB without the fragment. The unfragmented disordered regions, as well as their annotated structural instances and prediction with AF fragments, from one protein were then paired with the ordered regions from its interacting partner and vice versa for prediction. By querying the ELM DB for all ELM classes, we retrieved a list of ELM The ordered regions from both proteins were also paired for classes and the number of instances with a structure solved (column prediction. We decided to manually define boundaries between #instances_in_PDB). We filtered for ELM classes with 0 instance- ordered and disordered regions because testing available code s_in_PDB and selected 205 instances out of the filtered ELM classes for developed for this purpose, like clustering using the PAE matrix, © The Author(s) Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 91 119 Downloaded from https://www.embopress.org on August 16, 2024 from IP 2a02:3102:4122:c:49b4:6f2f:b500:6ddf. Molecular Systems Biology Chop Yan Lee et al turned out to be too inaccurate. 
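As an illustration of the fragmentation scheme described in this section, the following Python sketch tiles a disordered region into fragments of 10, 20 and 30 residues, adds a second set shifted by half the fragment size, and pairs the resulting pieces with the ordered regions of the interaction partner. It is a minimal sketch and not the code used in the study: the function names (`tile`, `fragment_region`, `pair_regions`), the 1-based inclusive coordinates, the dictionary layout and the decision to keep a shorter final fragment are all assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # 1-based, inclusive residue range (assumed convention)


def tile(start: int, end: int, size: int, offset: int = 0) -> List[Region]:
    """Cut [start, end] into consecutive windows of `size` residues.

    Tiling begins `offset` residues after `start`. Keeping a shorter
    final window is an assumption of this sketch.
    """
    windows = []
    pos = start + offset
    while pos <= end:
        windows.append((pos, min(pos + size - 1, end)))
        pos += size
    return windows


def fragment_region(start: int, end: int, sizes=(10, 20, 30)) -> List[Region]:
    """Return the unfragmented region plus all tiled and half-shifted fragments."""
    fragments = {(start, end)}  # a set removes duplicate coordinates
    for size in sizes:
        fragments.update(tile(start, end, size))
        fragments.update(tile(start, end, size, offset=size // 2))
    return sorted(fragments)


def pair_regions(a: Dict[str, List[Region]], b: Dict[str, List[Region]]):
    """Build (region in protein A, region in protein B) pairs for prediction.

    Disordered pieces of one protein are paired with ordered regions of the
    other, in both directions; ordered regions of both proteins are paired too.
    """
    pairs = []
    for region in a["disordered"]:
        pairs.extend(product(fragment_region(*region), b["ordered"]))
    for region in b["disordered"]:
        pairs.extend((o, f) for f, o in product(fragment_region(*region), a["ordered"]))
    pairs.extend(product(a["ordered"], b["ordered"]))
    return pairs
```

Under these assumptions, `fragment_region(1, 30)` yields ten distinct pieces for a 30-residue disordered region, including the unfragmented region itself, which shows how quickly the number of fragment pairs grows with protein length.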
We observed that erroneous removal of residues close to the domain borders that are still contributing to the folding of a structured domain can heavily mislead AF predictions.

Selection of NDD proteins

A list of NDD genes was assembled using whole exome and whole genome sequencing studies of cohorts of NDD patients from Gene4Denovo (Zhao et al, 2020) and Deciphering Developmental Disorders (DDD) study (Firth et al, 2011), respectively. From Gene4Denovo, we selected genes linked to autism-spectrum disorders (ASD), intellectual disability (ID), epilepsy (EE), undiagnosed developmental disorders (UDD) and NDDs in general. Genes with non-coding mutations as well as genes with a false discovery rate (FDR) ≥ 0.05 were excluded. Similarly, in the DDD study, genes associated with developmental disorders with a neurological component, as well as genes found to be mutated in at least three children with NDDs (labeled as confirmed genes) were retained. The final list included 984 NDD-risk genes. We filtered the HuRI network (Luck et al, 2020) for interactions mediated exclusively by proteins from this NDD gene list resulting in 67 PPIs excluding self-interactions. Since our fragmentation approach generates many fragments, we did not consider PPIs involving proteins that are more than 1500 amino acids in length, resulting in a final list of 62 PPIs that were subjected to AF modeling.

Manual inspection of interface predictions for NDD-NDD PPIs and selection for experimental validation

Paired fragments from NDD-NDD PPIs were predicted using AF version 2.2 and the prediction results are stored in Dataset EV10. Based on our benchmarking results, we started by manually inspecting all NDD-NDD PPIs that obtained at least one structural model with either a motif chain interface pLDDT of ≥70 for the disordered fragment or with an average interface pLDDT ≥ 70 for structural models with predicted ordered-ordered interfaces (DDIs). However, during the course of these manual inspections, we found that using in addition a model confidence of ≥0.7 for ordered-ordered fragment pairs helped to discriminate good from bad structural models. We inspected the ranked_0 models for all fragment pairs that met the above cutoffs but also inspected models scoring somewhat below these cutoffs. For every NDD-NDD PPI we used Interactome3D (Mosca et al, 2013) and PDB database searches (https://www.rcsb.org/ (Berman et al, 2000)) to identify whether a structure already existed for this PPI. In our evaluation of the structural models we also considered if a certain interface was recurrently predicted for different overlapping fragments because this usually hints at increased confidence in the correctness of the interface prediction. We furthermore explored the number and kind of residue-residue contacts predicted by AF by visual inspection of the structural models using PyMOL. We searched for functional annotations and existing structures for the monomers using the PDB, ProViz (Jehl et al, 2016), SMART (Letunic et al, 2021), and the scientific literature to identify enzymatic pockets or binding interfaces for DNA, RNA, or metal ions. Observations and justifications for the final evaluation of the predictions for every NDD-NDD PPI are provided in Appendix Supplementary Text S1.

Based on clone availability, we selected 49 of the 62 PPIs for experimental validation of the predicted interfaces using the BRET assay. For 30 of the 49 selected PPIs for experimental testing we obtained sequence-confirmed clones with luciferase and mCitrine fusions. For 28 of these PPIs both partners were expressed in our experimental system as determined by total luminescence and fluorescence measurements (Fig. 3D,F).

Software used

We used the software PyMOL (TM) Molecular Graphics System, Version 2.5.0. Copyright (c) Schrödinger, LLC., for the visualization and superimposition of AlphaFold models. All code was written in Python3 and analyses were done using Jupyter notebooks. We used the Python libraries Biopython (Cock et al, 2009) for sequence similarity computation, pandas (McKinney, 2010) for data analysis, and Matplotlib (Hunter, 2007) and seaborn (Waskom, 2021) for data visualization. ROC and PR statistics were calculated using the Python package scikit-learn (Pedregosa et al, 2012).

Cell line culture and maintenance

HEK293 cells were purchased from DSMZ (catalog number ACC305). These cells were grown and maintained in DMEM (Thermo Fisher), supplemented with 10% FBS (PAN-Biotech), 2 mM glutamine (Thermo Fisher) and 1% penicillin–streptomycin (Thermo Fisher). Cells were incubated at 37 °C with 5% CO2. Subcultivation was performed with 1 ml of 0.05% trypsin every 2–3 days for up to 40 passages. For each passage 1–2 × 10^6 cells were seeded in T25 flasks (Sarstedt). Then, new cells were thawed from stocks containing 2 × 10^6 cells in 1 ml of growth medium, supplemented with 10% DMSO (Sigma). Every 3 months cells were checked for mycoplasma contamination using a PCR test (Dataset EV11). The cell line was purchased from DSMZ four years ago, expanded, aliquoted, and frozen. A new aliquot is thawed after every 40 passages. No further authentication of the cell line has been done.

Plasmid construction

Standard controls

The donor and acceptor vectors pcDNA3.1-cmyc-NL-GW (Addgene plasmid ID #113446), pcDNA3.1-GW-NL-cmyc (Addgene plasmid ID #113447), pcDNA3.1 GW-His3C-mCit, pcDNA3.1 mCit-His3C-GW as well as controls pcDNA3.1-NL-cmyc (Addgene plasmid ID #113442), pcDNA3.1-PA-mCit (Addgene plasmid ID #113443) were kindly provided by the Wanker Group (Max-Delbrück-Centrum für Molekulare Medizin, Germany) (Dataset EV12). By default we cloned all ORFs of interest into N-terminal NL and mCit fusion destination vectors and occasionally also transferred ORFs into C-terminal fusion vectors if N-terminal fusions did not result in sufficient BRET signals but the interaction was of high interest to this study and predicted interfaces were closer to the C-terminus. Trepte et al have shown that testing protein pairs in different configurations increases detection rates while maintaining low false detection rates and that BRET signals are higher if fusions are close to the actual interaction interface (Trepte et al, 2018; preprint:Trepte et al, 2021; preprint:Trepte et al, 2023).

GATEWAY cloning procedure

Full-length wild-type human open reading frames (ORFs) cloned in GATEWAY entry vectors from the ORFeome collaboration (ORFeome Collaboration, 2016) are stored as bacterial glycerol stocks.

1. The ORFs were inoculated in 96-well plates (Corning), with each well containing 200 µl of LB medium and 100 µg/ml ampicillin. The plate was incubated at 37 °C and left to shake overnight at 190 rpm.
2. In a 96-well PCR plate (Brand) 10 ng of each selected ORF was used per 50 µl PCR reaction (denaturation at 98 °C for 10 s, annealing at 55 °C for 30 s and extension at 72 °C for 3 min, 30 cycles of amplification) using phusion high-fidelity polymerase (NEB) and primers annealing to the backbone of the plasmid (forward: 5′TTGTAAAACGACGGCCAGTC and reverse: 5′GCCAGGAAACAGCTATGACC).
3. The PCR products (6 µl per well) were confirmed through 96-well E-gel with SYBR (Thermo Fisher, Catalog no G720801) using 25 µl of loading buffer (Thermo Fisher) and 20 µl of E-Gel 96 High range DNA marker (Thermo Fisher).
4. In a 96-well PCR plate 1 µl of each amplified PCR product together with 200 ng of above-mentioned destination vectors were directly used per 10 µl LR reaction using 4x LR clonase (Invitrogen), thereby generating expression vectors.
5. The full 10 µl of LR reaction was transformed into chemically competent DH5α cells (30 µl) in a 96-well PCR plate, then recovered in 80 µl of pre-warmed SOC medium at 37 °C for 1 h without shaking.
6. 70 µl of transformed bacteria was plated on 48-well square agar plates and incubated at 37 °C overnight.
7. Afterwards, colonies were selected and inoculated into a 96 deep-well plate containing 2 ml of LB medium and 100 µg/ml ampicillin. The plate was then incubated at 37 °C with continuous shaking at 700 rpm in the incumixer for 24 h.
8. The amplified vectors were extracted from the inoculated culture using Plasmid Plus 96-well Miniprep kit (Qiagen). The concentration of each vector was measured with a Nanophotometer and diluted to 100 ng/µl. Next, 600 ng of insert was used for full-length sequencing using the backbone primers (tag-specific NanoLuc forward: 5′GAACGGCAACAAAATTATCGAC, mCitrine forward: 5′AGCAGAATACGCCCATCG and reverse: 5′GGCAACTAGAAGGCACAGTC) and ORF-specific primers (Dataset EV11) to fully cover the ORFs where it was needed (Dataset EV12). All sequence-confirmed ORF sequences used in this study are available in Dataset EV13.

Site-directed mutagenesis

The primers were manually designed using the following criteria:

1. For point mutation the primers should overlap the site of mutation. The overlap should be 15–20 nucleotides (nt).
2. For the deletion the primers should be designed to exclude the deletion site, but still overlap and the overlap should be as mentioned in step 1.
3. Primer length should be in the range of 32–36 nt.
4. GC content should be between 40–60%.
5. Difference in melting temperature of primers should not exceed 5 °C.
6. The primer ideally should start and end with guanine or cytosine.
7. The designed oligos were grouped by annealing temperature for the next step.
8. In a 96-well PCR plate 10 ng of DNA template together with oligos were used per 50 µl of PCR reaction (denaturation at 98 °C for 2 min, annealing for 15 s and extension at 72 °C for 5 min, 25 cycles of amplification) using phusion high-fidelity polymerase (NEB).
9. 1 µl of DpnI (NEB) was added to the plate with PCR products and incubated at 37 °C for 1 h. The reaction was stopped at 65 °C for 20 min.
10. The PCR products (6 µl per well) were confirmed through 96-well E-gel with SYBR (Thermo Fisher, Catalog no G720801) using 25 µl of loading buffer (Thermo Fisher) and 20 µl of E-Gel 96 High range DNA marker (Thermo Fisher).
11. 3 µl of digested PCR product was transformed into chemically competent DH5α cells (30 µl) in a 96-well PCR plate, then recovered in 80 µl of pre-warmed SOC medium at 37 °C for 1 h without shaking.
12. 70 µl of transformed bacteria was plated on 48-well square agar plates and incubated at 37 °C overnight.
13. Afterwards, colonies were selected and inoculated into a 96 deep-well plate containing 2 ml of LB medium and 100 µg/ml ampicillin. The plate was then incubated at 37 °C with continuous shaking at 700 rpm in the incumixer for 24 h.
14. The amplified vectors were extracted from the inoculated culture with Plasmid Plus 96-well Miniprep kit (Qiagen). The concentration was measured with a Nanophotometer and diluted to 100 ng/µl. Next, 600 ng of insert was used for full-length sequencing using primers covering the mutation and ORF-specific primers (Dataset EV11) to fully cover the ORF length (Dataset EV12).

BRET assay

Transfection

HEK293 cells were grown and maintained in high-glucose (4.5 g/l) DMEM (Thermo Fisher) for BRET assays. Media was supplemented with 10% fetal bovine serum (PAN-Biotech) and 1% Penicillin/Streptomycin. Cells were grown at 37 °C, 5% CO2, and 85% RH. Cells were subcultured every 2–3 days and transfected with lipofectamine 2000 transfection reagent (Invitrogen) in Opti-MEM medium (Thermo Fisher) using the reverse transfection method according to the manufacturer’s instructions. For transfections, cells were seeded at a density of 4.0 × 10^4 cells per well in a white 96-well microtiter plate (Greiner) in phenol-red-free, high-glucose DMEM media (Thermo Fisher) supplemented with 5% fetal bovine serum (Thermo Fisher). Transfections were performed with a total DNA amount of 200 ng per well. If the expression plasmid amount was below 200 ng/well, pcDNA3.1(+) was used as a carrier DNA to reach the total amount of DNA of 200 ng. All protein pairs were tested in both N-terminal fusion orientations (NL-A with mCit-B and NL-B with mCit-A). The following proteins were also tested as C-terminal fusions: CSNK2B-NL, ESRRG-NL, CUL3-NL, PEX3-NL, PEX19-NL, PSMC5-NL, PEX3-mCit, PEX19-mCit, PEX16-mCit, RORB-mCit, ESRRG-mCit, PAX6-mCit, CSNK2B-mCit, PSMC5-mCit, KCTD7-mCit (Dataset EV12).

Measurement

The plate was incubated 2 days at 37 °C, 5% CO2, and 85% RH before measurements. All measurements were done with the Infinite M200 Pro microplate reader (Tecan). First, 100 µl of the medium was aspirated from each well. The mCitrine fluorescence (FL) was measured in intact cells (excitation/emission 513 nm/548 nm) using a gain of 100. On rare occasions, the plate reader recorded an overflow with these settings (i.e. for GIGYF1 constructs). In these cases, we repeated the measurement with optimal gain settings and used a fluorescein control to normalize fluorescence signals measured with different gain settings. For this purpose, Fluorescein was obtained from Sigma-Aldrich (Catalog No 46955-250MG-F) and used without further purification. A stock solution of Fluorescein (1 mg/ml in Ethanol) was prepared by dissolving 1.3 mg Fluorescein in 1.3 ml absolute ethanol. 100 µl of a 20 µg/ml solution of Fluorescein were added to an empty well immediately before starting the fluorescence measurements. The 20 µg/ml solution of Fluorescein was obtained by preparing a 1:50 dilution in water of the stock solution. After measuring the fluorescence, coelenterazine-h (PJK Biotech GmbH) was added to a final concentration of 5 µM.

Fitting of titration curves

Titration curves were fitted using the leastsq function from the scipy.optimize python package (Virtanen et al, 2020) using the model BRET = ((A/D) * BRETmax)/(BRET50 + (A/D)) described in Drinovec et al (2012), which assumes a 1:1 binding mode, to obtain estimates for the BRETmax and BRET50. Standard errors of the BRET50 estimates were obtained from the variance-covariance matrix, calculated by multiplying the fractional covariance matrix (output by leastsq function) by the residual variance. Measuring BRET signals in intact cells for increasing acceptor/donor protein expression ratios results in an eventual saturation of the signal. Fitting this curve allows extraction of the maximal BRET that can be reached and the BRET50, which is the acceptor/donor ratio at which half of the maximal BRET is obtained. The BRET50 is indicative of binding affinity, in analogy to the IC50. However, its accurate estimation requires saturation of the BRET to be observed in the experimental system, which cannot always be achieved because of limited amounts of DNA that cells can be transfected with. Alternatively, if mutations are unlikely to change the overall structure of the fusion constructs and do not alter expression levels compared to wildtype, single point BRET
The cells were briefly shaken for 15 s measurements at acceptor/donor ratios prior to BRET saturation are and incubated for 15 min inside the plate reader at 37 °C. After also indicative of changes in binding strength. The BRET titration incubation, total luminescence was measured first followed by curves that we obtained for the PNKP-TRIM37 interaction clearly short-wavelength (WL) and long-wavelength luminescence (LU) deviated from the assumed 1:1 binding mode because at higher measurements using the BLUE1 (370–480 nm) and the GREEN1 acceptor:donor ratios we observed a sudden increase in BRET again (520–570 nm) filters at 1000 ms integration time. Corrected BRET contrary to an expected saturation. The model could thus not be fitted ratios were calculated as described in (Trepte et al, 2018). Briefly, to the titration data. for every transfected protein pair NL-A and mCit-B, the following two control pairs were measured: NL-Stop with mCit-B and NL-A Antibodies with mCit-Stop. The maximal BRET from both control pairs was subtracted from the actual test pair to correct for donor Purified anti-HA.11 Epitope Tag, Clone: [16B12], Mouse, Mono- bleedthrough, unspecific binding to the tags, and background clonal (Biolegend, BLD-901502), 1:2000. signal. Purified anti-GIGYF1, Rabbit, Polyclonal (BETHYL labora- tories, Cat. #A304-132A-1), 1:1000. Determination of binding events in BRET assay GAPDH Loading Control Monoclonal Antibody (GA1R), HRP- To determine whether a protein pair interacted in the BRET assay coupled (Thermo Fisher Cat. MA515738HRP), 1:3000. or not, we used donor:acceptor DNA transfection ratios of 2:50 ng in all cases except for PEX3-PEX16 where we used 8:25 and Co-immunoprecipitation and western blot PEX3:PEX19 where we used 8:50 ng DNA ratios due to low expression levels of PEX3 and a degradation effect of higher PEX16 Snrpb (full-length) and C-terminal truncation mutant (amino acids 1- protein levels on PEX3 expression levels. 
We requested that 190) was cloned from mouse cDNA and ligated into pFRT-TO cBRETs determined at these transfection ratios were ≥0.05, destination plasmid using AscI and PacI restriction sites. The constructs fluorescence measurements representing mCitrine fusion expres- additionally contain C-terminal 2xHA and mNeonGreen tags. Flp-In™ sion levels to be ≥500 units, and total luminescence measurements T-REx™ 293 Cell Lines (Thermo Fisher, catalog number: R78007) representing NL fusion expression levels to be ≥50,000. expressing Snrpb endogenously from a single locus were generated according to the manufacturer’s instructions. In brief, pFRT-TO and Saturation assay pOG44 plasmids were co-transfected and hygromycin-resistant colonies For donor saturation experiments various donor DNA amounts (1, were grown, picked and expanded. The Snrpb transgene expression was 2, 4 and 8 ng) encoding NL-fused proteins were co-transfected with validated by western blot, RT-qPCR, and immunofluorescence, which increasing amounts of acceptor DNA (12.5, 25, 50, 100, 200 ng) showed that ectopic Snrpb-HA was expressed at levels highly similar to encoding mCitrine-fused proteins. Fluorescence, total lumines- the endogenous Snrpb protein. cence, and BRET measurements were done as described before. For the co-immunoprecipitation experiments, 8 × 106 cells were BRET measurements were corrected for bleedthrough using NL- seeded in a 10 cm dish. The following day, expression of Snrpb-HA Stop transfections. Fluorescence and total luminescence measure- was induced by adding 0.1 μg/mL Doxycycline (D9891, Sigma ments were corrected for background signal using transfections Aldrich) to the culture medium. Parental cells not expressing any with pcDNA3.1(+) and subsequently used to estimate amounts of HA-tagged transgene were used as a negative control of expressed proteins and to plot acceptor/donor ratios on the x-axis immunoprecipitation. The next morning the cells were harvested of titration plots. 
by scraping in culture media, followed by centrifugation and a 94 Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 – 97 © The Author(s) 122 Downloaded from https://www.embopress.org on August 16, 2024 from IP 2a02:3102:4122:c:49b4:6f2f:b500:6ddf. Chop Yan Lee et al Molecular Systems Biology single wash in ice-cold PBS. The whole cell extract was prepared by Akdel M, Pires DEV, Pardo EP, Jänes J, Zalevsky AO, Mészáros B, Bryant P, 15 min incubation on ice with 0.3 mL of lysis buffer (200 mM NaCl, Good LL, Laskowski RA, Pozzati G et al (2022) A structural biology 50 mM HEPES, pH 7.6, 0.1% IGEPAL, 10 mM MgCl2, 10% community assessment of AlphaFold2 applications. Nat Struct Mol Biol Glycerol, Protease Inhibitor Cocktail (P8340, Sigma Aldrich), 29:1056–1067 Phosphatase Inhibitor (P5726, Sigma Aldrich) followed by 2 cycles Basu S, Wallner B (2016) DockQ: a quality measure for protein-protein docking of sonication in a Bioruptor Plus (30 s on, 30 s off) and models. PLoS ONE 11:e0161879 centrifugation for 20 min at 16,000 × g. The extract was quantified Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, by a Bradford assay and 1 mg was used for immunoprecipitation, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28:235–242 for which the NaCl concentration was adjusted to 100 mM final Braun P, Tasan M, Dreze M, Barrios-Rodiles M, Lemmens I, Yu H, Sahalie JM, concentration by diluting with an equal volume of Lysis Buffer Murray RR, Roncari L, de Smet AS, Venkatesan K, Rual JF, Vandenhaute J, containing 0 mM NaCl. 0.05 mg was set aside as input control (5%). Cusick ME, Pawson T, Hill DE, Tavernier J, Wrana JL, Roth FP, Vidal M (2009) 0.02 mL of Thermo Scientific™ Pierce™ Anti-HA Magnetic Beads An experimentally derived confidence score for binary protein-protein (Thermo Fisher Cat. 13464229) were incubated with 1 mg protein interactions. Nature Methods 6:91–97 extract for 1 h at 4 °C on a rotating wheel. 
The beads were washed Bret H, Andreani J, Guerois R (2023) From interaction networks to interfaces: three times before eluting the immunoprecipitated proteins with Scanning intrinsically disordered regions using AlphaFold2. Preprint at BioRxiv 0.02 mL of 1 x NuPAGE™ LDS Sample Buffer by incubating at 42 °C https://doi.org/10.1101/2023.05.25.542287 for 10 min while shaking at 800 rpm. Another 0.01 mL were used Bronkhorst AW, Lee CY, Möckel MM, Ruegenberg S, de Jesus Domingues AM, for elution, were then combined making a total of 30 μL, which Sadouki S, Piccinno R, Sumiyoshi T, Siomi MC, Stelzl L, Luck K, Ketting RF were transferred to a fresh tube and to which 3 μL of 1 M DTT were (2023) An extended Tudor domain within Vreteno interconnects Gtsf1L and added. Input and immunoprecipitated eluates were then separated Ago3 for piRNA biogenesis in Bombyx mori. EMBO J 42(24):e114072 https:// on a 10% Tris-Glycine SDS PAGE using 1xMOPS buffer, doi.org/10.15252/embj.2023114072 immunoblotted on 0.45 μm PVDF membranes (Tris-Glycin Bryant P, Pozzati G, Elofsson A (2022) Improved prediction of protein-protein Transfer Buffer, 10% Methanol, 300 mA, 1 hour), blocked with interactions using AlphaFold2. Nat Commun 13:1265 5% milk in TBS-0.2% Tween for 30 min at RT. Primary antibodies Buel GR, Walters KJ (2022) Can AlphaFold2 predict the impact of missense were incubated overnight at 4 °C on a rocker followed by washes mutations on structure? Nat Struct Mol Biol 29:1–2 and incubation with secondary HRP-labeled antibodies (1 h at RT Bugge K, Brakti I, Fernandes CB, Dreier JE, Lundsgaard JE, Olsen JG, Skriver K, in 5% milk, TBS-0.2% Tween). Blots were developed using Pierce™ Kragelund BB (2020) Interactions by disorder - a matter of context. Front Mol ECL Western Blotting Substrate (Thermo Fisher Cat. 32209) or Biosci 7:110 SuperSignal West Femto Maximum Sensitivity Substrate Kit Burke DF, Bryant P, Barrio-Hernandez I, Memon D, Pozzati G, Shenoy A, Zhu W, (Thermo Fisher Cat. 
34095) and imaged on a ChemiDoc MP V3 Dunham AS, Albanese P, Keller A et al (2023) Towards a structurally resolved (Bio-Rad). The cell line was authenticated via X-Gal staining, qPCR human protein interaction network. Nat Struct Mol Biol 30:216–225 and Sanger Sequencing. Chang L, Perez A (2023) Ranking peptide binders by affinity with AlphaFold. Angew Chem Int Ed 62:e202213362 Choi SG, Olivet J, Cassonnet P, Vidalain PO, Luck K, Lambourne L, Spirohn K, Data availability Lemmens I, Dos Santos M, Demeret C, Jones L, Rangarajan S, Bian W, Coutant EP, Janin YL, van der Werf S, Trepte P, Wanker EE, De Las Rivas J, Tavernier J, The datasets and computer code produced in this study are Twizere JC, Hao T, Hill DE, Vidal M, Calderwood MA, Jacob Y (2019) available in the following databases: Maximizing binary interactome mapping with a minimal number of assays. - Interaction data: submitted to the IMEx (http:// Nature Communications 10:3907 www.imexconsortium.org) consortium through IntAct (Del Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Toro et al, 2022) and assigned the identifier IM-29904. Hamelryck T, Kauff F, Wilczynski B et al (2009) Biopython: freely available - Computer scripts for data processing and analysis: available at Python tools for computational molecular biology and bioinformatics. GitHub under https://github.com/KatjaLuckLab/AlphaFold_ Bioinformatics 25:1422–1423 manuscript. Dana JM, Gutmanas A, Tyagi N, Qi G, O’Donovan C, Martin M, Velankar S (2019) SIFTS: updated structure integration with function, taxonomy and sequences Expanded view data, supplementary information, appendices are resource allows 40-fold increase in coverage of structure-based annotations available for this paper at https://doi.org/10.1038/s44320-023-00005-6. for proteins. Nucleic Acids Res 47:D482–D489 Davey NE, Van Roey K, Weatheritt RJ, Toedt G, Uyar B, Altenberg B, Budd A, Diella F, Dinkel H, Gibson TJ (2012) Attributes of short linear motifs. 
Mol Peer review information Biosyst 8:268–281 Del Toro N, Shrivastava A, Ragueneau E, Meldal B, Combe C, Barrera E et al A peer review file is available at https://doi.org/10.1038/s44320-023-00005-6 (2022) The IntAct database: efficient access to fine-grained molecular interaction data. Nucleic Acids Res 50(D1):D648–53 Drew K, Lee C, Huizar RL, Tu F, Borgeson B, McWhite CD, Ma Y, Wallingford JB, References Marcotte EM (2017) Integration of over 9000 mass spectrometry experiments builds a global map of human protein complexes. Molecular Ajuh P, Chusainow J, Ryder U, Lamond AI (2002) A novel function for human Systems Biology 13:932 factor C1 (HCF-1), a host protein required for herpes simplex virus infection, in Drinovec L, Kubale V, Nøhr Larsen J, Vrecl M (2012) Mathematical models for pre-mRNA splicing. EMBO J 21:6590–6602 quantitative assessment of bioluminescence resonance energy transfer: © The Author(s) Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 95 123 Downloaded from https://www.embopress.org on August 16, 2024 from IP 2a02:3102:4122:c:49b4:6f2f:b500:6ddf. Molecular Systems Biology Chop Yan Lee et al application to seven transmembrane receptors oligomerization. Front Letunic I, Khedkar S, Bork P (2021) SMART: recent updates, new developments Endocrinol 3:104 and status in 2020. Nucleic Acids Res 49:D458–D460 Durocher D, Taylor IA, Sarbassova D, Haire LF, Westcott SL, Jackson SP, Smerdon Leung AKW, Nagai K, Li J (2011) Structure of the spliceosomal U4 snRNP core SJ, Yaffe MB (2000) The Molecular Basis of FHA Domain:Phosphopeptide domain and its implication for snRNP biogenesis. Nature 473:536–539 Binding Specificity and Implications for Phospho-Dependent Signaling Lu R, Yang P, O’Hare P, Misra V (1997) Luman, a new member of the CREB/ATF Mechanisms. Molecular Cell 6:1169–1182 family, binds to herpes simplex virus VP16-associated host cellular factor. 
Mol Ebersberger S, Hipp C, Mulorz MM, Buchbender A, Hubrich D, Kang HS, Cell Biol 17:5117–5126 Martínez-Lumbreras S, Kristofori P, Sutandy FXR, Llacsahuanga Allcca L, Luck K, Charbonnier S, Travé G (2012) The emerging contribution of sequence Schönfeld J, Bakisoglu C, Busch A, Hänel H, Tretow K, Welzel M, Di Liddo A, context to the specificity of protein interactions mediated by PDZ domains. Möckel MM, Zarnack K, Ebersberger I, Legewie S, Luck K, Sattler M, König J FEBS Lett 586:2648–2661 (2023) FUBP1 is a general splicing factor facilitating 3′ splice site recognition Luck K, Kim D-K, Lambourne L, Spirohn K, Begg BE, Bian W, Brignall R, Cafarelli T, and splicing of long introns. Molecular Cell 83:2653–2672 Campos-Laborie FJ, Charloteaux B et al (2020) A reference map of the human Ernst JA, Brunger AT (2003) High Resolution Structure Stability and binary protein interactome. Nature 580:402–408 Synaptotagmin Binding of a Truncated Neuronal SNARE Complex. Journal of Machida YJ, Machida Y, Vashisht AA, Wohlschlegel JA, Dutta A (2009) The Biological Chemistry 278:8630–8636 deubiquitinating enzyme BAP1 regulates cell growth via interaction with HCF- Evans R, O’Neill M, Pritzel A, Antropova N, Senior AW, Green T, Žídek A, Bates R, 1. J Biol Chem 284:34179–34188 Blackwell S, Yim J et al (2021) Protein complex prediction with AlphaFold- Matsuzaki T, Fujiki Y (2008) The peroxisomal membrane protein import receptor Multimer. Preprint at BioRxiv https://doi.org/10.1101/2021.10.04.463034 Pex3p is directly transported to peroxisomes by a novel Pex19p- and Pex16p- Firth HV, Wright CF, DDD Study (2011) The deciphering developmental disorders dependent pathway. J Cell Biol 183:1275–1286 (DDD) study. Dev Med Child Neurol 53:702–703 McKinney W (2010) Data structures for statistical computing in python. In Freiman RN, Herr W (1997) Viral mimicry: common mode of association with Proceedings of the 9th Python in Science Conference pp 56–61. 
SciPy HCF by VP16 and the cellular protein LZIP. Genes Dev 11:3122–3127 Mishra M, Jiang H, Wei Q (2023) New insights on the differential interaction of Freund C, Kühne R, Yang H, Park S, Reinherz EL, Wagner G (2002) Dynamic sulfiredoxin with members of the peroxiredoxin family revealed by protein- interaction of CD2 with the GYF and the SH3 domain of compartmentalized protein docking and experimental studies. Eur J Pharmacol 954:175873 effector molecules. EMBO J 21:5985–5995 Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Fujiki Y, Matsuzono Y, Matsuzaki T, Fransen M (2006) Import of peroxisomal Tosatto SCE, Paladin L, Raj S, Richardson LJ et al (2021) Pfam: the protein membrane proteins: the interplay of Pex3p- and Pex19p-mediated interactions. families database in 2021. Nucleic Acids Res 49:D412–D419 Biochim Biophys Acta 1763:1639–1646 Miyata T, Miyazawa S, Yasunaga T (1979) Two types of amino acid substitutions Fujiki Y, Okumoto K, Honsho M, Abe Y (2022) Molecular insights into in protein evolution. J Mol Evol 12:219–236 peroxisome homeostasis and peroxisome biogenesis disorders. Biochim Mo X, Niu Q, Ivanov AA, Tsang YH, Tang C, Shu C, Li Q, Qian K,Wahafu A, Doyle Biophys Acta Mol Cell Res 1869:119330 SP, Cicka D, Yang X, Fan D, Reyna MA, Cooper LAD, Moreno CS, Zhou W, Henrie A, Hemphill SE, Ruiz-Schultz N, Cushman B, DiStefano MT, Azzariti D, Owonikoko TK, Lonial S, Khuri FR, Du Y, Ramalingam SS, Mills GB, Fu H Harrison SM, Rehm HL, Eilbeck K (2018) ClinVar Miner: demonstrating utility (2022) Systematic discovery of mutation-directed neo-protein-protein of a Web-based tool for viewing and filtering ClinVar data. Hum Mutat interactions in cancer. Cell 185:1974–1985 39:1051–1060 Mosca R, Céol A, Aloy P (2013) Interactome3D: adding structural details to Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng protein networks. 
Nat Methods 10:47–53 9:90–95 Mosca R, Céol A, Stein A, Olivella R, Aloy P (2014) 3did: a catalog of domain- Huttlin EL, Bruckner RJ, Navarrete-Perea J, Cannon JR, Baltier K, Gebreab F, Gygi based interactions of known three-dimensional structure. Nucleic Acids Res MP, Thornock A, Zarraga G, Tam S et al (2021) Dual proteome-scale 42:D374–9 networks reveal cell-specific remodeling of the human interactome. Cell O’Reilly FJ, Graziadei A, Forbrig C, Bremenkamp R, Charles K, Lenz S, Elfmann C, 184:3022–3040.e28 Fischer L, Stülke J, Rappsilber J (2023) Protein complexes in cells by AI- Jehl P, Manguy J, Shields DC, Higgins DG, Davey NE (2016) ProViz-a web-based assisted structural proteomics. Mol Syst Biol 19:e11544 visualization tool to investigate the functional and evolutionary features of ORFeome Collaboration (2016) The ORFeome Collaboration: a genome-scale protein sequences. Nucleic Acids Res 44:W11–5 human ORF-clone resource. Nat Methods 13:191–192 Johansson-Åkhe I, Mirabello C, Wallner B (2021) Interpeprank: assessment of Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, docked peptide conformations by a deep graph network. Front Bioinform Müller A, Nothman J, Louppe G et al (2012) Scikit-learn: Machine Learning in 1:763102 Python. arXiv Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Persson E, Sonnhammer ELL (2023) InParanoiDB 9: ortholog groups for protein Tunyasuvunakool K, Bates R, Žídek A, Potapenko A et al (2021) Highly domains and full-length proteins. J Mol Biol 435:168001 accurate protein structure prediction with AlphaFold. Nature Pozzati G, Zhu W, Bassot C, Lamb J, Kundrotas P, Elofsson A (2022) Limits and 596:583–589 potential of combined folding and docking. 
Bioinformatics 38:954–961 Krystkowiak I, Davey NE (2017) SLiMSearch: a framework for proteome-wide Schmidt F, Treiber N, Zocher G, Bjelic S, Steinmetz MO, Kalbacher H, Stehle T, discovery and annotation of functional modules in intrinsically disordered Dodt G (2010) Insights into peroxisome function from the structure of PEX3 in regions. Nucleic Acids Res 45:W464–W469 complex with a soluble fragment of PEX19. J Biol Chem 285:25410–25417 Kumar M, Michael S, Alvarado-Valverde J, Mészáros B, Sámano-Sánchez H, Zeke Sobti M, Mead BJ, Stewart AG, Igreja C, Christie M (2023) Molecular basis for A, Dobson L, Lazar T, Örd M, Nagpal A et al (2022) The Eukaryotic Linear GIGYF–TNRC6 complex assembly. RNA 29:724–734 Motif resource: 2022 release. Nucleic Acids Res 50:D497–D508 Teufel F, Refsgaard JC, Kasimova MA, Deibler K, Madsen CT, Stahlhut C, Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic Grønborg M, Winther O, Madsen D (2023) Deorphanizing peptides using character of a protein. J Mol Biol 157:105–132 structure prediction. J Chem Inf Model 63:2651–2655 96 Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 © The Author(s) 124 Downloaded from https://www.embopress.org on August 16, 2024 from IP 2a02:3102:4122:c:49b4:6f2f:b500:6ddf. Chop Yan Lee et al Molecular Systems Biology Tompa P, Davey NE, Gibson TJ, Babu MM (2014) A million peptide motifs for the a PhD stipend from IMB’s collaborative research initiative. JKV was supported molecular biologist. Mol Cell 55:161–169 by the European Union’s Horizon 2020 UBIMOTIF programme (860517). This Trepte P, Kruse S, Kostova S, Hoffmann S, Buntru A, Tempelmeier A, Secker C, work was supported, in whole or in part, by the Israel Science Foundation, Diez L, Schulz A, Klockmeier K et al (2018) LuTHy: a double-readout founded by the Israel Academy of Science and Humanities (grant number 301/ bioluminescence-based two-hybrid technology for quantitative mapping of 2021 to OS-F). 
protein-protein interactions in mammalian cells. Mol Syst Biol 14:e8071 Trepte P, Secker C, Choi SG, Olivet J, Ramos ES, Cassonnet P, Golusik S, Zenkner Author contributions M, Beetz S, Sperling M et al (2021) A quantitative mapping approach to Chop Yan Lee: Data curation; Formal analysis; Investigation; Visualization; identify direct interactions within complexomes. Preprint at BioRxiv https:// Methodology; Writing—original draft; Project administration; Writing—review doi.org/10.1101/2021.08.25.457734 and editing. Dalmira Hubrich: Data curation; Formal analysis; Investigation; Trepte P, Secker C, Kostova S, Maseko SB, Choi SG, Blavier J, Minia I, Ramos ES, Visualization; Methodology; Writing—original draft; Writing—review and Cassonnet P, Golusik S et al (2023) AI-guided pipeline for protein-protein editing. Julia K Varga: Data curation; Formal analysis; Investigation; interaction drug discovery identifies a SARS-CoV-2 inhibitor. Preprint at Visualization; Writing—original draft; Writing—review and editing. Christian BioRxiv https://doi.org/10.1101/2023.06.14.544560 Schäfer: Data curation; Investigation; Methodology. Mareen Welzel: Tsaban T, Varga JK, Avraham O, Ben-Aharon Z, Khramushin A, Schueler-Furman Investigation. Eric Schumbera: Methodology. Milena Djokic: Data curation. O (2022) Harnessing protein folding neural networks for peptide-protein Joelle M Strom: Formal analysis; Investigation; Visualization. Jonas Schönfeld: docking. Nat Commun 13:176 Investigation. Johanna L Geist: Investigation. Feyza Polat: Investigation. Toby J Van Roey K, Gibson TJ, Davey NE (2012) Motif switches: decision-making in cell Gibson: Resources; Supervision; Writing—review and editing. Claudia Isabelle regulation. Curr Opin Struct Biol 22:378–385 Keller Valsecchi: Supervision; Funding acquisition; Investigation; Writing— Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, review and editing. 
Manjeet Kumar: Resources; Formal analysis; Methodology; Stroe O, Wood G, Laydon A et al (2022) AlphaFold Protein Structure Writing—review and editing. Ora Schueler-Furman: Conceptualization; Database: massively expanding the structural coverage of protein-sequence Supervision; Funding acquisition; Writing—original draft; Writing—review and space with high-accuracy models. Nucleic Acids Res 50:D439–D444 editing. Katja Luck: Conceptualization; Data curation; Formal analysis; Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Supervision; Funding acquisition; Investigation; Visualization; Methodology; Burovski E, Peterson P, Weckesser W, Bright J et al (2020) SciPy 1.0: Writing—original draft; Project administration; Writing—review and editing. fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272 Disclosure and competing interest statement Waskom M (2021) seaborn: statistical data visualization. JOSS 6:3021 The authors declare no competing interests. Weatheritt RJ, Jehl P, Dinkel H, Gibson TJ (2012) iELM-a web server to explore short linear motif-mediated interactions. Nucleic Acids Res 40:W364–W369 Open Access This article is licensed under a Creative Commons Attribution 4.0 Zhao G, Li K, Li B, Wang Z, Fang Z, Wang X, Zhang Y, Luo T, Zhou Q, Wang L International License, which permits use, sharing, adaptation, distribution and et al (2020) Gene4Denovo: an integrated database and analytic platform for reproduction in any medium or format, as long as you give appropriate credit to de novo mutations in humans. Nucleic Acids Res 48:D913–D926 the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party Acknowledgements material in this article are included in the article’s Creative Commons licence, We thank all members of the Luck, Gibson, and Schueler-Furman labs as well unless indicated otherwise in a credit line to the material. 
If material is not as Julian König and Anton Khmelinskii for helpful discussions and input. We included in the article’s Creative Commons licence and your intended use is not thank Izabella Krystkowiak and Norman Davey for helping us access the permitted by statutory regulation or exceeds the permitted use, you will need to SLiMSearch resource with an API. We thank Fridolin Kielisch for advice on obtain permission directly from the copyright holder. To view a copy of this statistical analysis as well as the media lab and protein production core licence, visit http://creativecommons.org/licenses/by/4.0/. Creative Com- facilities of IMB. Support from IMB’s IT department and especially help from mons Public Domain Dedication waiver http://creativecommons.org/public- Christian Dietrich for local installations of AlphaFold is gratefully domain/zero/1.0/ applies to the data associated with this article, unless acknowledged. The GPU cluster on which part of the AlphaFold predictions otherwise stated in a credit line to the data, but does not extend to the graphical were performed was funded by the Ministry of Science and Health (MWG), or creative elements of illustrations, charts, or figures. This waiver removes legal Rhineland Palatinate (funding ID: TB-Nr.:3658/19). We are very thankful for barriers to the re-use and mining of research data. According to standard support from EMBL IT Services and the HPC resources for running AlphaFold scholarly practice, it is recommended to provide appropriate citation and predictions for this project. This work is funded by the Deutsche attribution whenever technically possible. Forschungsgemeinschaft (DFG, German Research Foundation) – Project-IDs LU 2568/1-1 and SFB1551 Project No 464588647 awarded to KL. 
JS acknowledges © The Author(s) 2024 © The Author(s) Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 97 125 Downloaded from https://www.embopress.org on August 16, 2024 from IP 2a02:3102:4122:c:49b4:6f2f:b500:6ddf. 2.4.1 Supplementary material 126 Appendix Systematic discovery of protein interaction interfaces using AlphaFold and experimental validation Chop Yan Lee1,†, Dalmira Hubrich1,†, Julia K. Varga2,†, Christian Schäfer1, Mareen Welzel1, Eric Schumbera3, Milena Đokić1, Joelle M. Strom1, Jonas Schönfeld1, Johanna L. Geist1, Feyza Polat1, Toby J. Gibson4, Claudia Isabelle Keller Valsecchi1, Manjeet Kumar4, Ora Schueler-Furman2,*, Katja Luck1,** Affiliations 1 Institute of Molecular Biology (IMB) gGmbH, 55128 Mainz, Germany. 2 Department of Microbiology and Molecular Genetics,Institute for Biomedical Research Israel-Canada, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel. 3 Institute of Molecular Biology (IMB) gGmbH, 55128 Mainz, Germany. Current address: Computational Biology and Data Mining Group Biozentrum I 55128 Mainz, Germany. 4 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, 69117, Germany. *Corresponding author. Tel: +972-2-675-7094, E-mail: ora.furman-schueler@mail.huji.ac.il **Corresponding author. Tel: +49-(0)6131-3921440, E-mail: k.luck@imb-mainz.de †These authors contributed equally to this work. Table of content Appendix Text S1. Summary of observations from the manual inspection of AlphaFold models generated from fragmentation approach on PPIs connecting NDD proteins. Appendix Figure S1. Benchmarking of AF on DMI interfaces using minimal interacting regions. Appendix Figure S2. Benchmarking and application of AF for DMI interface prediction using minimal interacting fragments. Appendix Figure S3. Effect of protein fragment extensions on the accuracy of AF predictions. Appendix Figure S4. 
Effect of protein fragment extensions on the accuracy of AF predictions. 127 Appendix Figure S5. Comparison of AF v2.2 and v2.3 prediction performance. Appendix Figure S6. Performance of different metrics derived from structural models when benchmarking AF v2.3 for DMI predictions. Appendix Figure S7. Expression and BRET50 plots for TRIM37-PNKP and ESRRG- PSMC5. Appendix Figure S8. Structural models, expression, and BRET50 plots for STX1B-FBXO28 and STX1B-VAMP2. Appendix Figure S9. Structural models, expression, and BRET50 plots for PEX3-PEX19 and PEX3-PEX16. Appendix Figure S10. Expression and BRET50 plots for SNRPB-GIGYF1. 128 Appendix Text S1. Summary of observations from the manual inspection of AlphaFold models generated from fragmentation approach on PPIs connecting NDD proteins. run14: PLP1-MFF Top prediction involves an ordered region from PLP1 and a disordered fragment from MFF, with a model confidence of 0.75. Looking at the predicted model, the peptide is tilted at an angle to the bundle of helices of PLP1, not like the usual coiled-coil interaction. No trend in increasing confidence with shorter fragments too. The interface does not look very convincing. While the disordered region in MFF is likely to be a functional motif, the 4-helix bundle domain in PLP1 that AF models it to bind to is known to be a transmembrane domain, so the binding site is actually buried inside the membrane. AF is also not very confident about the domain structure, especially for the parts that are at the membrane surface or outside of it. The prediction is likely wrong. run17: PAX6-CSNK2A1 CSNK2A is a widely active kinase, involved in many processes. Overlapping fragments from PAX6 show trend of increasing confidence the shorter the fragment. CSNK2A1 is predicted to bind with its kinase domain (it doesn’t really has anything else than the kinase domain) to a peptide in PAX6 which seems to be a good looking linear motif, i.e. 
conserved, not part of a folded domain as predicted by AF and predicted by AF to form an alpha helix. The motif, though, overlaps with a putative NLS. The PAX6 motif is clearly predicted to bind to a pocket at the bottom of the N-lobe of the kinase domain, away from the catalytic site. Digging deeper, I found a structure, 1JWH, which shows that this is the pocket bound by CSNK2B, the regulatory subunit, which interacts with the catalytic subunit to form an active holoenzyme. This, however, does not eliminate the possibility that the AF prediction is right, since the peptide looks like a functional motif.

run18: PAX6-SET
Top prediction is ordered-ordered, PAX6 Homeodomain and SET NAP domain. The structure 6PAX shows the PAX domain, which consists of two folds similar to the homeodomain, bound to DNA, but the three-helix bundles are not oriented in exactly the same way as in the homeodomain, so I have a hard time seeing where the homeodomain would bind DNA. AF models the homeodomain interface with the NAP domain of SET via a charged interface, with many positively charged residues on the homeodomain contacting a patch of negatively charged residues on the NAP domain. It could be that this patch of positively charged residues on the homeodomain would usually interact with the negatively charged backbone of DNA, but the predicted structure from AF looks interesting since the interface likely does not interfere with SET homodimerization (2E50).

run19: PAX6-TLK2
All predictions with >0.7 model confidence are paired with the Pkinase domain of TLK2 and they are all predicted to bind at the bottom of the beta barrel fold (N-lobe) of the kinase domain. However, almost all peptides come from very different regions in PAX6, so there are no recurrent predictions here. When looking at the motif pLDDT metric, the top predictions also involve two distinct motifs predicted to bind to the long helices in TLK2.
However, AF predicts the two helices to form intramolecular contacts. By taking them apart into separate fragments, it could be that intramolecular contact sites are now used for interface prediction. A DMI is predicted for this protein pair, MOD_GSK3_1 (PAX6 395-402). The peptide PAX6 394-404 was paired with the Pkinase domain but, similar to the previous point, it is also placed at the beta barrel fold in the N-lobe and not at the substrate binding site.

run20: PAX6-NGLY1
The PUB domain from Q96IV0, NGLY1, gives good model confidence, >0.8, in binding overlapping disordered fragments of P26367, PAX6. The PUB domain has been solved before alone (2CCQ), and the catalytic domain has also been solved bound to RAD23 (2F4M). In the paper that published the PUB domain structure (Allen et al JBC 2006, 10.1074/jbc.M601173200) the authors also did some mutational analysis to show that there is an interface on the PUB domain that binds the AAA ATPase domain of p97, but the experimental evidence does not look very convincing. Indeed, AF modelled the peptide from PAX6 to bind to an interface adjacent to the one found by Allen et al. There is a hydrophobic pocket and the best 4 predictions comprise that peptide binding to this pocket; however, which hydrophobic residue of the peptide is docked into the pocket varies depending on the length of the peptide. I think that this region in PAX6 could indeed be a linear motif; it is adjacent to the homeobox domain but I do not think that it is part of the homeobox domain.

run21: PAX6-ESRRG
Many short fragments with high model confidence are scattered over the disordered region. The binding pocket on ESRRG is in the hormone receptor domain and is a known binding pocket for L..LL motifs (ELMDB: LIG_NRBOX).
According to ELMDB, the first and last L go into a hydrophobic pocket, and all fragments of PAX6 with high model confidence have more or less two hydrophobic amino acids with three residues in between: PAX6 319-329: DTALTNTYSA, PAX6 203-213: RLQLKRKLQR, PAX6 374-384: PPHMQTHMNS, PAX6 198-208: DEAQMRLQLK, PAX6 128-148: GADGMYDKLR. Looking at structures of ESRRG with two different bound peptides (1KV6 and 1TFC: NCOA1 686-700, RHKILHRLLQEGSPS; 2GPO and 2GPP: NRIP1 378-387, SLLLHLLKSQ), it furthermore became apparent that the hydrophobic residues right before both leucines are also important for binding since they contact a hydrophobic patch on the other side of the pocket. However, none of the AlphaFold-predicted motifs really fit, so it is questionable whether they can actually bind the pocket. Structurally speaking, the peptide does not fit that nicely in the hydrophobic pocket. In 2GPO and 2GPV, there is a triad of hydrophobic residues (V/L/I) making contact with the hydrophobic pocket on the domain, but here only 2 residues are making contact. Therefore, it seems doubtful to me that this is a motif that can bind to the domain.

run22: PAX6-QRICH1
Difficult to dig deeper because QRICH1 has only one domain (DUF), which binds to a C-terminal peptide from PAX6. The high confidence peptide is 20 aa long and looks good, with 0.88 model confidence. The same DUF is also modelled with 0.76 confidence with a very long disordered region (85 aa) at the N-terminus of PAX6. However, the predicted complex of this disordered region is quite odd, as it has many twists and turns that seem weird to me. Overall, these predictions look good but it is hard to be very certain about them because nothing is known about the domain in QRICH1, and PAX6 has a long disordered C-terminal region full of S, T, but also some Ps and hydrophobics.

run23: PAX6-KCTD7
The top prediction involves the disordered region of PAX6 (198-208) and the BTB_2 domain of KCTD7, with 0.74 model confidence.
There is no trend of increasing confidence when fragments shorten. InterPro describes this domain as one that multimerises for its protein function, e.g. KCTD1 as a transcriptional repressor (3DRX, which solves KCTD5, a protein with a similar fold but shorter in length). Since the BTB domain mediates the multimerisation of KCTD, it could be that it requires a certain stoichiometry for binding to its partner. In the HuRI database, KCTD7 was indeed detected to interact with itself. The two highest predicted models put both peptides into the same pocket and both peptides have some sequence similarity, albeit from different regions in PAX6. These peptides were also predicted with high model confidence in other runs. Based on the structure 5FTA, BTB domains in their homodimerized form do expose the surface predicted in the top prediction. Therefore, the surface predicted to bind to the peptide would be available. Taken together, the prediction looks plausible.

run24: TTC19-FH
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).

run25: PEX3-PEX16
PEX3 and PEX16 are two proteins that seem to cooperate to help insert new peroxisome membrane proteins (PMPs) into the peroxisome membrane. They do so via interaction between PEX3 and PEX19. PEX19 brings the PMPs to the peroxisome, where PEX3 and PEX16 sit and then mediate further insertion of the cargo (this is described in the review Smith and Aitchison 2013 Nature Rev Mol Cell Biol in Fig 2). However, there is also a study that describes how PEX16 localizes to the ER and from there traffics to peroxisomes (Kim et al JCB 2006). The structure 3AJB that has been solved for the interaction between PEX3 and PEX19 was published by Sato et al EMBO J 2010 and describes how an N-terminal SLiM in PEX19 binds to the domain of PEX3.
They tried to crystallize the whole protein of PEX3 but only observed residues 52-368. The domain has the exact same fold as predicted by AF. The predicted cytosolic and peroxisomal localization of protein regions and the two TM helices that are shown in UniProt seem to be wrong for PEX3 according to work cited in Sato et al. They summarize that the N-terminal region of PEX3 contains a targeting signal or anchor for PEX3 to the peroxisomal membrane, followed by the domain that is located in the cytosol. No structure has been solved yet for PEX16 but it seems likely that the prediction of two TM helices shown in UniProt for this protein is also wrong. AF predicts a globular domain containing the two TM helices, with a nicely exposed loop that carries the putative SLiM that AF predicted to bind to PEX3. It binds onto PEX3 on the side opposite to PEX19 binding, so PEX19 and PEX16 could bind simultaneously to PEX3. Further work on these interactions can be done by submitting the three protein sequences to AF to see what it does. Some other study observed interaction between PEX3 and PEX16 according to UniProt, but the interface does not seem to have been looked at before, nor the interaction studied in detail. All the fragments that contain the putative SLiM in PEX16 are predicted to bind to PEX3 in the exact same way, always anchored via a conserved region in PEX16 between residues 160 and 190. Interestingly, the most conserved residues are also those that seem most important for binding. This smells really good.

run26: PEX3-PEX19
This is a positive control interaction since the structure has been solved for this PPI (3AJB) and it is a well-known and well-studied PPI with an entry in the ELM DB: LIG_Pex3_1 (L..LL...L..F). This ELM instance is indeed predicted by AF with the highest model confidence. Another peptide from PEX19 121-141, FTSCLKETLSGL, scored equally high model confidence.
It could be that this other predicted binding site is also true but I believe that it is rather an artefact of AF’s insensitivity to mutations.

run27: GABARAPL2-UBA5
The structure 6H8C shows binding of the GABARAPL2 domain to the LIR motif in UBA5. This motif is not listed on the ELM website for LIG_LIR_Gen_1 because it does not quite fit the regular expression, which seems to be defined too narrowly. AlphaFold correctly predicts this interface, but only as the third highest based on model confidence, just hitting the cutoff of 0.7, while using chainB_inf_avg_plddt it scores as the fourth best prediction, far below the cutoff (67). However, AF recurrently finds peptides including the motif following each other when ranked by model confidence or pLDDT. The top three motifs predicted to bind to GABARAPL2 do not find the hydrophobic pocket that is filled by a key big hydrophobic residue in the motif, and these peptides are also not recurrently predicted. So, I think these are wrong predictions.

run28: GABARAPL2-LZTR1
GABARAPL2 (P60520) has an Atg8 domain that is known to bind motifs (LIG_LIR_Gen_1). The domain is modelled with high confidence to bind to different disordered fragments of the interacting partner LZTR1 (Q8N653). The second most confident model (when ranked by model confidence) has an aromatic residue tucked into a deep pocket and a branched aliphatic residue tucked into another, shallow pocket. The most confident model shows a nice trend of increasing model confidence as fragments get shorter, with the shortest one getting the highest confidence, but it does not have an aromatic residue fitting into the deep pocket as is known for LIG_LIR motifs. Looking at the structure 2LUE, the second top model LZTR1 46-52 GPFETVH looks more similar in sidechain positioning compared to 2LUE. Residues highlighted in bold get tucked into the mentioned pockets.
This model seems more likely to be true than the best model. However, it is also predicted to bind in reverse order compared to structure 3WIM.

run29: CUL3-KCTD7
There is an ordered-ordered prediction with quite high confidence (0.66) but the contact interface is a tetramerization domain from KCTD7. Therefore it seems unlikely that it is a functional interface. Two N-terminal disordered fragments from KCTD7 reach > 0.7 model confidence when paired with the Cullin domain of CUL3. These two fragments are modelled to bind at the same site of the Cullin domain (the site where RING proteins bind, 1LDJ). In the case of 1LDJ, the RING protein has a long disordered region inserted into the Cullin domain of CUL1, burying a series of hydrophobic residues in the long disordered region. However, the same binding site of the Cullin domain of CUL3 is a bit different, with more surface exposed than in CUL1. In this case, the contacts modelled in KCTD7 16-26, with a triple Serine making contact with the Cullin domain, look plausible. The other high confidence peptide KCTD7 1-11, with a triple Valine making contact with the Cullin domain, also looks plausible to me. In the structure 1LDJ it is really amazing how the partner protein interacts with CUL1 via beta-sheet augmentation and how this extra beta strand becomes an integral part of the fold, sitting more or less in the middle of the domain. I think AlphaFold senses that there is something missing and is trying to put a peptide there, but the overall conformation of the domain is also different in places, so that the predicted peptide does not sit at the same position as the one shown in 1LDJ. AlphaFold predicts two different motifs of very different sequence from the N-terminus of KCTD7 to bind there. Given how different the sequences are, this adds another negative point that calls the specificity of these predictions into question.
run30: PNKP-SYP
Top prediction is a disordered fragment from SYP (7-19) paired with the kinase domain of PNKP. The binding surface is different from the nucleotide binding surface (1RC8). This binding interface looks plausible. It was later found, based on published structures, that the kinase and phosphatase domains form a structural unit. The run was therefore modified to use the kinase and phosphatase domains together as an ordered region for prediction with disordered fragments of SYP. The rerun with a fragment comprising the phosphatase and kinase domains resulted in one prediction that makes the cutoff. This prediction puts a motif from SYP into the DNA binding pocket of the kinase domain (according to Bernstein et al Mol Cell 2005, 1RC8). There is another prediction docking a peptide from SYP into the FHA domain of PNKP. It puts it where FHA domains bind their phosphorylated peptides but the SYP peptide has no Ser or Thr.

run31: PNKP-TRIM37
The first prediction involving the combined kinase-phosphatase structure puts a peptide of TRIM37 into the binding pocket where the phosphatase domain would bind single-stranded DNA. Next is a prediction that involves a disordered region in PNKP binding to the surface of the MATH domain of TRIM37 where MATH domain-binding peptides generally bind. The PNKP peptide differs slightly in sequence from the regular expression patterns described for MATH domains in the ELM database. This peptide in PNKP has a known phosphorylation site that stabilizes PNKP protein levels, making the peptide very interesting since this suggests a regulatory role of phosphorylation on the peptide. There is a second peptide of PNKP predicted to bind to the MATH domain, also with high confidence, but the sequence is quite different from the first one and very close to the phosphatase domain.
There is also a prediction where the FHA domain of PNKP is predicted to bind to a peptide of TRIM37, but the peptide looks very different from known FHA-binding motifs (peptides with phosphorylated threonines), which is of course difficult for AF to predict.

run32: PNKP-XRCC4
In the XRCC4-PNKP prediction, there is a peptide from XRCC4 that binds to the phosphatase domain with high confidence. However, I am not sure if this is right because it could be a false prediction of a small peptide easily fitting into the catalytic site of the phosphatase domain. There is a Serine in the peptide, so it is possible that this is where the phosphate group gets cleaved off by the phosphatase domain. After checking further, it was found that XRCC4 is known to bind to PNKP via a phosphorylated motif that binds to the FHA domain in PNKP. In principle, it would be better to make a rerun where the kinase and phosphatase domains are taken as one fragment since they form one structural unit, but I think in this case it would not have changed anything. The best prediction put a peptide from XRCC4 into the pocket of the phosphatase domain where it would bind the single-stranded DNA as seen in 3U7G. Among the first 9 predictions AF put 7 different peptides from XRCC4 into the phosphatase; the others go to the kinase domain. The first prediction that involves the FHA domain of PNKP and contains the FHA-binding motif in the sequence fragment of XRCC4 has a confidence score of 60 and does not put the FHA-binding motif in the pocket but another negatively charged peptide in the sequence (the FHA pocket is very positively charged). The correct prediction where AF puts the FHA-binding motif in the right pocket has a confidence score of 0.58.

run33: TNPO3-GCH1
Top prediction involves the disordered region of GCH1 (16-26) and the superhelical structure of TNPO3, with model confidence 0.71.
TNPO3 (transportin) is known to transport cargo into the nucleus and to release the cargo via the competitive binding of GTP-bound Ran (2X19); the peptides from GCH1 are modelled at a binding site near where Ran binds in 2X19. The placement of the peptides is therefore biologically sound. The binding site of the peptides from GCH1 is also lined with many arginines, making it very positively charged. The contact modelled by AF in the top prediction looks good, with many charge-charge interactions at the interface. The N-terminus of GCH1 has many conserved prolines, with three repeats of PAEK or PEAK and two repeats of PPRP.

run34: TNPO3-CAMK2G
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).

run35: GNAI3-GPSM2
This interaction has been structurally solved (4G5S) and AlphaFold predicted the interface 100% accurately. GPSM2 has multiple GoLoco motifs that AlphaFold predicts individually with high confidence to bind in the pocket on GNAI3.

run36: SYT1-MIP
Both are transmembrane proteins. The top prediction involves the linker between the two C2 domains of SYT1 and the MIP domain of MIP. The MIP domain is also known as the aquaporin domain (transmembrane). However, when the linker is fragmented, it receives lower confidence. I think this is unlikely to be the interaction interface. The linker could be a motif for some other interaction because of its moderately high plddt. There is a structure of a homodimer of SYT1, 2R83, which shows that both C2 domains of one chain are actually interacting with each other and that the linker between both domains interacts with one domain.
It is this linker where AF predicts that a peptide would bind to the porin domain of MIP. Interestingly, AF predicts the two C2 domains to be independent from each other in the monomeric structure of SYT1, so either AF is wrong or crystallization introduced the packing of both domains against each other. I would rather believe the X-ray structure, and in this case the peptide would not be accessible to bind to the MIP domain.

run37: FTSJ1-CERT1
The FTSJ domain of FTSJ1 is known to bind S-adenosylmethionine (see structure 1EJ0). The top predictions all look very different in that different regions or partially overlapping regions of CERT1 are docked into different sites on the FTSJ domain. Sometimes the peptide is docked into the catalytic pocket where the protein methylates adenosines on tRNAs, but the peptide is also docked elsewhere. Because of these ambiguities, I believe that the predictions are questionable since they seem to lack specificity, but I do not think we can call them definitely wrong. Another interface was found involving CERT1 368-388, with model confidence 0.70. However, the contacts modelled are mostly backbone-to-backbone. I have previously noticed that AF tends to give higher confidence to complexes modelled with secondary structure. So I think this is also likely a false interface.

run38: CAMK2A-SOX5
The kinase domain of CAMK2A paired with a disordered fragment was predicted with high confidence. The structure predicted by the highest confidence model is weird, with both beta sheet and helix structure. The kinase domain of CAMK2A is likely a serine/threonine kinase, and in kinase domain predictions one has to be careful with the two lobes that bind substrate and ATP. It might be interesting to check other high scoring peptides to see if they have an S/T that can be phosphorylated and to check the crystal structure to find the substrate binding pocket.
The two highest scoring peptides do not look convincing because the first one has no S/T in the peptide but is fitted into the catalytic cleft, while the second one has the sidechain of a T positioned out of the cleft. The third highest scoring peptide (P35711 131-141) looks nice because it positions the sidechain of an S into the catalytic cleft. The highest ranking peptides essentially come from all over SOX5 and I do not think that AF can predict kinase-substrate interactions very well. Overall, none of the high-scoring predictions look very convincing.

run39: CAMK2A-CAMK2G
There are many high confidence predictions involving different regions in the protein pair. Among them, one ordered-ordered interface gives a really high confidence. The interface is a known DDI in 3did with a high zscore. The structure 3SOA only shows one CAMK2A monomer but the publication talks about a dodecamer, for which one can download a model from the PDB as well. Looking at this dodecamer and the paper, it becomes clear that downstream of the kinase domain there is another domain, referred to as the hub domain in the paper, which mediates oligomerization together with the linker between the kinase and hub domain. The best AF prediction for the interaction between CAMK2A and CAMK2G involves both hub domains and is an accurate prediction of the interface seen in the dodecamer. The second best prediction made by AF involves the hub domain and a bit of the linker sequence from the other partner. Looking at the dodecamer, one can see that the site where the peptide is predicted to bind on the hub domain is bound by part of the linker sequence from the same monomer, i.e. an intra-molecular interaction. So, there is indeed some binding site but not for inter-molecular interaction.
Because the linker sequences are different in the structure and the canonical UniProt sequence, it is very difficult to know which part of the linker is binding on the hub domain and whether this corresponds to the bit of the linker sequence predicted by AF to bind there. In the paper accompanying the 3SOA structure they also investigate how different linker sequences from different isoforms influence Ca-binding site accessibility and thus activation of the complex. There is evidence from 3 other studies, based on co-IP experiments, that CAMK2G and CAMK2A interact with each other, but these were large-scale studies. It is likely that no one has studied the interface between CAMK2G and CAMK2A, and it would thus be something new.

run40: ACTB-ACTG1
The two actin proteins are predicted to have a high confidence DDI. The interface itself that is predicted by AlphaFold looks very interesting; it indeed looks like a polymerization interface because both domains interact with opposite sites. Interactome3D would model this interaction with the structure 4JHD as a template but this one looks quite different: it is not the same interface and, according to the authors, needs a third protein for polymerization. Digging deeper in PDB for structures of ACTB, I found structure 6ANU, which shows the same interface that AF predicted between ACTB and ACTG1, so the interface is probably right. This is also a very interesting case. Based on the review by Vedula and Kashina (J Cell Sci 2018, 10.1242/jcs.215509), it is still an open question whether the different actin forms that exist in human can form heteropolymers or not. Some studies find this in vitro, others find intermingled homopolymers of beta and gamma actin. Both actins co-occur in many cell types while alpha-actin is more specifically expressed in muscle.
It seems really tricky to solve this since actins are highly studied and also extremely similar in their sequence, so it could be that in a somewhat artificial system, beta and gamma actin can interact because the interface residues are identical, but in vivo they would rather not interact and instead form homopolymers. In the end, whether ACTB and ACTG1 indeed interact in vivo is the only open question.

run41: RARB-PSMC5
PSMC5 has been repeatedly modelled by AF to have a high confidence peptide that binds to partners with a Hormone_recep domain. The peptide is 132-141 DPLVSLMMVE. Residues highlighted in bold are the ones tucked into the hydrophobic pocket. However, this peptide does not match the consensus of LIG_NRBox (^PL..LL^P); in particular, in this peptide P precedes the first L. I am not sure why P is disallowed at the first position, as ELM does not describe much about the sequence composition of the motif. I think it might be too early to reject this peptide because the highlighted residues are indeed hydrophobic and can serve similar functions as those in the regex. I also looked at the HuRI network of PSMC5 and found that the interactors seem to be enriched for the Hormone_recep domain, making this interface even more plausible.

run42: DCX-BICD2
DCX has two DCX domains and all good predictions involve the N-terminal DCX domain. The N-terminal DCX domain is known to bind Tubulin. AF modelled a different interface on the N-terminal DCX domain to bind to disordered fragments from BICD2. The DCX domains have a C-terminal part that is not confidently predicted by AF to be part of the fold. When excluding this part from the first DCX domain, AF models peptides to bind to the area where this last part is predicted to be located in the monomeric structure from AF. When we use a DCX domain that contains this last bit, AF predicts other peptides from BICD2 to bind on the opposite side of DCX. There is no consistency in these predictions.
There are no other predictions between ordered-ordered or disordered fragments binding to ordered domains in BICD2 that make the cutoffs. BICD2, however, also only consists of large helices. Nonetheless, it could be that both DCX domains together bind to one of these coiled-coil helices in BICD2.

run43: DCX-ZBTB10
A possible prediction involves the first DCX domain of DCX and a peptide of ZBTB10 261-271. This prediction is not influenced by the actual domain boundaries because the peptide is not docked into the pocket where a region a little C-terminal of the domain might bind. This is, however, the case for the second best prediction involving the first DCX domain and peptide 604-614. According to chainB inf avg plddt these are the only two predictions that make the cutoff when looking at chainB as a disordered region. ZBTB10 has a lot of disorder and probably many motifs. DCX has two DCX domains and a bit of disorder. According to available PDB structures, the DCX domains are known to bind to microtubules. There is one structure with the first DCX domain bound to microtubules (6RFD). It seems, though, that the pocket where ZBTB10 261-271 is predicted to bind is not occupied in this complex. AF does not predict slightly extended versions of this peptide to bind to this pocket with reasonable confidence. A peptide was also predicted to bind in beta-sheet augmentation to the last beta strand of the BTB domain with reasonable model confidence and chainA_intf_avg_plddt scores, but ZBTB10 might have its own beta strand C-terminal of the current domain boundaries, which AF predicts to complement the last beta strand of the domain in the full-length model of ZBTB10. AF also predicts a contact between the ZnF domain of ZBTB10 and the first DCX domain but it does not look very likely and I think the ZnF fold is perturbed.

run44: PSMC5-ESRRG
The interaction has quite a few high confidence predictions.
The highest scoring peptide is P62195, PSMC5, 132-141, DPLVSLMMVE. The three hydrophobic residues make nice contacts with the hydrophobic pocket and surface of the domain. Another disordered fragment from PSMC5 binding to the same domain, IKKLWK, also looks promising. However, there is some possibility that these are artefacts because AF is not very specific when it comes to detecting single mutations in known motifs. The sequence alignments are unfortunately not helpful because the whole of PSMC5 is extremely conserved. Nonetheless, the interaction between PSMC5 and ESRRG looks promising because the alternative name of PSMC5 is thyroid hormone receptor-interacting protein 1, TRIP1.

run45: PSMC5-RORB
The highest confidence prediction involves a disordered fragment from PSMC5 and it is the same as in run44. The ordered region from RORB is the same domain, the hormone receptor domain, as in run44. It is interesting to see AF predicting a similar DMI with high confidence for two different proteins. Same observation as run44.

run46: WAC-NFE2L2
WAC and NFE2L2 are largely disordered. WAC has a WW domain. AF recurrently predicts a sequence close to the N-terminus of NFE2L2 to bind to the WW domain; WW domains are known to bind proline-rich motifs. The putative motif in NFE2L2 does not contain prolines and is not docked onto the WW domain in any way resembling other WW domain complexes, e.g. 1EG4. These are likely wrong predictions. While the motif interface pLDDT is reasonably high for these predictions, the model confidence does not reach 0.6. There are no other predictions that make the cutoff.

run47: WAC-MOBP
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).
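The cutoff rule quoted for the uninspected runs (run24, run34, run47) can be sketched in a few lines of Python. This is only an illustration of one reading of that rule, namely that a prediction qualifies if its model confidence reaches 0.7 or if its type-specific interface pLDDT reaches the stated threshold. The class name, field names and function names below are hypothetical and are not taken from the original analysis pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Prediction:
    """One fragment-pair prediction (hypothetical container, not the original pipeline's)."""
    model_confidence: float                         # 0..1 scale
    kind: str                                       # "ordered-disordered" or "ordered-ordered"
    disordered_intf_plddt: Optional[float] = None   # 0..100, disordered fragment interface pLDDT
    intf_avg_plddt: Optional[float] = None          # 0..100, average interface pLDDT


def passes_cutoff(pred: Prediction) -> bool:
    """Assumed reading of the rule: model confidence OR type-specific interface pLDDT."""
    if pred.model_confidence >= 0.7:
        return True
    if pred.kind == "ordered-disordered":
        return pred.disordered_intf_plddt is not None and pred.disordered_intf_plddt >= 70
    if pred.kind == "ordered-ordered":
        return pred.intf_avg_plddt is not None and pred.intf_avg_plddt >= 75
    raise ValueError(f"unknown prediction kind: {pred.kind!r}")


def run_needs_inspection(preds: Iterable[Prediction]) -> bool:
    """A run is inspected manually if at least one prediction passes the cutoff."""
    return any(passes_cutoff(p) for p in preds)


if __name__ == "__main__":
    # Made-up values for illustration only; not taken from any run in the text.
    example_run = [
        Prediction(0.55, "ordered-disordered", disordered_intf_plddt=62.0),
        Prediction(0.64, "ordered-ordered", intf_avg_plddt=71.5),
    ]
    print(run_needs_inspection(example_run))
```

With the made-up values above, neither prediction reaches any threshold, so the run would be reported as not inspected.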
run48: STX1B-FBXO28
The top model has 0.76 model confidence and utilizes the disordered region 1-22 of STX1B and the Fbox + helix bundle domain (63-221) of FBXO28. The interface involves the disordered region of STX1B forming a 3₁₀ helix structure with the helices from the Fbox domain. Note that the Fbox domain annotated by InterPro is from 61-109, while the ordered region that I used for prediction is 63-221. The Fbox domain is known to mediate PPI but it is not used by AF to model the interaction in this prediction. Region 1-22 of STX1B is conserved only in recent homologs. The plddt of the disordered region is low, <60 for all residues. The second top model has 0.75 model confidence and involves the syntaxin domain (23-237) of STX1B and the disordered region 354-368 of FBXO28. The disordered region of FBXO28 is at the C-terminus and conserved. However, the plddt of the peptide is low and it adopts a 3₁₀ helix kind of structure. A slightly different prediction involving fragments of the proteins (27-219 STX1B and 345-363 FBXO28) returned 0.73 model confidence. The peptide adopts a helical structure but is placed on a different surface of the syntaxin domain. Although the peptide 345-363 has good plddt (mostly >60), I am not sure if this is the right interface. One prediction pairs the full length of STX1B with the disordered region 354-368 of FBXO28 and returned 0.71 model confidence. The interface is similar to that of the syntaxin domain (23-237) of STX1B and disordered region 354-368 of FBXO28, with low plddt. This region 354-368 in FBXO28 could be a nuclear localization signal (NLS), as ELMDB also predicts quite a few NLS there, and is therefore unlikely to be the interface for the interaction. The next top prediction has 0.749 model confidence and involves the C-terminus of the syntaxin domain (220-232) of STX1B as disordered region and the Fbox + helices domain of FBXO28 (63-221).
The interface is formed by the peptide adopting a helical structure with the Fbox + helices domain. The pLDDT of the peptide is good, with all residues above 60. Nonetheless, another prediction involving a slightly longer peptide from the same region of STX1B has a much lower model confidence (0.55). The modelled interface is not exactly the same, as it is slightly shifted. I am unsure whether this is a good interface.
I tried to find more molecular studies on the two proteins but could not find much. STX1B is known to function in the docking of synaptic vesicles at presynaptic active zones, while FBXO28 probably recognizes and binds to some phosphorylated proteins and promotes their ubiquitination and degradation. Oddly, STX1B is known to localize to the membrane, while there is not much information on the subcellular localization of FBXO28; studies have shown, however, that it interacts with topoisomerase using its Fbox domain (the bundled helices are not needed for the interaction).
Out of all the predictions, I think STX1B 27-219 + FBXO28 345-363 and STX1B 220-232 + FBXO28 63-221 are most likely to be the interface, as their peptides are modelled with good pLDDT and both achieved a model confidence higher than 0.7.

run49: STX1B-MMGT1
The top prediction involves the Syntaxin domain of STX1B and the disordered region of MMGT1 (23-31) with a confidence of 0.73. A slightly longer fragment has a slightly lower confidence, but looking at the structure, the two peptides are placed at different angles to the Syntaxin helical bundle. Since the interfaces modelled by AF differ a lot between the peptide and its extended counterpart, the modelled interfaces do not look genuine.

run50: STX1B-VAMP2
Interactome3D models an interface between both proteins based on the structure 3HD7/3IDP where STX1A interacts with VAMP2. STX1A and STX1B are very similar in structure.
STX1B is predicted in the closed conformation. We know this conformation from structures of STX1A bound to Munc18, in which the long C-terminal helix comprising the SNARE domain folds back onto the syntaxin domain. However, when bound to VAMP2 we can see the open conformation, where the long helix is made available to bind in a coiled-coil-like manner to the VAMP2 and SNAP25 helices. Based on this available structural information we designed different fragments of the extended SNARE domain of variable length. VAMP2 is a short protein of 116 residues consisting of a long helix and about 30 disordered residues at the N-terminus. The most confident prediction obtained for these fragments is the one modelling a coiled-coil interaction between the extended SNARE domain and the helix of VAMP2, but the model confidence is slightly below the cutoff. Predictions with the disordered N-terminal region of VAMP2 remain far below the cutoffs.

run51: CSNK2A1-CSNK2B
Good prediction with overlapping fragments showing increasing model confidence. This interface has been solved before in two structures, 4DGL and 6Q38; the prediction is highly accurate, and it is probably a DMI that is not in ELM yet.

run52: EBF3-EBF2
Dimerization of the EBF family is already known and solved (3MUJ). As its top prediction, AF predicts the middle domain of both proteins, called TIG, as the dimerization interface, but in head-to-tail orientation, while the structure 3MUJ shows head-to-head orientation. Following closely in terms of score (avg_intf_plddt) is the fragment comprising the TIG domain and the helix-loop-helix domain, which are predicted accurately as seen in the structure. The third best prediction involves the N-terminal DNA binding domain as the dimerization interface. It does not look so convincing to me but still got a very high score. The fourth best prediction is the helix-loop-helix domain alone as the dimerization interface, still with a score of 90.
There are more predictions that make the cutoff; they pair various disordered regions of either protein with ordered fragments from the other and involve interfaces used for dimerization, but I guess that these predictions are likely wrong.

run53: PEX12-TREX1
The disordered region of PEX12 215-312 (98 residues long) is predicted with high confidence. One fragment of it achieved even higher confidence, but when this fragment is fragmented further, the confidence of the sub-fragments is no longer as high. Checking the protein on InterPro shows that the domain involved is the exonuclease domain of TREX1, which binds to ssDNA (2OA8). In this crystal structure, the pocket modelled by AF to bind PEX12 215-312 is bound to ssDNA, with the phosphodiester bond of the ssDNA making interactions with the backbone of the domain chain and a hydrophobic side chain (leucine) making a hydrophobic interaction with the base of the nucleotide. Interestingly, AF seems to have memorized this crystal structure, because the bound ssDNA has a curved structure and AF also models the long disordered region with an odd curve. I think this interface is unlikely to be true because the bound magnesium ions coordinate with the oxygen in the phosphodiester bond of the ssDNA, whereas the modelled helix places hydrophobic side chains into the cavity where the magnesium ions bind.
A very short fragment of PEX12, 12-16 at the N terminus, is modelled with high confidence in a very negatively charged pocket in the domain of TREX1. It is unusual to have a peptide binding pocket with such a high negative charge. Further checking revealed that this domain binds magnesium ions and nucleotides. The short fragment fits into the magnesium binding pocket, and thus this is unlikely to be true.

run54: PRKAR1A-PRKAR1B
The best model is an ordered-ordered prediction with 0.83 confidence. It is a homo-DDI (RIIa domain) dimerization and has been solved in 2EZW.
An additional disordered fragment (PRKAR1A 360-372) is predicted with high model confidence but low pLDDT to bind to the cyclic nucleotide binding domain of PRKAR1B. According to an available structure of the cNMP binding domain (1NE4), there are two beta barrel folds in the domain that bind to cyclic nucleotides. AF fits the disordered fragment on a hydrophobic surface near the beta barrel but not in the cNMP binding pocket. Although this could be another binding site, the binding makes little sense to me because the disordered fragment is at the C terminus of the cNMP binding domain of PRKAR1A, meaning that the sequence would have to loop back to make this contact. As noted above, it seems very likely that the dimerization of the two proteins is mediated by the RIIa domain (N terminus), so it does not seem plausible to me that they make contact again at the C terminus. This is likely a false positive interface.

run55: ASF1A-H4C8
The interaction between both proteins has been solved (5C3I). However, this structure shows that the motif in H4 sits at the very C-terminus and binds by beta-sheet augmentation to ASF1A, in the same pocket that AF predicted but for which AF used an N-terminal peptide of H4. I think the problem is that the C-terminal region of H4 was treated as part of the domain of H4; I agree that this was hard to see from the monomeric AF structure of full-length H4. I checked further down in the predicted structures, but the first ordered-ordered prediction has a model confidence of 0.25 and does not find this mode of binding either. One could rerun this with the C-terminal peptide of H4 as a disordered region just to see whether AF would then get it right, but in principle this is a false positive prediction; the N-terminal peptide also shares no sequence similarity with the C-terminal motif.

run56: RARS1-CCDC115
There is only one prediction that makes the cutoff for model confidence and/or motif pLDDT.
This prediction involves RARS1 1-21 as a disordered fragment that is modelled to bind as a helix to the two-helix coiled-coil domain of CCDC115. A shorter fragment of the motif is placed elsewhere. The helix of CCDC115 to which the peptide is predicted to bind has more hydrophobic residues along the helix on that side, so I would think that a longer partner chain would be able to bind there. Thus, this interface does not seem likely to be true.

run57: UBE3A-TAT
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface pLDDT ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).

run58: VAMP4-MFF
The top prediction pairs two ordered regions that are both helical. Both proteins have only helical ordered regions; the rest is disordered. Interestingly, despite the top predicted interface having only 0.71 model confidence, both chains have very high pLDDT for their residues at the interface (95 for VAMP4 and 90 for MFF). Because of the high pLDDT, it could be a genuine interface. The helix in VAMP4 definitely has an interface there because one side is rather hydrophobic while the other side is rather hydrophilic. MFF could bind there with its helix or via another helix that it has. The binding does not show that many good contacts, i.e. some hydrophobic residues on the VAMP4 helix still remain exposed.

run59: PEX16-MMGT1
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface pLDDT ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).
run60: PLP1-SLC16A2
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface pLDDT ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).

run62: SNRPB-GIGYF1
GIGYF1 is a very long protein with many disordered regions. It has a GYF domain that is known to bind proline-rich sequences. SNRPB has many proline-rich sequences in its C terminus. Some proline-rich motifs are predicted with high pLDDT to bind the GYF domain (these are the top predictions).
Another highly ranked prediction involves the LSM domain of SNRPB with various disordered fragments from GIGYF1. However, according to the InterPro entry as well as structures of the LSM domain, this domain seems to be predominantly involved in multimerization with other SNRP proteins to form the SMN complex involved in splicing (1H64). Therefore, the models involving this domain with disordered fragments look unlikely to be true to me.
Digging deeper into the top predictions and comparing the binding modelled by AF between SNRPB 231-240 and the GYF domain of GIGYF1 with 1L2Z, the peptide is oriented differently. However, 3FMA shows that a peptide can bind to the same surface of GYF in different ways. In 3FMA, chains E and P show a way of binding similar to that modelled by AF. The peptide sequence in 3FMA is also different from that in 1L2Z, but importantly, there are three prolines in the peptide that always orient the same way towards the hydrophobic surface formed by the GYF motif on the GYF domain. This orientation of the three prolines is captured by AF.
AlphaFold repeatedly predicts the PPGM motif in the same pocket. This motif occurs multiple times in the C-terminal tail of SNRPB. On the ELM website, the LIG_GYF motif is described as binding proline-rich sequences, and the structure 1L2Z is cited, but it is also stated that flanking positively charged residues seem to be important for binding to the GYF domain.
Indeed, in the crystal structure there are some negatively charged residues on the GYF domain. Interestingly, the GYF domain from GIGYF1 lacks these residues or has them only partially. It also differs in that it has a deeper hydrophobic pocket, which is filled with a Trp in the crystal structure. So, it could well be that the GYF domain from GIGYF1 binds somewhat different proline-rich peptides.
The interaction between GIGYF1 and SNRPB has not been described before other than in HuRI. Functionally, it would probably be a new connection because GIGYF1 is not known to function in splicing as far as I can see and is thought to be localized to the cytoplasm. GIGYF1, however, also interacts with SNRPA and SNRPC in HuRI. They also have one or more occurrences of the PPGM motif. If this mode of binding is true, then it would be a somewhat new mode of binding or, in the most conservative case, an extension of the known binding mode of LIG_GYF.
Alignment of 1L2Z chain A (GYF domain) with the GYF domain from GIGYF1 (476-535) shows that the sequences are not very conserved. Structural superimposition of the two GYF domains reveals that the overall fold is conserved, including the majority of the binding pocket except for the hydrophobic pocket filled with a W. The peptides of the two structures have their PPPG in similar orientation. The next residue is an M in SNRPB, which is tucked into the hydrophobic pocket, and an H in 1L2Z, which is exposed to the environment. The residue that follows is an R in both, with the one in SNRPB exposed to the environment and possibly forming a hydrogen bond with the Q on the domain, and the one in CD2 (1L2Z) forming a salt bridge with an E from the domain.
Later, a structure of the GYF domain of GIGYF1 binding to a similar motif found in TNRC6 was published, further supporting the correctness of these predictions.
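The observation that the PPGM motif recurs in the partner sequences can be checked with a simple overlapping pattern search. The sketch below is illustrative only: the function name is hypothetical, and the example sequence is a made-up toy string, not the real C-terminal tail of SNRPB. A lookahead is used so that overlapping occurrences in proline-rich stretches are not missed.

```python
import re
from typing import List


def find_motif_positions(sequence: str, pattern: str) -> List[int]:
    """Return 1-based start positions of all (possibly overlapping) matches."""
    # A zero-width lookahead lets finditer report matches that overlap each other.
    lookahead = re.compile(f"(?=({pattern}))")
    return [match.start() + 1 for match in lookahead.finditer(sequence)]


# Toy sequence for illustration only (NOT a real protein sequence).
toy_tail = "AAPPGMRGPPPPGMRPPGMA"
positions = find_motif_positions(toy_tail, "PPGM")
print(positions)  # [3, 11, 16]
```

The same function accepts a regular expression, so a degenerate motif definition could be passed instead of the literal string.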
run63: ARHGEF9-VEZF1
The top prediction has 0.74 model confidence, with the fragment from VEZF1 (375-385) making contact with the RhoGEF domain of ARHGEF9. The top predictions all put the peptide at the same binding site of the RhoGEF domain. In terms of conservation, all the peptides from VEZF1 are well conserved. Nonetheless, the prediction looks very questionable. At least the predictions do not seem to make use of the GTP/GDP binding pocket. I did not find a structure that shows precisely where this pocket is located, but based on the abstract of an article and InterPro entries it seems to lie between the two structural entities that form one larger domain, the GEF domain and the PH domain (IPR000219).
There is absolutely no consistency between the two peptides from VEZF1 selected to bind to the same surface on the GEF domain of ARHGEF9. VEZF1 also seems to be an unusual type of protein; AF has a hard time making sense of it.

run64: MIP-MFF
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface pLDDT ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).

run65: VEZF1-PRKAR1B
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface pLDDT ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).

run66: VEZF1-KCTD7
The top prediction involves the disordered region of VEZF1 (360-380) and the BTB domain of KCTD7. The disordered region overlaps with the top prediction in run63, which models the interface between VEZF1 (375-385) and the RhoGEF domain of ARHGEF9. Although AF models a 3₁₀ helix structure in the disordered region of VEZF1 (360-380), the contacts modelled at the interface do not look very convincing.
It could be that the disordered region (360-385) is a functional motif for other interactions and that AF detects this and tries to fit it into the domain. It could also be that forming the binding interface requires multiple copies of the BTB domain, which were not used in this prediction.
The VEZF1 peptide is put in the same pocket as the PAX6 peptides from run23, but the sequences look different; it is, however, the same VEZF1 peptide as in the prediction with ARHGEF9.

run67: APTX-FLAD1
There are overlapping fragments with increasing confidence: the N-terminal disordered region of APTX, 5-12 and 6-13, paired with MoCF_biosynth or with a domain of unknown type (not matched to a Pfam or SMART domain) that lies between the MoCF and PAPS_reduct domains of FLAD1. AF also predicts the same N-terminal region of APTX to bind into the PAPS_reduct domain.
The disordered fragments from the region 8-15 of APTX showed high model confidence but a pLDDT score below the cutoff when modelled with the PAPS_reduct domain of FLAD1. The structure of the PAPS_reduct domain in complex with adenosine phosphosulfate shows that the peptide is modelled by AF in the binding pocket of adenosine phosphosulfate. This is likely a false prediction.
For the N-terminal part of APTX, AF is quite confident when it models it into the MoCF domain or the other, unknown domain of FLAD1. There are multiple predictions with different overlapping fragments that make the cutoff. However, AF is more confident in both metrics when the peptide is modelled into the MoCF domain. This domain has a substantial pocket that, in the monomeric structure of FLAD1, is occupied by another region of FLAD1 with low pLDDT. However, when APTX 10-15 is used for modelling, the orientation of the peptide is reversed. The MoCF_biosynth domain is known to trimerize for its activity and is known to bind molybdopterin.
MoCF_biosynth binds molybdopterin at a site close to where AF models the peptide (refer to 1DI6, https://doi.org/10.1074/jbc.275.3.1814, which solves the structure of a bacterial protein with the same domain; the authors mention 49D and 82D as important for catalytic activity).
APTX with the unknown domain of FLAD1 does not reach the model confidence cutoff, only the motif pLDDT cutoff. AF places the same peptide as a beta-sheet augmentation of the domain, while in the predictions for the MoCF domain the peptide is put in a helical conformation.
The only predictions where disordered regions in FLAD1 are predicted to bind to folded regions in APTX involve the FHA domain of APTX and correspond to two completely different disordered regions in FLAD1.

run68: FBXO28-PSMC3
The top prediction is a coiled-coil interaction between regions from the two proteins that are modelled by AF monomer as long helices. The pLDDT of all residues is very high. This interaction looks convincing. The only problem is that one helix is shorter than the other, while in a common coiled-coil interaction both helices are usually equally long.
The second best prediction based on model confidence involves a disordered region from FBXO28 (51-61). The modelled complex does not look convincing because the peptide is quite hydrophobic and the residues do not make much contact with the domain. The peptide is predicted to bind to the first domain of PSMC3 which, as far as I was able to find, does not have catalytic activity.
Only these two predictions make the cutoff for model confidence; none make the cutoff when looking for disordered regions in PSMC3 predicted to bind to FBXO28. The other way round, there is the peptide mentioned above and a C-terminal disordered region of FBXO28 predicted to bind to the same first domain in PSMC3 but on a different side. The C-terminus of FBXO28 is very charged and may be a localization signal.
Both motifs in FBXO28 are somewhat recurrently predicted to bind to the domain in PSMC3.

run69: CAMK2G-ESRRG
There are many high-confidence predictions in a disordered region of CAMK2G. The whole disordered region used as a fragment for prediction also returned high confidence (0.78). In this long disordered region, AF puts the peptide with the third highest model confidence in the domain pocket. The top three predictions are very similar in terms of confidence. The motif detected by AF resembles LIG_NRBOX with the motif L..LL.
CAMK2G 300-310: LKGAILTTMLV -> looks plausible to me because the M is hydrophobic and could substitute for the role of L in the regex.
CAMK2G 315-325: SAAKSLLNKKS -> also possible, but the A is fitted into a quite deep hydrophobic pocket where a known structure (refer to run21) shows that it is an L that fits into the pocket. A might have too short a hydrophobic side chain to make good contact with the deep pocket.
CAMK2G 355-365: QEPAPLQTAME -> not so good in my opinion because the hydrophobic contact is less extensive than for the peptides above.
Another interesting observation: the CAMK2G 285-423 (139 aa) prediction resulted in 0.78 model confidence, which is very high for a disordered region that long. In this case, CAMK2G 300-310 is fitted into the hydrophobic pocket, adding weight to the idea that this could be the correct peptide. This reminds me of the extension analysis with DMIs, where extension of the motif can improve prediction results.
A pairing of two ordered regions returned high confidence (0.83). This involves the Zn finger from ESRRG and the CaMKII association domain at the C terminus of CAMK2G. The binding is close to but not in the Zn binding pocket, which is good. The CaMKII association domain of CAMK2 has been shown to oligomerize with other CAMK2 in 1HKX.
Looking at the monomeric structures of ESRRG and CAMK2G, it looks possible that the C-terminal association domain of CAMK2G binds to the Zn finger domain of ESRRG while the hormone receptor domain of ESRRG binds to the long disordered region separating the two domains of CAMK2G. This would make a multi-site binding between two proteins and a very interesting case.

run70: XRCC4-LIG4
The structure for this interaction has been solved: 3II6 and 1IK9. In the structure 3II6, XRCC4 first forms a homodimer with its coiled-coil domain, and the tandem BRCT domains of LIG4 then bind around the homodimer. The BRCT domains are separated by a structurally less defined region that most likely forms two helices upon binding to XRCC4. I am not sure if this can be seen as a domain-motif or a domain-domain interaction; it is probably something in between. It is not so clear from the monomeric AF model of full-length LIG4 that both BRCT domains form a functional unit, but I guess one could also have made a fragment comprising both domains and the linker sequence. Runs so far were made with both BRCT domains individually and the linker sequence individually; a rerun has to be done using the BRCT domain tandem as one structural unit.
The top prediction involves a motif at the C-terminus of XRCC4 that is predicted to bind to the last BRCT domain of LIG4. I think the prediction is wrong because of the solved structure. The prediction also does not look like how other motifs bind to BRCT, e.g. the protein FANCJ (LIG_BRCT_BRCA_1). However, the C-terminus of XRCC4 certainly carries one or two motifs. One is annotated in Proviz as WD40 domain binding. The very C-terminus is a class 3 PDZ-binding motif. The whole region is very conserved. Maybe this is why AF tries to put peptides from this C-terminus in various domains, including the DNA ligase domain of LIG4 (fourth top prediction).
So, the top two predictions involve this C-terminus and reach high confidence in both metrics (model confidence and intf_avg_plddt).
The third highest prediction involves the XRCC4 N-terminal domain plus one long helix (taken as one ordered region) and the second BRCT domain. This interface is exactly the same interface that is seen in the structure 3II6, where part of the BRCT domain also contacts the XRCC4 helix.
The sixth best prediction involves the linker between both BRCT domains and the XRCC4 helix. Despite the fact that XRCC4 is in monomeric form in our prediction and that the BRCT domains are missing, AF correctly models the contacts between the linker and the single XRCC4 domain as they can be seen in the structure 3II6. This model meets both cutoffs, for model confidence and pLDDT.
The rerun using the BRCT domain tandem as one structural unit has been completed. The tandem BRCT fragment paired with the coiled-coil XRCC4 fragment ranks seventh based on model confidence and second among ordered-ordered fragment pairs when ranked by average interface pLDDT. The prediction that is still ranked first is the single BRCT domain binding to the coiled-coil fragment (92 vs 89 average interface pLDDT).

run71: TMEM237-MFF
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface pLDDT ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).

run72: HNRNPK-TH
In the full-length structural model of HNRNPK, the first two KH domains are predicted to pack against each other using an interface that is also predicted to bind to the TH peptide 61-71. This region indeed overlaps with a Pfam HMM that seems to find some pattern in this disordered region, but nothing is known about this “structural”(?) motif. The HMM predicts three occurrences of it in the N-terminal region of TH, but the third one is the most conserved and this is the one predicted to bind to the second KH domain.
Two other motifs overlapping with 61-71 are also predicted to bind to this KH domain. The residues that are part of all three motifs are predicted to bind to the KH domain in the same way. One prediction below the model confidence cutoff predicts the motif to bind to the third KH domain but in a different way.

run73: OTX2-RPS26
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface pLDDT ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).

run74: MFF-MMGT1
Not inspected because none of the predictions returns model confidence or average interface pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered fragment interface pLDDT ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75).

run75: PUF60-TH
The top prediction uses both RRM domains of PUF60 as one ordered region and a disordered polyA peptide from TH. The peptide is put at the same position where the Nbox would bind, as shown in the NMR structure 2KXH. However, the predicted peptide has a somewhat different sequence (solved structure: LxxAxxI, model: VxxAxxV), and there are no recurrent predictions. Another prediction involves the third RRM domain of PUF60 and another peptide in TH that tucks a Trp into a pocket, but it does not look very convincing.
Predictions involving disordered fragments from PUF60 and the ordered region (Biopterin_H domain) from TH returned a maximum model confidence of 0.78. This is likely a false interface because the short peptide is fitted into the biopterin and iron binding pocket of the enzymatic domain (refer to run72 for example). The second best prediction is fitted at the same site and is therefore also likely a false interface.
Interestingly, the disordered region of PUF60 302-461 is modelled with 0.69 model confidence with the Biopterin_H domain of TH.
The long disordered region makes contacts with two regions of the domain, one at the iron binding site (likely false) and another, a coiled-coil interaction, at the C-terminal helix of the Biopterin_H domain. This coiled-coil interaction is repeated in a shorter disordered fragment of PUF60 (317-347; third best prediction, 0.77; involving the same C-terminal helix as for the long disordered region). This coiled-coil interaction looks like a plausible interface.
I tried to find more information about this ACT-like domain, but to no avail. InterPro says that it homo-dimerizes using the beta strands as in 1Q5V, but the fold is not exactly the same. The ACT-like domain in TH is special in that the last beta strand is formed by its N and C termini looping back to meet each other.

run76: PUF60-QRICH1
One long disordered region of PUF60 (1-128) is modelled with high model confidence with the DUF of QRICH1. In this region, 111-121 is modelled at the interface. This region, when used as a separate fragment, also showed high confidence (0.86). This fragment tucks an R into a very deep negatively charged pocket, but the rest of the peptide seems to make questionable contact with the DUF domain.
The top predictions with the ordered region in QRICH1 and peptides in PUF60 put either the linker helix between the first two RRM domains, the N-terminal long helix in PUF60, or another helical peptide at 442-461 at two different places on the DUF domain. I think that the helical linker between both RRM domains is not accessible for this mode of binding because the key residues are making intramolecular contacts with the RRM domains in the AF monomer PUF60 model.
Three different peptides are predicted to bind to the tandem PUF60 domain.
In principle, the long disordered N-terminal region of QRICH1 is full of potential helical peptides of the pattern hydrophobic-x-x-Ala-x-x-hydrophobic. This kind of peptide resembles the Nbox motif that can bind to PUF60, and the three different peptides are also predicted to bind to the same pocket. There are also four different peptides in QRICH1 predicted to bind to the third RRM domain.

run77: MAB21L2-AP1S2
The top prediction involves the Clat_adaptor_s domain of AP1S2 with the disordered fragment (215-220) of MAB21L2 (78 motif pLDDT, 0.77 model confidence). The motif is predicted recurrently with variable length, but the disordered region is generally very short because it is a loop within the domain of MAB21L2. AF also made a disulfide bridge between motif and domain. I am not sure this is correct. The structure 1W63 shows the large AP1 clathrin adaptor core complex, which contains a fold similar to the one in AP1S2; in it, the region where the peptide is predicted to bind would in principle be accessible for binding.
The Clat_adaptor_s domain is known from ELMDB to bind motifs, but no structure of this domain with a bound peptide has been solved. The disordered fragments mentioned above also do not match any ELM class that binds to Clat_adaptor_s.
Other good predictions use the Mab-21 domain of MAB21L2. Two overlapping disordered fragments (146-154, 0.68 and 153-157, 0.75) had good confidence with the domain, but they are modelled at different binding sites, so it does not look likely to me that this is the binding region.

run78: PRKAR1B-QRICH1
The motif in PRKAR1B is at the very C-terminus of the protein and also matches a PDZ-binding motif. There is only one prediction that makes the model confidence cutoff but it does not meet the pLDDT cutoff.
The C-terminal peptide of PRKAR1B binds to the only domain of QRICH1, but extended or smaller versions of the motif are only predicted to bind to the domain with a very low score, so there is no recurrence here. The prediction therefore looks unlikely to be functional. No other predictions make the pLDDT cutoff.

[Figure S1, panels A-M. Recoverable axis and legend labels: pairwise domain sequence identity (global alignment) %, frequency, model classes (high, medium, acceptable, incorrect), motif all atom RMSD (Å), DockQ, with template vs without template, ELM class (DEG, DOC, LIG, TRG, MOD), solved motif secondary structure (helix, strand, loop), structures solved by (X-ray, others), average motif hydropathy, motif symmetry score, motif probability, model confidence, Pearson correlation coefficient.]

Appendix Figure S1. Benchmarking of AF on DMI interfaces using minimal interacting regions. A Pairwise sequence identity of domains in the DMI positive reference dataset. B Proportion of high, medium, acceptable and incorrect models predicted by AF from the positive reference dataset as classified by the DockQ score. C Scatterplot of DockQ vs motif RMSD for DMIs from positive benchmark dataset. Pearson r = -0.85, p-value < 0.0001. D-E Motif RMSD and DockQ scores of structures for DMIs from positive benchmark dataset predicted by AF with and without the use of templates. Motif RMSD: Pearson r = 0.81, p-value < 0.0001. DockQ: Pearson r = 0.88, p-value < 0.0001. F Accuracy of AF DMI predictions stratified according to the annotated functional categories of DMIs in the ELM DB. DEG=degron, DOC=docking, LIG=ligand, TRG=targeting, MOD=modification. G Accuracy of AF DMI predictions stratified according to the secondary structure element formed by the motif in the solved structure. H-J Scatterplot of various motif features vs motif RMSD determined for models and structures of DMIs from positive benchmark dataset: H motif hydropathy, Pearson r = -0.03, p-value = 0.72, I motif symmetry, Pearson r = -0.08, p-value = 0.38, J motif regular expression degeneracy, Pearson r = -0.04, p-value = 0.66.
K Accuracy of AF DMI predictions stratified according to the method used to solve the structures in the benchmark dataset, two-sided Mann-Whitney-Wilcoxon test, p-value = 0.017, test statistic = 811. L Scatterplot of model confidence of predicted models vs motif RMSD determined from superimposing the predicted models with structures of DMIs from the positive benchmark dataset. Pearson r = -0.55, p-value < 0.0001. M Correlation matrix of different prediction variables and prediction outcomes.

[Figure S2, panels A-J. A: ROC curves (true positive rate vs false positive rate); B: precision-recall curves (precision vs recall) with average precision below; columns: 1 mutation in motif, 2 mutations in motif, randomly paired DMI; metrics: model confidence, domain chain interface pLDDT, motif chain interface pLDDT, average interface pLDDT, pDockQ, iPAE, residue-residue contact, atom-atom contact, random predictor. C: ROC curve, legend "Mean of DockQ between predicted models: 0.77". D: LIG_MYND_1: ZMYND11 & MGA (2ODD). E: LIG_MYND_3: EGLN1 & FKBP8 (2ODD). Motif sequences shown: RPMPPKLAPGLKV, LSELPPLEDMGQP. F: CREB3 (78-81) - HCFC1. G: THAP1 (134-137) - HCFC1. H: E2F1 (97-100) - HCFC1. I-J: Replicate 1, Replicate 2.]

Appendix Figure S2. Benchmarking and application of AF for DMI interface prediction using minimal interacting fragments.
A Receiver operating characteristic (ROC) curve of various metrics extracted from AF models when using the DMI benchmark dataset as the positive reference and the following sets as random reference: left, 1 mutation introduced in conserved motif position; middle, 2 mutations introduced in conserved motif positions; right, randomly shuffled domain-motif pairs. B Precision recall curve of various metrics determined for benchmark datasets as in A. C ROC curve of mean DockQ between the top five AF structural models returned for a given input, assessed using the DMI positive reference set and random pairings of domains and motifs as in A. The AUROC of the metric is indicated in the legend of the ROC curve. D-E Superimposition of AF structural model for motif class LIG_MYND_1 (D) and LIG_MYND_3 (E) (orange) with homologous solved structures (PDB:2ODD) from motif class LIG_MYND_2 (blue). The motif sequence used for prediction is indicated at the bottom, colored by pLDDT (dark blue=highest pLDDT). F-H AF models for three motif instances (orange) of LIG_HCF-1_HBM_1 predicted to bind into a pocket on the Kelch domain of HCFC1 (gray). Motif positions are indicated below the figures. The key tyrosines of the motif sequences are drawn as sticks. I BRET50 estimates from fitting titration curves shown in Fig 1G are plotted vs. BRET values that were corrected for bleedthrough and measured at a 2:50 ng DNA transfection ratio for wildtype and mutant CREBZF-HCFC1 pairs. Error bars indicate the standard error. Data is shown for two technical replicates for the first biological replicate and three technical replicates for the second biological replicate. J Fluorescence and total luminescence are shown for wildtype and mutant CREBZF-HCFC1 pairs measured at a 2:50 ng DNA transfection ratio.
Error bars indicate STD of two technical replicates for the first biological replicate and three technical replicates for the second biological replicate. Coloring as in I.

[Figure S3, panels A-E. A-C: heatmaps of motif extension step (0-6) vs domain extension step (0-3), colored by average motif chain interface pLDDT (A), average pDockQ (B), and average iPAE (C). D: ROC curves (true positive rate vs false positive rate) with area under the curve below. E: precision-recall curves (precision vs recall) with average precision below. Columns in D and E: minimal motif + minimal domain, minimal motif + extended domain, extended motif + minimal domain, extended motif + extended domain. Metrics: model confidence, domain chain interface pLDDT, motif chain interface pLDDT, average interface pLDDT, pDockQ, iPAE, residue-residue contact, atom-atom contact, random predictor.]

Appendix Figure S3.
Effect of protein fragment extensions on the accuracy of AF predictions. A-C Heatmap of the average motif interface pLDDT (A), pDockQ (B), and iPAE (C) for combinations of different motif and domain sequence extensions using a positive reference set consisting of 31 DMI structures. Extensions as in Fig 2A. D ROC curves (top) and corresponding AUROC values (bottom) of various metrics extracted from AF models when using the DMI extension dataset split by different combinations of motif and domain extensions as indicated on the top of each graph. Gray horizontal line indicates the AUROC of a random predictor. E Precision recall curves (top) and area under the precision recall curve as quantified by average precision (bottom) for various metrics extracted from AF models determined for benchmark datasets as in D.

[Figure S4, panels A-F. A: true positive rate (left) and false positive rate (right) per metric for the groups minimal motif + minimal domain (all), minimal motif + minimal domain (extended), minimal motif + extended domain, extended motif + minimal domain, extended motif + extended domain. B: table of motif RMSD per ELM class, reproduced below; color legend: correct side-chain, correct backbone, correct pocket, wrong pocket. C: extension 0 vs extension 1 models, with motif sequences: LIG_HOMEOBOX: PBX1 & HOXB-1 (1B72), TARTFDWMKVKR; DOC_MAPK_JIP1_4: MK10 & 3BP5 (4H3B), DQFPAVVRPGSLDLPSPVSLS; LIG_Pex14_3: PEX14 & PEX5 (4BXU), VASEDELVAEFLQDQNAP; LIG_GYF: CD2BP2 & CD2 (1L2Z), PGHRSQAPSHRPPPPGHRVQHQPQKRP; LIG_RPA_C_Vert: RPA2C & UNG (1DPU), EPGTPPSSPLSAEQLDRIQRNKAAALLRLAARNVPVGFGESWKKHLSG. D-F: randomly paired DDI; ROC curve (true positive rate vs false positive rate), precision-recall curve (precision vs recall), and average precision for model confidence, average interface pLDDT, pDockQ, iPAE, residue-residue contact, atom-atom contact, random predictor.]

ELM class          RMSD Ext 0   RMSD Ext 1
LIG_RPA_C_Vert     37.52        3.35
LIG_HOMEOBOX       24.84        0.49
LIG_Pex14_3        10.84        2.77
LIG_GYF            12.47        7.42
LIG_CAP-Gly_2      5.64         0.89
LIG_NBox_RRM_1     6.46         2.09
DOC_MAPK_JIP1_4    2.11         1.07

Appendix Figure S4. Effect of protein fragment extensions on the accuracy of AF predictions. A True and false positive rate (left and right, respectively) based on optimal cutoffs from Fig 2D derived for different metrics from ROC analysis for benchmarking AF with different motif and domain extensions from the reference dataset illustrated in Fig 2A and random pairings of domain and motif sequences. B Table indicating the motif RMSD achieved when using minimal (extension 0) or extended motif sequences for structure prediction for all inspected motif extension cases. Extension 1 refers to extension of the minimal motif sequence by the length of the motif to the left and right. Color coding indicates the accuracy classes of the respective structural models as shown in Fig 1A. C Superimposition of the structural model of the minimal (left, orange) or extended (right, yellow) motif sequence with the solved structure (motif in blue) for five different motif classes as indicated on the top of each panel. The motif sequence from the solved structure is indicated at the bottom of each panel. Motif residues are underlined, motif residues not resolved in the structure have a gray background.
Sticks indicate the motif residues, domain surfaces are shown in gray based on experimental structures. D ROC curves of different metrics using the DDI benchmark dataset as positive reference and random shuffling of domain-domain pairs as negative reference. E Precision recall curves of different metrics extracted from AF models determined for benchmark datasets as in D. F Area under the precision recall curve as quantified by average precision for metrics extracted from AF models determined for benchmark datasets as in D. Gray horizontal line indicates the average precision of a random predictor.

[Figure S5, panels A-H. A-D: scatterplots of AlphaFold MMv2.2 vs AlphaFold MMv2.3 for motif all atom RMSD (Å), DockQ, model confidence, and motif chain interface pLDDT. E-F: heatmaps of motif extension step vs domain extension step for AF v2.2 and AF v2.3, colored by log2(RMSDmin/RMSDext). G: AUROC AF-MMv2.2 vs AUROC AF-MMv2.3. H: AP AF-MMv2.2 vs AP AF-MMv2.3. Columns in G and H: 1 mutation in motif, 2 mutations in motif, randomly paired DMI, randomly paired DDI. Metrics: model confidence, domain chain interface pLDDT, motif chain interface pLDDT, average interface pLDDT, pDockQ, residue-residue contact, atom-atom contact.]
Appendix Figure S5. Comparison of AF v2.2 and v2.3 prediction performance. A Scatterplot showing the motif RMSD obtained from structural models computed either with AF v2.2 or AF v2.3 using the minimal interacting regions of all annotated DMIs. B-D Scatterplots computed as in A showing the DockQ (B), model confidence (C), and motif chain interface pLDDT (D) for both AF versions. E-F Heatmaps showing the fold change in motif RMSD obtained for structural models from AF v2.2 (E) and AF v2.3 (F) upon domain and/or motif sequence extension compared to when using minimal interacting regions. Positive values indicate improved predictions from extension and negative values indicate worse prediction outcomes. G Scatterplots showing the AUROC obtained for different metrics derived from structural models from benchmarking AF v2.2 and AF v2.3 using the minimal interacting regions of all annotated DMIs or DDIs as the positive reference dataset and different random reference datasets: left (DMI), 1 mutation introduced in conserved motif position; middle-left (DMI), 2 mutations introduced in conserved motif positions; middle-right (DMI), randomly shuffled domain-motif pairs; right (DDI), randomly shuffled domain-domain pairs. Corresponding ROC curves for AF v2.2 and AF v2.3 are shown in Fig. S2A, S4D, and S6A. H Scatterplots as in G plotting the average precision (AP) obtained from PR curves from the same analysis as in G. Corresponding PR curves for AF v2.2 and AF v2.3 are shown in Fig S2B, S4E and S6B.
[Figure S6, panels A-D. A: ROC curves (true positive rate vs false positive rate). B: precision-recall curves (precision vs recall). Columns in A and B: 1 mutation in motif, 2 mutations in motif, randomly paired DMI, randomly paired DDI. Metrics: model confidence, domain chain interface pLDDT, motif chain interface pLDDT, average interface pLDDT, pDockQ, residue-residue contact, atom-atom contact, random predictor. C: optimal cutoff, true positive rate, and false positive rate per metric for the groups minimal motif + minimal domain (all), minimal motif + minimal domain (extended), minimal motif + extended domain, extended motif + minimal domain, extended motif + extended domain. D: bar plot of the fraction of fragments above threshold for randomly paired proteins.]

Appendix Figure S6. Performance of different metrics derived from structural models when benchmarking AF v2.3 for DMI predictions. A ROC curves obtained for different metrics derived from structural models from benchmarking AF v2.3 using the minimal interacting regions of all annotated DMIs or DDIs as the positive reference dataset and different random reference datasets: left (DMI), 1 mutation introduced in conserved motif position; middle-left (DMI), 2 mutations introduced in conserved motif positions; middle-right (DMI), randomly shuffled domain-motif pairs; right (DDI), randomly shuffled domain-domain pairs. B PR curves computed for the same datasets and AF version as in A. C Optimal cutoff, true, and false positive rate derived for different metrics from ROC analysis for benchmarking AF v2.3 with different motif and domain extensions from the reference dataset used in Fig 2A and randomly shuffled domain-motif pairs. D Fraction of fragment pairs with structural models scoring above thresholds for 20 randomly shuffled domain-motif pairs. Numbers on top of the bars indicate the total number of fragment pairs submitted for interface prediction to AF for each random protein pair.

[Figure S7, panels A-C. Panel labels: Replicate 1, Replicate 2; motif iii mutated, motif iv mutated, domain mutated.]

Appendix Figure S7. Expression and BRET50 plots for TRIM37-PNKP and ESRRG-PSMC5. A Fluorescence and total luminescence are shown for wildtype and mutant TRIM37-PNKP pairs measured at a 2:50 ng DNA transfection ratio. Error bars indicate STD of three technical replicates. Data is shown for two biological replicates.
B BRET50 estimates from fitting titration curves shown in Fig 4H are plotted vs. BRET values that were corrected for bleedthrough and measured at a 2:50 ng DNA transfection ratio for wildtype and mutant ESRRG-PSMC5 pairs. Error bars indicate the standard error. Data is shown for three technical replicates for two biological replicates each. BRET50 estimates for the second biological replicate for the ESRRG_M437F-PSMC5 pair were omitted from the graph because they exceeded the upper y-axis limit. Roman labels refer to interfaces shown in Fig 4E. C Fluorescence and total luminescence are shown for wildtype and mutant ESRRG-PSMC5 pairs measured at a 2:50 ng DNA transfection ratio. Error bars indicate STD of three technical replicates. Data is shown for two biological replicates.

[Figure S8, panels A-L. Panel labels: Replicate 1, Replicate 2; interface i with labeled residues L8 and L152; interface ii with labeled residues I355, N134, and R348.]

Appendix Figure S8. Structural models, expression, and BRET50 plots for STX1B-FBXO28 and STX1B-VAMP2. A BRET50 estimates from fitting titration curves shown in Fig 5C are plotted vs. BRET values that were corrected for bleedthrough and measured at a 2:50 ng DNA transfection ratio for wildtype and mutant STX1B-VAMP2 pairs. Error bars indicate the standard error. Data is shown for three technical replicates for two biological replicates each. B Fluorescence and total luminescence are shown for wildtype and mutant STX1B-VAMP2 pairs measured at a 2:50 ng DNA transfection ratio. Error bars indicate STD of three technical replicates. Data is shown for two biological replicates. C Data shown as in A for wildtype and mutant FBXO28-STX1B pairs relating to interface iii (Fig 5A,D). D Data shown as in B for wildtype and mutant FBXO28-STX1B pairs shown in C. E Structural model corresponding to interface i shown in Fig 5A. Mutated residues on the domain (green) and motif side are labeled.
F BRET titration curves are shown for wildtype and mutant FBXO28-STX1B pairs relating to interface i shown in E with two biological replicates, each with three technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. G Data shown as in A for wildtype and mutant FBXO28-STX1B pairs relating to interface i. H Data shown as in B for wildtype and mutant FBXO28-STX1B pairs relating to interface i. I Structural model corresponding to interface ii shown in Fig 5A. Mutated residues on the domain (green) and motif side are labeled. J Data shown as in F for wildtype and mutant FBXO28-STX1B pairs relating to interface ii. K Data shown as in A for wildtype and mutant FBXO28-STX1B pairs relating to interface i. L Data shown as in B for wildtype and mutant FBXO28-STX1B pairs relating to interface i.

[Figure S9, panels A-H. Panel labels: interface v with labeled residues T90, F29, and L107; PDB:3MK4; "PEX3-PEX19 disrupt"; "PEX3-PEX19 bind".]

Appendix Figure S9. Structural models, expression, and BRET50 plots for PEX3-PEX19 and PEX3-PEX16. A Structural model of PEX3-PEX19 corresponding to interface v as shown in Fig 5G. Mutated residues on the domain (green) and motif side are labeled. B Structure from PDB:3MK4 showing the PEX19 N-terminal motif bound to the PEX3 domain. C BRET50 estimates from fitting titration curves shown in Fig 5H are plotted vs. BRET values that were corrected for bleedthrough and measured at a 2:50 ng (for PEX3 and PEX3_T90Q) or 8:50 ng (for PEX3, PEX3_R54S, PEX3_E272R) DNA transfection ratio for wildtype and mutant PEX3-PEX19 pairs. Error bars indicate the standard error. Data is shown for three technical replicates. The left panel corresponds to mutant constructs that should disrupt binding while mutants shown in the right panel were aimed to disrupt binding to PEX16 and thus should not disrupt binding to PEX19.
D Fluorescence and total luminescence are shown for wildtype and mutant PEX3-PEX19 pairs measured at a 2:50 or 8:50 ng DNA transfection ratio (see panel C). Error bars indicate STD of three technical replicates. E Structural model obtained with AF for the trimeric complex of PEX3 (gray), PEX19 (yellow), and PEX16 (orange) using full length sequences as input. F PEX3 expression levels measured in luminescence units plotted for co-transfections with increasing PEX16 protein amounts measured in fluorescence units. Error bars indicate STD of three technical replicates. G PEX3 expression levels measured in luminescence units plotted for co-transfections with increasing PEX19 protein amounts measured in fluorescence units. Error bars indicate STD of three technical replicates. H Data shown as in D for wildtype and mutant constructs of PEX3-PEX16 pairs. Measures are taken for 2:25 ng DNA transfection ratios.

[Figure S10, panels A-D. Panel labels: GIGYF1 mutants - repl. 1, GIGYF1 mutants - repl. 2, SNRPB mutants - repl. 1, SNRPB mutants - repl. 2.]

Appendix Figure S10. Expression and BRET50 plots for SNRPB-GIGYF1. A BRET50 estimates from fitting titration curves shown in Fig 6D are plotted vs. BRET values that were corrected for bleedthrough and measured at a 2:50 ng DNA transfection ratio for wildtype and mutant SNRPB-GIGYF1 pairs. Error bars indicate the standard error. Data is shown for three technical replicates for two biological replicates each. B Fluorescence and total luminescence are shown for wildtype and mutant SNRPB-GIGYF1 pairs measured at a 2:50 ng DNA transfection ratio. Error bars indicate STD of three technical replicates. Data is shown for two biological replicates. Coloring as in A. C Data shown as in A for wildtype and mutant SNRPB-GIGYF1 pairs fitted from titration curves shown in Fig 6E. D Data shown as in B for wildtype and mutant SNRPB-GIGYF1 pairs shown in C.
Chapter 3

Systematic domain-motif interaction interface and variant characterization using protein interaction profiling

3.1 Development of domain-motif interface predictor tool

To address the lack of mechanistic information on PPIs and the limitations of current bioinformatic tools in predicting PPI interfaces, a former PhD student of our group designed the DMI predictor tool. Here I discuss the workflow of the tool, its performance, and its application to the HuRI interactome.

3.1.1 The workflow of the DMI predictor

The pipeline takes as input the UniProt identifiers of a pair of interacting proteins (e.g. A & B). Within these protein sequences, it uses Hidden Markov Models (HMMs) to identify the presence of known motif-binding domains. At the same time, regular expressions are applied to detect occurrences of known motifs. Using a list of DMI types from the ELM database, the pipeline pairs the identified domains and motifs to generate putative DMI matches (Figure 3.1 A). These DMI matches are then annotated with features such as IUPred and ANCHOR scores (the propensity of the motif region to be disordered and its tendency to adopt a secondary structure upon binding a partner, respectively), the RLC score (motif conservation across orthologs), the degeneracy of the motif type based on its regular expression, the enrichment of the binding domain in the interaction partners, and the frequency of motif-binding domains (Figure 3.1 B). The matches are then scored using a Random Forest (RF) model.

To train and evaluate this model for predicting DMIs, a positive reference set (PRS) and several versions of a random reference set (RRS) were generated. The PRS is based on the 830 known DMI instances from the ELM database, while the RRS was created by randomly pairing proteins and scanning for DMI occurrences (Figure 3.1 B). Each RRS version was paired with the PRS to train separate RF models, and the performance was evaluated on test sets.
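The matching step of the workflow in Section 3.1.1 (regular-expression scan for motifs, then pairing with motif-binding domains of the partner protein) can be illustrated with a minimal sketch. It is not the code of the DMI predictor: the motif class names, regular expressions, and the `DMI_TYPES` table are hypothetical toy values, and domain detection by HMMs is replaced by a precomputed set of domain names per protein. Feature annotation and RF scoring are omitted.

```python
import re
from typing import NamedTuple


class MotifMatch(NamedTuple):
    motif_class: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    sequence: str


# Hypothetical, simplified motif definitions (class name -> regular expression).
# Real ELM regular expressions are more elaborate; these are placeholders.
MOTIF_REGEX = {
    "TOY_PxxP": r"P..P",
    "TOY_Cterm_hydrophobic": r"[ST].[VIL]$",
}

# Hypothetical table of DMI types: motif class -> name of the binding domain.
DMI_TYPES = {
    "TOY_PxxP": "SH3",
    "TOY_Cterm_hydrophobic": "PDZ",
}


def find_motifs(sequence):
    """Return all regex matches of every motif class, including overlapping ones."""
    matches = []
    for motif_class, pattern in MOTIF_REGEX.items():
        # A zero-width lookahead reports overlapping occurrences as well.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            hit = m.group(1)
            start = m.start() + 1
            matches.append(MotifMatch(motif_class, start, start + len(hit) - 1, hit))
    return matches


def pair_dmi(seq_a, domains_a, seq_b, domains_b):
    """Pair motifs found in one protein with matching domains in the other.

    domains_a / domains_b are sets of domain names assumed to come from a
    separate HMM-based domain scan, which is not implemented here.
    """
    pairs = []
    for motif_seq, partner_domains, orientation in (
        (seq_a, domains_b, "motif in A, domain in B"),
        (seq_b, domains_a, "motif in B, domain in A"),
    ):
        for match in find_motifs(motif_seq):
            if DMI_TYPES[match.motif_class] in partner_domains:
                pairs.append((orientation, match))
    return pairs


toy_matches = find_motifs("AAPLLPAATKV")
toy_pairs = pair_dmi("AAPLLPAATKV", {"Kinase"}, "MGGG", {"SH3"})
```

In this toy input, both motif classes occur in protein A, but only the PxxP-like match is paired because the partner carries an SH3 domain and no PDZ domain. The example also shows why short, degenerate patterns produce many matches that need downstream scoring.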
Among these models, version 4, generated by randomly sampling DMI instances from the entire human interactome, performed best, with an area under the curve (AUC) of 0.93 for both the ROC and precision-recall curves. A score cutoff of 0.7 was established for high-confidence DMI predictions, resulting in a sensitivity of 66.3% and a specificity of 97.2% (Figure 3.1 B). The pipeline outputs the DMI matches along with their scores, with higher scores indicating a greater likelihood of being correct (Figure 3.1 A).

3.1.2 The application of the tool on HuRI PPI dataset

The developed DMI tool was applied to the HuRI dataset to detect PPIs potentially mediated by a DMI interface. Due to the inherent degeneracy of motifs, a large number of DMI matches were found within HuRI PPIs. After applying the high-confidence cutoff on the DMI match score (0.7), 13,406 high-confidence putative DMI interfaces were identified across 3,195 PPIs. Among these interactions, 54% had their top-ranked matches from the ligand (LIG) classes, and almost 20% had them from the modification (MOD) class (Figure 3.1 C).

Figure 3.1: The development of DMI predictor and its application on HuRI. (A) Schematic illustrating the workflow of the developed DMI predictor. The improved list of DMI types and the trained Random Forest (RF) model are incorporated into the DMI detection pipeline. (B) The top panel represents the assembly of the PRS and different versions of the RRS. The middle panel illustrates the annotation of features on the PRS and RRSs. The bottom panel shows the ROC and precision-recall (PR) curves of RF models trained using different versions of the RRS. For each RRS version, ROC and PR curves averaged across the triplicates of the RRS version were plotted by interpolation. The importance of different features to the RF trained using the PRS combined with the RRSv4 is quantified using the mean decrease in impurity.
(C) The developed DMI predictor was applied to PPIs detected in HuRI, and the scores of the predicted DMIs were titrated over increasing cutoffs. The dashed lines refer to the right y-axis, while the solid line refers to the left y-axis. The red vertical line marks the cutoff of 0.7 applied to the DMI scores to call a predicted DMI high-confidence.

3.2 Integrating ClinVar mutation data with putative DMIs mapped on HuRI

The largest mutation database, ClinVar (see Chapter 1, section 1.1), contains a comprehensive set of patient mutation data. My colleague processed the most recent version of this dataset by mapping mutations to proteins and applying several filtering steps. The filtering retained only germline, non-synonymous single nucleotide variants (SNVs) with definitive clinical significance, excluding other variant types such as termination mutations. This resulted in a total of 996,697 variants, of which 45,035 are pathogenic, 73,806 are benign, and 824,374 are variants of unknown significance (VUS). The filtered variants were then overlapped with high-confidence domain-motif interfaces (DMIs) mapped on PPIs, focusing on those where at least one pathogenic or VUS variant falls within a predicted DMI.

The PPI subset was visualized using the network tool Cytoscape. We identified a total of 6,057 potential high-scoring DMIs with at least one pathogenic or VUS mutation falling in the interface (Figure 3.2 A). Because the subset is too large to show details in a single view, I zoomed in on the PPIs of HDAC4 and SPOP as examples. HDAC4 has 6 partners with 6 high-confidence DMI predictions, where 5 partners might mediate the interaction through a LIG motif type interface and one interaction potentially occurs through a DOC motif type interface. SPOP has 6 interactions with 3 DEG and 3 DOC motif type interfaces (Figure 3.2 A).
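The overlap between filtered variants and predicted interfaces described in Section 3.2 can be sketched as a simple interval check. This is an illustration under stated assumptions, not the processing code used for the thesis: the `Variant` and `Interface` records, their field names, and the toy identifiers are hypothetical, and treating the 0.7 cutoff as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    protein: str        # protein identifier (toy values in the example)
    position: int       # 1-based residue position
    significance: str   # "pathogenic", "benign" or "VUS"


@dataclass(frozen=True)
class Interface:
    dmi_id: str   # identifier of the predicted DMI this region belongs to
    protein: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # DMI match score


def interfaces_with_variants(interfaces, variants, score_cutoff=0.7,
                             classes=("pathogenic", "VUS")):
    """Map each high-confidence DMI to the variants falling inside its regions."""
    # Index the variants of interest by protein for fast lookup.
    by_protein = {}
    for v in variants:
        if v.significance in classes:
            by_protein.setdefault(v.protein, []).append(v)

    hits = {}
    for itf in interfaces:
        # Assumption: scores equal to the cutoff count as high-confidence.
        if itf.score < score_cutoff:
            continue
        inside = [v for v in by_protein.get(itf.protein, [])
                  if itf.start <= v.position <= itf.end]
        if inside:
            # A DMI has a motif side and a domain side; collect both.
            hits.setdefault(itf.dmi_id, []).extend(inside)
    return hits


toy_interfaces = [
    Interface("dmi1", "P1", 10, 15, 0.80),    # motif side
    Interface("dmi1", "P2", 100, 180, 0.80),  # domain side
    Interface("dmi2", "P3", 5, 9, 0.65),      # below the cutoff
    Interface("dmi3", "P1", 40, 45, 0.90),    # only a benign variant inside
]
toy_variants = [
    Variant("P1", 12, "VUS"),
    Variant("P1", 42, "benign"),
    Variant("P2", 150, "pathogenic"),
    Variant("P3", 6, "pathogenic"),
]
toy_hits = interfaces_with_variants(toy_interfaces, toy_variants)
```

Only `dmi1` is retained in this toy example: `dmi2` falls below the score cutoff and `dmi3` contains only a benign variant.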
Among the DMIs in this subset, the most common SLiM type is LIG with 2,867 instances, followed by MOD with 1,838 instances and DOC with 881 instances. The least frequent SLiM types are TRG (304 instances), DEG (137 instances), and CLV (30 instances) (Figure 3.2 B).

Figure 3.2: PPI network with predicted DMIs overlapped with ClinVar mutations. (A) PPI network illustrating the mapped predicted high-confidence DMIs with at least one pathogenic or VUS mutation overlapped. The blue nodes represent proteins, and the edges indicate the predicted DMI. The colors represent different SLiM types. (B) The bar plot illustrates the distribution of different SLiM types across the PPI network illustrated in A. Each SLiM type .

3.3 The data-driven approach to select disease-associated proteins and PPIs suitable for the experimental validation of DMIs

To select PPIs suitable for the experimental validation of putative DMIs, I employed a data-driven approach, annotating the PPIs in the subset with experimental features such as the availability of ORF sequences for the corresponding genes, which is important for selecting candidates for experimental work.

For this, I searched our ORFeome collection database for the presence of a clone for each gene. As it is essential to design experiments close to native biological conditions, I selected full-length ORFs. Furthermore, the established pipeline relies on clonal ORFs to achieve a high success rate in cloning and sequence validation. Additionally, I assessed the number and types of mutations present at each interface mapped on PPIs.

Understanding the biological processes regulated by the proteins encoded by these genes is also crucial. To this end, I imported this information from UniProt and annotated the PPIs in the subset. When analyzing the candidates, I also checked how many partners these genes have.
Given this biological information, I manually assessed the validity of the DMI prediction results. Some DMIs, despite having high match scores, did not align with current biological understanding. For example, we predicted an interface (DMI match score 0.741) involving WW domains and the DOC_WW_Pin1_4 motif between WWOX and MYOZ2. Pin1 is a multidomain protein with both a WW domain and a PPIase domain that work together to target specific sequences. The WW domain of Pin1 recognizes phosphorylated S/T-P motifs, while its PPIase activity regulates various cellular processes. However, this prediction might be inaccurate because, although both Pin1 and WWOX contain WW domains and are involved in disease processes, their functions are distinct. Pin1's role as a PPIase with specific substrate targeting and isomerization activity sets it apart from WWOX, which does not perform isomerization but rather functions through protein interactions. The possibility that a highly scored DMI might still be incorrect highlights the need for further refinement of the tool and underscores the importance of experimental validation, which is the next step in our proposed strategy.

As a result, I selected 31 annotated gene candidates. I applied the same approach and selected 105 gene partners. The selected candidates and partners form the network of 117 protein-protein interactions illustrated in Figure 3.3 A. In this PPI network, 86 PPIs are mapped with 88 predicted domain-motif interfaces, and 27 PPIs (found in HuRI) mediated by 31 known DMIs previously studied and annotated in the ELM database (see Chapter 1, section 1.2) serve as positive controls for DMI validation. Since some candidates only had predicted interfaces, I included 4 additional partners whose interactions (not found in HuRI) are mediated by known DMIs reported in the literature.
Additionally, I included 78 partners that interact with the candidates via different interfaces, which serve as negative controls.

3.3.1 Retesting of PPIs using the BRET assay

We first cloned and sequence-verified the selected candidates to confirm protein expression after transfection into mammalian cells. Only for successfully cloned and expressed candidates did we then clone the interacting partners, which made the cloning step more efficient. Our prior experience with the BRET assay indicated that proteins are better expressed when fused to NanoLuc (NL) luciferase at the N-terminus. Therefore, the candidates were genetically fused to the NL tag. Using the established cloning pipeline from Aim 1 (see Chapter 1, section 1.5), I successfully cloned ORFs for 19 candidate proteins. For the failed candidates, a second round of cloning was attempted. However, 3 ORFs yielded no growth in inoculation cultures, while sequencing of the remaining 8 showed either empty vectors or incorrect ORFs, suggesting cross-contamination. These results also show that the cloning step, particularly the manual picking of colonies, might lead to false-positive results.

For these successfully cloned ORFs, 114 partner ORFs were fused N-terminally to mCitrine, and 96 ORFs were successfully cloned, resulting in 96 PPIs available for detection in the BRET assay. We obtained significant BRET signals for 46 (48%) of these 96 PPIs with proteins expressed above the cutoff (Figure 3.3 B). This retest rate is notably higher than the retest rates of gold-standard PPI datasets used in past benchmarking of various binary PPI assays, including this BRET assay, highlighting the overall detectability of PPIs from HuRI (Trepte et al. 2018; Braun et al. 2009; Choi et al. 2019).

Among these 46 PPIs, we selected 23 interactions involving 6 candidates (CTBP1, WWOX, PPP3CA, REPS1, SPOP, and IQCB1) for validating the predicted interfaces (Figure 3.3 B). The remaining 24 PPIs were not selected for further analysis because they involved 8 candidates with incomplete data, either missing known DMIs or consisting only of negative controls. For instance, PUF60 was detected only with PPIs mediated by known DMIs or through different interfaces.

Figure 3.3: Experimental validation of predicted DMIs on PPIs. (A) PPI network illustrating selected DMI predictions and experimental retesting in the BRET assay. (B) cBRET, total luminescence and fluorescence for 96 PPIs, where 31 PPIs have putative DMIs. Luminescence and fluorescence measurements indicate NL and mCit fusion protein expression levels, respectively. Black horizontal lines indicate expression level and PPI detection cutoffs. The gray vertical line separates the detected (left) from the undetected PPIs. Protein pairs in bold indicate those selected for interface validation via site-directed mutagenesis. Error bars indicate the standard deviation of three technical replicates.

To design mutants for the experimental DMI validation as well as patient variants (see section 3.4), I used AF-MM structures of the predicted interfaces generated by my colleague. I visualized the predicted structures with the protein structure visualization tool PyMOL to guide the design. We manually designed single point mutations at potential motif and domain sites of interacting protein pairs, along with deletions of motifs or regions, resulting in 2-4 mutations per motif and domain.
In total, we designed 55 mutations that fall into the predicted DMIs and likely disrupt the interaction.

Next, I cloned the designed mutations using site-directed mutagenesis adapted to medium throughput (see Appendix 5.1.2) and successfully cloned 44 mutants, 18 for domain and 27 mutants for motif validation. The expression of the mutated proteins was tested and compared to that of the wild-type proteins (see Appendix, Figure 5.1). Mutants with low expression (e.g. the motif deletion of LITAF) might interact less or not at all with the partner protein, potentially leading to false-negative results. Consequently, these low-expressing mutants were excluded from further validation.

The successfully cloned and expressed protein candidates and their partners were then used for DMI validation with the BRET saturation assay (Trepte et al. 2018; Lee et al. 2024). In this assay, I performed a donor saturation experiment with the wild-type and mutated constructs, in which a fixed amount of NL-candidate ORF construct (1 or 2 ng), encoding the NL-fused protein, was co-transfected with increasing amounts of mCit-partner ORF construct (12.5, 25, 50, 100, 200 ng), encoding the mCitrine-fused protein, giving 6 measuring points in total. With an increasing concentration of acceptor protein, the BRET signal should increase until it reaches a saturation value called maximum BRET. This saturated BRET value is reached when all donor molecules interact with acceptor molecules.

3.3.2 Testing the localization of the wild-type proteins and mutants using Bioluminescence Imaging

As mentioned in section 1.5, the disruption of a protein-protein interaction may result from mislocalization of the mutant rather than from the effect of the mutation on the interaction itself. One of the advantages of the BRET assay is that the tags for interaction testing can also be used to monitor protein location within the cell, and the BRET signal can even be visualized in live cells via bioluminescence imaging, BLI for short (Goyet et al.
2016; Kobayashi et al. 2019). It was also shown that BLI can be scaled up using high-content screening (HCS) microscopy (J. Kim et al. 2016). Thus, with the support of the microscopy core facility at IMB, we performed BLI in a 96-well plate format on an HCS microscope, the Opera Phenix. To do this, we selected some of the mutants for DMI validation (TGIF1_24_28del, DMRTB1_21_25del, CPSF6_323_327del, FAM167A_3_9del) as well as patient variants (DMRTB1_R25H, WWOX_H37D, LITAF_Y61D, FAM167A_V8M) that affected the binding compared to the wild-type (see subsection 3.3.3). The selected mutants and variants, paired with wild-type partners at a ratio of 10:10 ng, were transfected into pre-seeded U2OS cells in a 96-well plate using Fugene as the transfection agent. After transfection, cells were incubated for 24 hours. The following day, DRAQ5 and CellMask dyes were applied to stain the nucleus and cytoplasm, respectively (data not shown), and the cells were imaged immediately using the Opera Phenix system. First, fluorescence was imaged in each well. To detect luminescence, furimazine substrate (from the Nano-Glo kit) was then added to the wells; its oxidation by NanoLuc luciferase produces the luminescence signal.

Below, I will first discuss the results of validating the predicted interfaces together with the microscopy data. For the negative controls, which lack resolved structures, we employed the AF-MM fragmentation approach (see Chapter 2, Article II) to predict potential interfaces. This method helps us infer interaction sites in the absence of structural data, providing insights into the validity of our predictions and the reliability of the negative controls.

3.3.3 Validation of DMI predictions

Experimental validation of interfaces involving CTBP1 interactions

CTBP1 is a transcriptional co-repressor. Unlike many transcription factors, CTBP1 does not directly bind DNA (Filograna et al.
2024; Valente et al. 2013). Instead, it interacts with transcription factors through a hydrophobic cleft in its substrate-binding domain, which recognizes the PxDLS motif. This cleft is crucial for recruiting other corepressor components such as histone deacetylases (HDACs), methyltransferases (HMTases), and additional transcriptional repressors necessary for its repressor activity (Filograna et al. 2024; Valente et al. 2013).

For the CTBP1 candidate, we cloned partners predicted or known to bind via the same interface of the LIG_CtBP_PxDLS_1 class: TGIF1 and IKZF1 with known DMIs and DMRTB1 with a predicted interface. CTBP2 was included as a negative control, as this interaction likely occurs through a domain-domain interface.

CTBP1-TGIF1

CTBP1 binds to the PLDLS motif of the transcription factor TGIF1 (Figure 3.4 A). This interface has been functionally studied and annotated in ELM (Melhuish 2000), but no crystal structure is available. The predicted AF-MM structure, with a high confidence score of 0.8, suggests that the proline at position 24 of TGIF1 fits well into the hydrophobic pocket of CTBP1. Furthermore, two leucines contribute to beta augmentation, allowing the side chains of the motif to enter a deep hydrophobic groove. In addition, a negatively charged aspartate is in proximity to a phenylalanine, a non-polar hydrophobic residue (Figure 3.4 B). This suggests that the aromatic ring of the phenylalanine might be involved in a pi-stacking interaction that stabilizes the interface.

I mutated residues A41 and C27 in the CTBP1 binding pocket, which interact directly with the motif, as well as residue K54 (K54A), which is located away from the motif and likely will not affect the binding (Figure 3.4 B). Additionally, we deleted the motif from TGIF1 to potentially disrupt the interaction completely (Figure 3.4 A).
The BRET data showed that mutations A41D, C27E, and C27D in CTBP1 completely disrupted the interaction with TGIF1 (Figure 3.4 E), whereas the K54A mutation did not disrupt the binding. The deletion of the motif in TGIF1 led to the loss of the interaction. The expression data is shown in the Appendix (Figure 5.2 A). The microscopy data suggests that the mutant with the removed motif is localized similarly to the wild-type (Figure 3.5).

Figure 3.4: DMIs mediating CTBP1-centric PPIs. (A) Schematic illustration of CTBP1 interactions mediated by LIG-type motifs (green) in predicted (arrow end pointing to the motif) and known (half-circle end pointing to the motif) DMIs, and of the negative control interaction (gray) mediated by a different interface, a DDI. (B-D) AF-MM-predicted interface structures of CTBP1 with TGIF1 (B, known DMI), IKZF1 (C, known DMI), and DMRTB1 (D, predicted DMI). The interacting CTBP1 domain (gray) is shown with the residues mutated for domain validation highlighted (blue). Motifs (green) and flanking regions (white) are indicated for each interaction. (E-G) Experimental confirmation of known DMIs (E, CTBP1-TGIF1; F, CTBP1-IKZF1) and validation of the putative DMI (G, CTBP1-DMRTB1) using saturation assays, with BRET measured as a function of the acceptor/donor expression ratio. The left panels show saturation curves for wild-type CTBP1 and single-point mutants (A41D, C27E, C27D, K54A) for domain validation. The right panels display binding curves for the wild-type partner proteins (TGIF1, IKZF1, and DMRTB1) and their mutants with deleted motifs. (H-I) Predicted structure of the negative control PPI using the AF-MM fragmentation approach (H), where the interacting domains of CTBP1 (gray) and CTBP2 (dark gray) are shown, with the CTBP1 residues mutated for domain validation in blue, and experimental testing of whether the CTBP1 domain is part of the DDI interface between CTBP1 and CTBP2 (I).
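To make the saturation readout of section 3.3.1 and panels E-G concrete, the following sketch estimates a maximum BRET value and the acceptor/donor ratio at half-maximal BRET from a titration series. It is a hedged illustration, not the analysis used for the figures: it assumes a one-site hyperbolic model, BRET = BRETmax * r / (BRET50 + r), and fits it with a Hanes-Woolf linearization (r / BRET plotted against r) in plain Python. The function name `fit_saturation`, the symbol `BRET50`, and all numbers are synthetic and hypothetical.

```python
def fit_saturation(ratios, bret):
    """Fit BRET = bret_max * r / (bret50 + r) via Hanes-Woolf linearization.

    Rearranging the model gives r / BRET = r / bret_max + bret50 / bret_max,
    a straight line in r. Requires positive BRET values and a nonzero slope.
    """
    xs = list(ratios)
    ys = [r / b for r, b in zip(ratios, bret)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # equals 1 / bret_max
    intercept = mean_y - slope * mean_x  # equals bret50 / bret_max
    return 1.0 / slope, intercept / slope


# Synthetic, noise-free example (all values hypothetical).
true_max, true_50 = 0.5, 20.0
ratios = [12.5, 25.0, 50.0, 100.0, 200.0]  # stand-in acceptor/donor ratios
signal = [true_max * r / (true_50 + r) for r in ratios]

bret_max, bret50 = fit_saturation(ratios, signal)
```

Under this model, an interface-disrupting mutant would be expected to show a lower fitted maximum, a higher half-saturation ratio, or a curve that no longer saturates, whereas a neutral mutant should give parameters close to the wild-type. With noisy data, a nonlinear least-squares fit would usually be preferable to the linearization used here.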
CTBP1-IKZF1

Another notable interaction involves CTBP1 and the PEDLS motif of the transcription factor IKZF1 (Figure 3.4 A). IKZF1 is a DNA-binding protein that regulates transcription through association with HDAC-dependent and HDAC-independent complexes. A previous study tested whether the conserved PEDLS motif in IKZF1 is crucial for this interaction by creating mutations that either deleted this sequence or altered its core amino acids (Koipally 2000). The mutated IKZF1 proteins failed to bind CTBP1, confirming the importance of the PEDLS motif for their interaction. Similar to TGIF1, the motif was predicted to fit the cleft of the CTBP1 domain (Figure 3.4 C). However, the negatively charged glutamate of the IKZF1 motif and the positively charged lysine of CTBP1 are predicted to form an electrostatic interaction. Therefore, we expect that K54A might slightly affect the binding.

Validation with the same CTBP1 mutants and deletion of the IKZF1 motif showed only partial disruption of the interaction. This partial perturbation suggests that while the PEDLS motif is crucial, other factors may also contribute to the binding stability (Figure 3.4 F). For example, IKZF1 contains zinc finger domains essential for DNA binding and dimerization (Figure 3.4 A), and IKZF1 might still interact with CTBP1 through these domains. This hypothesis needs to be tested with additional downstream experiments such as mutation of the zinc finger domains of IKZF1. The obtained microscopy data indicate that the localization of the IKZF1 mutant with the removed motif is similar to that of the wild-type protein (Figure 3.5 B).

[Figure 3.5 panels: (A) NL-CTBP1 with mCit-TGIF1 and mCit-TGIF1_153_157del; (B) NL-CTBP1 with mCit-IKZF1 and mCit-IKZF1_34_38del; (C) NL-CTBP1 with mCit-DMRTB1 and mCit-DMRTB1_174_178del. Columns: NL-CTBP1, mCit-partner, merged.]

Figure 3.5: The localization of wild-type and mutants.
Bright-field microscopy image of U2OS cells showing luminescence (magenta) indicating the presence of NL-CTBP1 and fluorescence intensity (cyan) of mCit-TGIF1. The images depict the localization of the wild-type proteins (top panel) and the mutant with the removed motif (bottom panel) relative to the wild-type. Scale bar = 10 µm.

CTBP1-DMRTB1

The same domain of CTBP1 was predicted to bind a PLDLR motif of DMRTB1 with a high score of 0.881. An instance of this motif is annotated in ELM, but in murine HDAC9; the potential interface between CTBP1 and DMRTB1 has not been reported so far (Figure 3.4 A). The AF-MM structure also looks very promising. The PLDL part of the motif fits the hydrophobic pocket of CTBP1 similarly to the known SLiM instances mentioned earlier, while the positively charged arginine residue of the motif and a glutamate on the domain form salt bridges that contribute to the stabilization of the interaction (Figure 3.4 D). The BRET data supports the prediction, with domain mutations significantly reducing the binding and the deletion of the motif leading to the loss of interaction (Figure 3.4 G). We also showed that the DMRTB1 mutant is localized similarly to the wild-type protein (Figure 3.5 C), while the expression data shows that the expression of the DMRTB1 mutant is slightly higher than that of the wild-type (see Appendix, Figure 5.2 A).

CTBP1-CTBP2

As a negative control, we confirmed the PPI of CTBP1 with CTBP2. Although the CTBP1 and CTBP2 proteins share 78% amino acid identity and 83% similarity, there are slight differences in their sequences that contribute to their distinct functions (Ding et al. 2020). For example, CTBP2 has a nuclear localization signal at the N-terminus but lacks a PDZ-binding domain. Previously, it was demonstrated that both CTBP1 and CTBP2 contain an NADH-dependent homo- and hetero-dimerization domain, which facilitates dimerization in response to increased NADH levels (Figure 3.4 A). This dimerization further promotes the nuclear retention of CTBP1.

Currently, there is no resolved structure available for the interaction interface of CTBP1 and CTBP2. To address this, we employed a fragmentation approach using AF-MM. The AF-MM prediction was consistent with previous studies, suggesting that the domains of CTBP1 interact with CTBP2 to form a dimer (Figure 3.4 H). The BRET assay indicated that single-point mutants in the PxDLS-binding cleft do not disturb the binding, suggesting that this cleft is not essential for the dimerization of CTBP1 and CTBP2, which is also in line with the predicted structural model (Figure 3.4 I).

Experimental validation of interfaces involving WWOX interactions

WWOX is a putative oxidoreductase: it has two WW domains (WW-1 and WW-2) that mediate many interactions, an NLS, and an SDR (short-chain dehydrogenase/reductase) domain involved in metabolism. The DMI predictor tool predicted that WWOX binds via the same interface, involving the LIG_WW_1 and LIG_WW_3 SLiM classes, to LITAF (two known DMIs), CPSF6 (three interfaces), and DAZAP2 (one DMI). Additionally, the negative control partners HOXA1, CSNK2B and SNRPC were used.

WWOX-LITAF

LITAF plays a role in endosomal protein trafficking and targets proteins for lysosomal degradation (Lee et al. 2011). It consists of two short PPSY motifs at the N-terminus and an SLD domain with a hydrophobic cysteine-rich core region anchored to the membrane of the lysosome (Figure 3.6 A). Previously, it was found that the WW-1 domain binds specifically to the PPSY motifs in LITAF, whereas the WW-2 domain does not (Ludes-Meyers et al. 2004).

Using AF-MM, we predicted a high-confidence (0.8) structural model of the interface between the first motif (20-23) and the tandem WW domains. The structure suggests that this motif is recognized by WW-1 (Figure 3.6 B (i)). We also predicted the structure (with the same confidence score) of the second known interface between the second PPSY motif (58-61) and the tandem WW domains.
Similarly, the second motif prefers interaction with the WW-1 domain (Figure 3.6 B (ii)). The proline and tyrosine residues of the motif fit into the pocket of WW-1 containing a tryptophan and a tyrosine (Figure 3.6 B). The prediction indicates that both motifs might interact with the WW-1 domain and that they bind in the same manner, suggesting multivalency, where multiple interactions occur between motifs that are identical in sequence and one domain.

To confirm these interfaces, we designed mutations of residues in the WW-1 domain and in the motifs (Figure 3.6 B). In addition, I generated deletions of each motif separately and of the N-terminal part. However, their expression levels were below the threshold (see Appendix, Figure 5.2 B), and we excluded these deletions from further study.

Experimental validation showed partial disruption of binding with mutations Y33H, Y33D, and W44K in WW-1, while mutations of prolines and tyrosines in the motifs had varying effects. Replacement of the tyrosines in the motifs with aspartate completely disrupted binding (Figure 3.6 F). In contrast, Ludes-Meyers et al. (2004) demonstrated that mutating tyrosine to alanine in the first motif significantly reduces binding, while mutating tyrosine to alanine in the second motif does not affect binding. Alanine is a small, non-polar amino acid that lacks the aromatic side chain of tyrosine. The loss of this aromatic interaction might significantly reduce but not eliminate the binding, as observed in the study by Ludes-Meyers et al.

Figure 3.6: DMIs mediating WWOX-centric PPIs. (A) Schematic illustration of PPIs mediated by DMIs. The edge ending points towards the predicted motif, where an arrow indicates a predicted DMI and a half-circle a known DMI. Gray indicates an interaction mediated by a different interface, where the question mark means that this interface was predicted by AF-MM using the fragmentation approach.
(Bi) Predicted interface structure of the WW-1 domain with the first PPSY motif in the WWOX-LITAF interaction. The structure highlights mutated residues on the domain (in blue) and on the motif (in green), with arrows pointing to these residues. (Bii) Predicted interface structure of the WW-1 domain with the second PPSY motif in the WWOX-LITAF interaction. (Cii) Predicted structure illustrating the second motif of CPSF6 and the tandem WW domains as shown in the scheme in A. (Ciii) Predicted interface structure of the WW-1 domain with the third motif in the WWOX-CPSF6 interaction. (D) The putative model of the motif of DAZAP2 and the tandem WW domains. (E) Predicted interface of the negative control PPI. (F) Experimental confirmation of the known DMIs of WWOX-LITAF using the BRET saturation assay. (G) Experimental validation of the predicted DMIs of WWOX-CPSF6 using the BRET saturation assay. (H) Experimental validation of the putative DMI of WWOX-DAZAP2 using the BRET saturation assay. (I) Experimental testing of whether the domain is involved in the interface of the negative control.

WWOX-CPSF6

DMI predictions identified three potential interfaces between WWOX and CPSF6: one with the PPPY motif and two with the TPPRP and FPPRP motifs located at the C-terminus of CPSF6 (Figure 3.6 A). One study identified several novel interactions involving WW domains using mass spectrometry and found that CPSF6 is associated with the WW-1 domain of WWOX. The authors further investigated whether specific proline-based peptide motifs are present in proteins bound by WW domains and found that CPSF6 contains a PPPY motif as a potential interface between WWOX and CPSF6. However, this interface was not validated in that study (Ingham et al. 2005).

AF-MM predictions indicated that the FPPRP motif binds to the WW-2 domain (Figure 3.6 C (ii)), while the PPPY motif is positioned between the two WW domains (Figure 3.6 C (iii)). BRET experiments showed that single-point mutations of residues in the WW-1 domain significantly disrupted the binding (Figure 3.6 G, right panel). Moreover, the mutant with the removed FPPRP motif of CPSF6 partially disrupted the binding, similar to the effect of the mutant with the deleted third motif, PPPY (Figure 3.6 G, left panel). Given our predictions and experimental data, one might speculate that a mutant with all motifs removed might completely disrupt the binding.

WWOX-DAZAP2

The same interface of the LIG_WW_1 class was predicted by the DMI predictor tool, with a DMI match score of 0.9, for the tandem WW domains of WWOX and the N-terminal PPAY motif of DAZAP2 (Figure 3.6 A). The AF-MM model of the interface proposes that the PPAY motif fits well into the hydrophobic groove formed on the WW-1 domain. In the WW-1 domain, the tryptophan residue (W44) and the tyrosine residue (Y33) are positioned in a way that allows aromatic stacking with the proline residues of the PPAY motif (Figure 3.6 D). The involvement of the WW-1 domain in the predicted interface was experimentally validated, with domain mutants showing reduced binding (Figure 3.6 H, left panel). The deletion of the predicted motif of DAZAP2 slightly affected the interaction with WWOX (Figure 3.6 H, right panel).

WWOX-HOXA1

I used HOXA1 as a negative control, assuming the interaction is mediated via a different interface (Figure 3.6 A). The DMI predictor tool returned no DMI prediction for this interaction. HOXA1 does not contain a PPxY (where x represents any amino acid), PPLP or xPPRx motif recognized by WW domains. Using the fragmentation approach, my colleague predicted the potential interface. The WW tandem domain of WWOX is modeled with the disordered region 294-302 of HOXA1 with moderate confidence (pLDDT 73) (Figure 3.6 E). Here, residues 294-300 (PISPATP) of HOXA1 match the regular expression of LIG_SH3_3, a motif class that binds to SH3 domains. According to the predicted structure, two prolines at the C-terminus of the peptide stack nicely with aromatic side chains (W and Y) of the WW domains in a similar way as motifs of the LIG_WW_1 class. However, BRET experiments showed that WW-1 mutants did not change the binding to HOXA1, meaning that this domain might not be involved in the interaction (Figure 3.6 K). This data also shows the limited specificity of AF-MM.

WWOX-SNRPC

Another negative control is the interaction of WWOX with its partner SNRPC, which was detected in BRET (see Appendix). However, we had a DMI prediction that scored slightly below the cutoff, in which the LIG_WW_3 motif within the proline-rich region of SNRPC was predicted to bind to the WW-1 domain. The titration studies showed that the domain mutants, the deletion of the potential motif, and the deletion of the whole proline-rich region all left the interaction intact. Under these conditions, we could not validate this interface prediction (see Appendix, Figure 5.3 C & D). We also tried the fragmentation approach using AF-MM, where the only promising prediction that passed the cutoff is an ordered-ordered pair. The prediction involves the Zn finger of SNRPC and the C-terminal SDR domain of WWOX (see Appendix, Figure 5.3 A), but we did not test this prediction experimentally. Given these findings, the interface prediction returned by the DMI predictor, suggesting that the GPPRP motif of the LIG_WW_3 class binds to the WW domains, is likely wrong. This motif is recognized by group III WW domains, whereas WWOX contains WW domains from group I. Our structural data also showed that the predicted .PPR. motifs of the LIG_WW_3 class are positioned away from the binding groove. These data also point to the inability of the DMI predictor to discriminate domain preferences within a domain class.

WWOX-CSNK2B

This PPI was also annotated as a negative control. Similar to HOXA1, CSNK2B does not have proline-rich stretches, and we did not have any DMI predictions. No AF-MM prediction was done.
BRET signals of the mutant did not differ from the wild-type protein titration results (see Appendix, Figure 5.3 E & F).

Experimental validation of interfaces involving IQCB1 interactions

IQCB1 contains a tyrosine phosphorylation site, a coiled-coil region, and three helical calmodulin-binding motifs. The calmodulin-binding motif is a ligand-type motif with the consensus [I,L,V]QxxxRGxxx[R,K], characterized by a hydrophobic residue at position 1, a highly conserved glutamine at position 2, basic charges at positions 6 and 11, and a variable glycine at position 7. Two of these motifs (321-336, 391-407) are known and annotated in the ELM database (X. Luo et al. 2005), whereas the third motif (298-314) was predicted by the DMI predictor (Figure 3.7 A). The DMI tool predicted that these motifs interact with the EF-hand repeat domains of the CALM1 and CALML3 proteins.

IQCB1-CALM1

The DMI predictor gave a high DMI match score of 0.9 and found these motifs potentially interacting with the tandem EF-hand domains of CALM1. Upon binding of four Ca2+ ions through its EF-hands, CALM1 changes its conformation from a closed form to an open one, exposing a hydrophobic surface capable of interacting with different target proteins.

AF-MM predicted the interface of the third motif and the tandem EF-hand domains (Figure 3.7 B). The predicted model suggests that the IQCB1 motif is tightly wrapped and embedded within the binding pocket of CALM1. Validation experiments were limited because the IQCB1 clone corresponds to the non-canonical isoform 2, which lacks the complete first and second motifs. We had one successful mutant for domain validation, E120H, which is predicted to be located away from the IQCB1 motif (Figure 3.7 B (iiia)). However, experimental data showed that E120H likely did not affect the binding with IQCB1.
In contrast, the deletion of the motif partially reduced the interaction (Figure 3.7 E), suggesting that while the motif is important, other factors or regions may also play a role in maintaining the overall interaction between the proteins.

IQCB1-CALML3

Similar to the DMI mediating the IQCB1-CALM1 interaction, it was predicted that the EF-hand domain-containing CALML3 likely recognizes the same motifs of IQCB1 (Figure 3.7 A). The AF-MM prediction showed a similar outcome (Figure 3.7 C). The BRET data supports the prediction, showing that E85K and the deletion of the motif weakened the binding relative to the wild-type protein pair (Figure 3.7 F).

Figure 3.7: DMIs mediating IQCB1-centric PPIs. (A) Schematic illustration of PPIs mediated by DMIs. The edge ending points towards the predicted motif, where an arrow indicates a predicted DMI and a half-circle a known DMI. Gray indicates an interaction mediated by a different interface, where the question mark means that this interface was predicted by AF-MM using the fragmentation approach. (Biii a&b) Predicted interface structure of the known DMI, the tandem EF-hand domain in contact with the third motif in the IQCB1-CALM1 interaction. The structure highlights mutated residues on the domain (in blue) and on the motif (in green), with arrows pointing to these residues. (Ciii a&b) Predicted interface structure of the known DMI, the tandem EF-hand domain in contact with the third motif in the IQCB1-CALML3 interaction. The structure highlights mutated residues on the domain (in blue) and on the motif (in green), with arrows pointing to these residues. (D) Predicted novel interface of the negative control PPI using the AF-MM fragmentation approach. (E) Experimental confirmation of known DMIs of IQCB1-CALM1 using the BRET saturation assay. (F) Experimental confirmation of known DMIs of IQCB1-CALML3 using the BRET saturation assay.
(G) Experimental testing of whether the third motif is involved in the interface of the negative control.

IQCB1-MNS1

One of the negative controls is the interaction of IQCB1 with MNS1. MNS1 has no folded globular region; its monomeric structure shows that the protein is composed of long helices. The DMI predictor tool returned no putative DMI. To verify that the motif is not part of the interface of the interaction with MNS1, I tested the IQCB1 mutant with the removed motif in a pair with wild-type MNS1. Interestingly, BRET data showed that the deletion of the motif caused an increase in BRET.

Using the AF-MM fragmentation approach, the predicted model suggests that helices of IQCB1 potentially bind to the C-terminal disordered region of MNS1, residues 292-332 (Figure 3.7 A and D). Despite a very high predictive score (0.89), manual inspection of the predicted interfaces of fragments from the same region shows that AF places the fragments at different sites (not shown). Therefore, this putative interface might be wrong. The expression data is shown in Appendix 5.4.

Experimental validation of interfaces involving PPP3CA interactions

PPP3CA is a phosphatase of type PP3 (formerly named calcineurin or PP2B) that recognizes its substrates via DOC_PP2B motifs. There are three catalytic subunits (PPP3CA, PPP3CB, PPP3CC) and two regulatory subunits (PPP3R1, PPP3R2). Upon an increase in Ca2+ levels, a complex forms that is composed of calcineurin A (the catalytic subunit, which is dependent on calmodulin) and a regulatory Ca2+-binding subunit (calcineurin B).

PPP3CA-FAM167A

The motif of the DOC_PP2B_PxIxI_1 class in FAM167A (residues 3-9) was predicted by our DMI predictor tool to bind to the calcineurin (Metallophos) domain of PPP3CA (Figure 3.8 A). The putative AF-MM structure predicts that the potential motif forms contacts along the edge of two beta sheets in the calcineurin domain of PPP3CA (Figure 3.8 C).
The results showed that mutations in the domain reduced binding, while deletion of the FAM167A motif completely disrupted the interaction; the expression of the wild-type and mutant proteins was above the cutoff (Figure 3.8 E). Taken together, these results suggest that FAM167A might be a substrate of PPP3CA.

PPP3CA-PPP3R2

The interaction of PPP3CA with PPP3R2 is mediated by a different interface. PPP3R2 is the regulatory subunit that binds calcium ions and modulates the activity of PPP3CA in response to changes in intracellular calcium levels. PPP3R2 contains EF-hand domains. When intracellular calcium concentrations rise, calcium binds to these domain repeats in PPP3R2, inducing conformational changes that activate PPP3CA (Figure 3.8 D). This PPI serves as a negative control in this study (Figure 3.8 A). The single mutations in the PPP3CA domain did not affect the BRET signal, suggesting that these residues might not contact PPP3R2 (Figure 3.8 F). Microscopy data suggest that deletion of the motif did not change localization compared to the wild-type (Figure 3.8 B).

Figure 3.8: DMIs mediating PPP3CA-centric PPIs. (A) Schematic illustration of PPIs mediated by DMIs. Each edge end points towards the predicted motif: an arrow denotes a predicted DMI, a half-circle denotes a known DMI, and gray indicates an interaction mediated by a different interface; a question mark means that the interface was predicted by AF-MM using the fragmentation approach. (B) Localization of wild-type and mutant proteins. Bright-field microscopy image of U2OS cells showing luminescence (magenta), indicating the presence of NL-PPP3CA, and fluorescence intensity (cyan) of mCit-FAM167A. The images depict the localization of the wild-type proteins (top panel) and of the mutant with the motif removed (bottom panel) relative to the wild-type. Scale bar = 10 µm.
(C) Predicted structure of the interface of the predicted DMI, with the tandem Metallophos domain in contact with the motif in the PPP3CA-FAM167A interaction. Mutated residues are highlighted on the domain (blue) and on the motif (green), with arrows pointing to these residues. (D) Predicted known interface of the negative control PPI using the AF-MM fragmentation approach. (E) Experimental validation of the putative DMI of PPP3CA-FAM167A using the BRET saturation assay. (F) Experimental testing of whether the domain is involved in the interface of the negative control.

Experimental validation of interfaces involving SPOP interactions

SPOP is a component of the RING-based BCR (BTB-CUL3-RBX1) E3 ubiquitin-protein ligase complex, which mediates ubiquitination of targeted proteins, leading to their proteasomal degradation. It contains two globular domains, MATH and BTB. The cullin E3 ligase binds to the BTB domain, while the MATH domain directly recruits the substrates of the E3 ligase complex for ubiquitination. In complex with Cul3, binding of SPOP to the motif leads to proteasomal degradation of the substrate.

SPOP-RXRB

The DMI tool predicted that the MATH domain might bind to two motifs of the DEG_SPOP_SBC_1 class in the N-terminal region of RXRB, with a DMI match score of 0.899. RXRB also contains four Zn finger repeats and an LBD domain (Figure 3.9 A). No solved structure of this interface is available. The AF-MM model suggests that the SPOP-RXRB interface is promising, with the motif docking into a hydrophobic cleft on the SPOP domain (Figure 3.9 C). The BRET experiments involved testing four mutants of SPOP: the core mutants G132Q and F102V significantly reduced binding, whereas the edge mutants S119L and R70T did not affect the interaction (Figure 3.9 H).
Interestingly, both deletion of the first motif and deletion of both motifs resulted in BRET signals similar to those of the wild-type interaction, indicating that the interaction remained intact (Figure 3.9 H). These findings suggest that the predicted motifs of RXRB could not be verified by motif deletion and that the prediction of this interface is likely to be wrong.

Figure 3.9: DMIs mediating SPOP PPIs. (A) Schematic illustration of PPIs mediated by DMIs. Each edge end points towards the predicted motif: an arrow denotes a predicted DMI, a half-circle denotes a known DMI, and gray indicates an interaction mediated by a different interface; a question mark means that the interface was predicted by AF-MM using the fragmentation approach. (Bi) Predicted structure of the interface of the predicted DMI, with the domain in contact with the first motif in the SPOP-RXRB interaction. Mutated residues are highlighted on the domain (blue) and on the motif (green), with arrows pointing to these residues. (Biii) Predicted structure of the interface of the predicted DMI, with the domain in contact with the second motif in the SPOP-RXRB interaction. Mutated residues are highlighted on the domain (blue) and on the motif (green), with arrows pointing to these residues. (C) Predicted novel interface of the negative control PPI using the AF-MM fragmentation approach. (D) Experimental validation of the putative DMIs of SPOP-RXRB using the BRET saturation assay. (E) Experimental testing of whether the domain and motif are involved in the interface of the negative control.

SPOP-MYD88

Another interaction partner of SPOP is MYD88, which has two globular domains, Death and TIR. A SLiM of the DEG_SPOP_SBC_1 class was detected in region 12-19 (APVSSTSS) of MYD88, with a DMIMatchScore of 0.55. This DMIMatchScore is below the cutoff of 0.7 that we set; therefore, this PPI is treated as a negative control (Figure 3.9 A).
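The motif detection and cutoff decision described for MYD88 can be sketched in a few lines of Python. This is only an illustration, not the DMI predictor itself: the regular expression is a simplified pattern assumed here for the SPOP-binding consensus (the exact ELM expression for DEG_SPOP_SBC_1 should be checked), the function names are hypothetical, and treating a score equal to the cutoff as passing is an assumption.

```python
import re

# Assumption: simplified pattern for an SPOP-binding consensus; the exact
# ELM regular expression for DEG_SPOP_SBC_1 should be checked before use.
SPOP_SBC_PATTERN = r"[AVP].[ST][ST][ST]"
SCORE_CUTOFF = 0.7  # DMIMatchScore cutoff stated in the text


def scan_motif(peptide, first_residue, pattern=SPOP_SBC_PATTERN):
    """Return (start, end, match) tuples in protein numbering, overlaps included."""
    lookahead = re.compile(f"(?=({pattern}))")  # zero-width, so overlaps are kept
    hits = []
    for m in lookahead.finditer(peptide):
        start = first_residue + m.start()
        hits.append((start, start + len(m.group(1)) - 1, m.group(1)))
    return hits


def classify(score, cutoff=SCORE_CUTOFF):
    """Label a prediction by its DMIMatchScore (>= cutoff is an assumption)."""
    return "putative DMI" if score >= cutoff else "negative control"


# Region 12-19 of MYD88 and its score, as given in the text.
hits = scan_motif("APVSSTSS", first_residue=12)
label = classify(0.55)
print(hits, label)
```

With this assumed pattern, every hit falls inside the reported region 12-19, and the score of 0.55 places the pair among the negative controls.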
Overlapping fragments that cover the core binding motif are also repeatedly modeled at the same interface with high confidence, making the interface very likely to be true. The core motif is likely 13-16 PVSS. Considering the biological function of the proteins, we hypothesized that this interface might be true. Therefore, we tested the previously mentioned mutants for domain validation, and the core mutants significantly perturbed the interaction. In addition, we removed the motif and the N-terminal part of MYD88 and obtained unexpected findings. The BRET experiments show that deletion of the N-terminal part led to a lower BRET signal but enhanced the binding affinity (Figure 3.9 I).

Based on our observations, we hypothesize that deletion of the N-terminal part of MYD88 might have changed the spatial arrangement of the proteins, increasing the distance between the donor and acceptor tags or altering their orientation. An increased distance between the tags is indicated by a lower BRET signal. At the same time, this deletion might increase the accessibility of the binding site for SPOP, leading to enhanced binding affinity. I later found that a previous study had reported, based on co-IP and ubiquitination assays, that a MyD88-VSSTS mutant still binds to SPOP and can be ubiquitinated by SPOP at levels comparable with those of wild-type MyD88. Moreover, that study reported that an SBC-like motif (146-VDSSV-150) located in the middle of MyD88 is indispensable for the MyD88-SPOP interaction and for SPOP-dependent ubiquitination (Li et al. 2020).

Experimental validation of interfaces involving REPS1 interactions

REPS1-NUMB

REPS1 contains tandem repeats of EH domains. EH domains are found exclusively in proteins that function in endocytosis and vesicular trafficking and are believed to regulate these processes. They recognize proteins containing single or multiple NPF (Asn-Pro-Phe) motifs, such as NUMB (Figure 3.10 A).
In ELM, the canonical EH-binding peptide is a strongly conserved NPF motif. NUMB also contains PID and NUMB domains in the N-terminal and middle parts of the protein. This interface is known. According to the predicted AF-MM structure, proline and phenylalanine fit the hydrophobic pocket on the EH domain very well (Figure 3.10 C). BRET experiments showed that W275A did not significantly affect binding (Figure 3.10 E), whereas the L271D mutant was destabilised when co-expressed with wild-type NUMB (Appendix, Figure 5.4 D). Similarly, the motif deletion mutants were not expressed well in my hands. Therefore, it was hard to draw any conclusions regarding the interface.

Figure 3.10: DMIs mediating REPS1 interactions. (A) Schematic illustrating REPS1, its partners and their interactions mediated by interfaces. Each edge end points towards the predicted motif: an arrow denotes a predicted DMI, a half-circle denotes a known DMI, and gray indicates an interaction mediated by a different interface; a question mark means that the interface was predicted by AF-MM using the fragmentation approach. (B) Predicted structure of the interface of the predicted DMI, with the domain in contact with the second motif in the REPS1-TRAPPC2L interaction. Mutated residues are highlighted on the domain (blue) and on the motif (green), with arrows pointing to these residues. (C) Predicted structure of the interface of the known DMI, with the domain in contact with the second motif in the REPS1-NUMB interaction. Mutated residues are highlighted on the domain (blue) and on the motif (green), with arrows pointing to these residues. (D) Experimental validation of the putative DMI of REPS1-TRAPPC2L using the BRET saturation assay. (E) Experimental validation of the known DMI of REPS1-NUMB using the BRET saturation assay.
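The BRET saturation assays referred to in these panels are commonly interpreted with a one-site hyperbolic model, in which the BRET signal rises with the acceptor/donor expression ratio towards a plateau. The following is a minimal Python sketch under that assumption; the parameter names `bret_max` and `bret50`, the grid-search fit and the synthetic data are illustrative and are not taken from the analysis pipeline of this thesis.

```python
def saturation(x, bret_max, bret50):
    """One-site hyperbolic model: BRET as a function of acceptor/donor ratio x."""
    return bret_max * x / (bret50 + x)


def fit_saturation(xs, ys, k_grid):
    """Grid-search bret50; for each candidate, solve bret_max by least squares."""
    best = None
    for k in k_grid:
        f = [x / (k + x) for x in xs]
        denom = sum(v * v for v in f)
        if denom == 0:
            continue
        bmax = sum(y * v for y, v in zip(ys, f)) / denom  # closed-form optimum
        sse = sum((y - bmax * v) ** 2 for y, v in zip(ys, f))
        if best is None or sse < best[2]:
            best = (bmax, k, sse)
    return best[0], best[1]


# Synthetic, noise-free data generated from the model itself (not measured values).
ratios = [0.25, 0.5, 1, 2, 4, 8, 16]
signals = [saturation(x, 1.0, 2.0) for x in ratios]
grid = [0.05 * i for i in range(1, 401)]
bret_max, bret50 = fit_saturation(ratios, signals, grid)
print(round(bret_max, 2), round(bret50, 2))
```

In this picture, the plateau (`bret_max`) depends on the distance and orientation of the tags, while the ratio at half-maximal signal (`bret50`) reflects the apparent affinity, so a mutant can show a lower plateau and a lower `bret50` at the same time.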
REPS1-TRAPPC2L

The EH domains of REPS1 were predicted to bind to the LIG_EH_1 motif at residues 112-116 of TRAPPC2L. There was only one prediction, with a high DMI match score of 0.883. AF-MM modeled the NPF residues of the motif fitting into the deep pocket of the domain (Figure 3.10 A). Although this is an interesting prediction and the motif is docked as seen in the known structure 2JXC (not shown), the confidence score was low (0.6) (Figure 3.10 B). The L323D mutation in the REPS1 domain, located deeper in the pocket, slightly reduced the interaction, while W327A, close to the contact with the edge residues of the motif, did not affect the interface. Of the mutations in the motif, P114G weakened binding. However, F115G and deletion of the motif did not disrupt binding (Figure 3.10 D). It would be interesting to employ the fragmentation approach to predict a novel interface.

To sum up, we could test 14 out of 20 selected DMIs across 12 PPIs. Among the 14 tested DMIs, out of 7 predicted DMIs we confirmed both binding regions in 5 DMIs (CTBP1-DMRTB1, WWOX-CPSF6 (ii), WWOX-CPSF6 (iii), WWOX-DAZAP2, PPP3CA-FAM167A) and partially confirmed 1 DMI (SPOP-RXRB), for its domain region only. Additionally, out of 7 known DMIs we re-confirmed 5 (CTBP1-TGIF1, CTBP1-IKZF1, WWOX-LITAF (i), WWOX-LITAF (ii), IQCB1-CALML3 (iii)) and re-confirmed the motif region for 1 DMI (IQCB1-CALM1 (iii)). We also tested 7 negative controls mediated by different interfaces and showed that 5 PPIs (CTBP2-CTBP2, WWOX-HOXA1, WWOX-CSNK2B, WWOX-SNRPC, PPP3CA-PPP3R2) might bind through different interfaces, whereas for the other 2 PPIs (SPOP-MYD88, IQCB1-MNS1) the predicted interfaces are likely to be wrong, and further investigation is required to define the interfaces of these interactions.
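The counts in this summary can be re-derived from the interfaces it names. The short Python tally below is only a bookkeeping sketch: the outcome labels are illustrative, and the one predicted and one known DMI that the summary does not name are left out rather than guessed.

```python
from collections import Counter

# Outcomes transcribed from the summary above; the labels are illustrative.
predicted = {
    "CTBP1-DMRTB1": "confirmed",
    "WWOX-CPSF6 (ii)": "confirmed",
    "WWOX-CPSF6 (iii)": "confirmed",
    "WWOX-DAZAP2": "confirmed",
    "PPP3CA-FAM167A": "confirmed",
    "SPOP-RXRB": "domain only",
}
known = {
    "CTBP1-TGIF1": "confirmed",
    "CTBP1-IKZF1": "confirmed",
    "WWOX-LITAF (i)": "confirmed",
    "WWOX-LITAF (ii)": "confirmed",
    "IQCB1-CALML3 (iii)": "confirmed",
    "IQCB1-CALM1 (iii)": "motif only",
}
negative_controls = {
    "CTBP2-CTBP2": "different interface",
    "WWOX-HOXA1": "different interface",
    "WWOX-CSNK2B": "different interface",
    "WWOX-SNRPC": "different interface",
    "PPP3CA-PPP3R2": "different interface",
    "SPOP-MYD88": "unresolved",
    "IQCB1-MNS1": "unresolved",
}


def tally(outcomes):
    """Count how many interfaces fall into each outcome class."""
    return dict(Counter(outcomes.values()))


summary = {
    "predicted": tally(predicted),
    "known": tally(known),
    "negative controls": tally(negative_controls),
}
for group, counts in summary.items():
    print(group, counts)
```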
3.4 The application of the strategy of the variant effect on PPIs

Interaction profile for variants falling in WWOX

Comparative interaction profiling is challenging because pathogenic mutations in motifs are scarce and disordered regions are difficult to crystallize for functional studies. Moreover, many mutations reported in databases such as ClinVar have not been functionally validated and rely on predictive tools such as PolyPhen-2, which has shown limitations in predicting variant effects (Sahni et al. 2015). This makes the interpretation of PPI profiling uncertain. We propose a PPI-centric strategy that incorporates domain-motif interface (DMI) information and that seems suitable for better prioritizing and interpreting variants, providing a clearer understanding of their impact on protein interactions and their contribution to disease.

To showcase the application of our strategy, we characterized the variants selected for the study. WWOX, a protein involved in neural development and cancer, was chosen to explore the impact of specific mutations on its interactions. Three mutants (two VUS and one pathogenic mutation) were successfully cloned and experimentally tested (Figure 3.11 A). The pathogenic mutation, E17K, is documented in ClinVar and was found in patients with developmental and epileptic encephalopathy (DEE). However, there is no functional evidence for its pathogenicity; the prediction was made with PolyPhen-2. According to the AF-MM predicted structures, this mutation in the interacting WW1 domain is not in contact with the motif, suggesting that it should not disrupt the interaction at this site (Figure 3.11 B). Indeed, the experimental data confirmed that the BRET signal for the WWOX-LITAF interaction remained similar to that of the wild-type, indicating no significant impact on this PPI (Figure 3.11 D (i)). E17K also did not affect the WWOX-CSNK2B, WWOX-HOXA1 and WWOX-SNRPC interactions (Figure 3.11 D (iv-vi)).
The effect on the interactions with DAZAP2 and CPSF6 could not be assessed because the mutant was unstable during co-expression (see Appendix, Figure 5.5 B & C), necessitating further investigation (Figure 3.11 D (iii, v)). Overall, this pathogenic mutation did not disrupt the interface, implying that its clinical impact may involve other processes or that the mutation is not pathogenic as stated in ClinVar.

The VUS variant E17D showed an interaction profile similar to that of the pathogenic variant E17K, with no significant impact on the interactions with CPSF6 and DAZAP2, though it was not tested with SNRPC (Figure 3.11 D (ii, iii, v)).

Figure 3.11: The effect of variants falling into the interface using interaction profiling. (A) Schematic illustration of the functional regions within WWOX with the location of the variants, where the color indicates pathogenic (red) and VUS (gray) variants. (B) i Predicted structure of the interface of the WW1 domain with the first PPSY motif in the WWOX-LITAF interaction. (B) ii Predicted structure illustrating the second motif on CPSF6 and the tandem WW domains. (B) iii Predicted structure illustrating the third predicted motif on CPSF6 and the tandem WW domains, where the motif is in contact with the second WW domain. (B) iv Putative model of the motif on DAZAP2 and the tandem WW domains. (C) Schematic of the PPIs illustrating the interaction profiles of wild-type and mutated interactions. (D) i Experimental assessment of the variants on the known DMIs of WWOX-LITAF using the BRET saturation assay. ii Experimental assessment of the variants on the putative DMIs of WWOX-CPSF6 using the BRET saturation assay. iii Experimental assessment of the variants on the putative DMIs of WWOX-DAZAP2, iv on the putative DMIs of WWOX-HOXA1, v on the putative DMIs of WWOX-CSNK2B using the BRET saturation assay.
vi Experimental assessment of the variants on the putative DMIs of WWOX-SNRPC using the BRET saturation assay. (E) Microscopy experiment showing the localization of the H37D variant compared to wild-type WWOX. The intensity of nanoLuc luciferase-tagged LITAF (wild-type and mutant) is shown inverted and in magenta, and that of mCitrine-tagged wild-type WWOX is shown inverted and in cyan. Scale bar = 10 µm.

In contrast, the VUS variant H37D in WWOX, found in patients with developmental and epileptic encephalopathy and autosomal recessive spinocerebellar ataxia 12, is located within the DMI interface with DAZAP2, CPSF6, and LITAF. The substitution of histidine with a negatively charged aspartate could disrupt the interactions by interfering with a tyrosine residue within the motifs (Figure 3.11 B). The experimental data confirmed this: H37D notably disrupted the interactions with LITAF, DAZAP2, and CPSF6 that are mediated by this interface (Figure 3.11 D (i-iii)), while the interactions with HOXA1, SNRPC, and CSNK2B remained unaffected (Figure 3.11 D (iv-vi)).

WWOX binds to LITAF, a protein involved in mediating inflammatory responses and apoptosis. The ability of the WW1 domain to bind the motifs in these partners likely contributes to the regulation of signaling processes. LITAF is critical for controlling inflammation and cell death. Disruption of this interaction could hinder the regulatory role of WWOX, leading to unchecked inflammatory responses or improper cell death signaling and potentially contributing to disease pathology. DAZAP2 is involved in RNA processing and in signaling pathways that regulate cellular differentiation and proliferation. The interaction with WWOX may help modulate these pathways, ensuring an appropriate cellular response to stress and developmental cues.
Disruption of this interaction might lead to altered RNA processing or dysregulation of signaling pathways, affecting cellular homeostasis and potentially contributing to developmental disorders. CPSF6 plays a crucial role in RNA cleavage and polyadenylation, processes essential for mRNA maturation. The binding of WWOX to CPSF6 could influence these processes by modulating RNA metabolism and the regulation of gene expression. The H37D variant could prevent proper binding of the WW1 domain to CPSF6, potentially affecting the function of the CPSF complex. This disruption might have widespread effects on gene expression, mRNA stability, and the cellular response to DNA damage, which are critical in neurodevelopmental and neurodegenerative diseases.

This result suggests that profiling variants based on shared interaction disruption and on their impact on the DMI interface may be an informative approach to characterizing candidate disease-associated mutations. However, we cannot exclude the possibility that some expressed mutants are partially misfolded or disrupt PPIs by altering protein compartmentalization. To test this, we used microscopy to check whether the mutant constructs alter localization compared to the wild-type protein (Figure 3.11 E). Our findings indicate that the H37D variant remains in the same cellular compartment as wild-type WWOX, suggesting that the observed disruption of interactions is unlikely to be due to mislocalization.

Interaction profiles of variants found in IQCB1

Mutations found in IQCB1 are mainly associated with retinal disorders such as Senior-Loken syndrome 5 and Leber congenital amaurosis 10 (LCA 10). Many variants in IQCB1 are of uncertain significance, and the available evidence is currently insufficient to determine the definitive role of these variants in the disease.
For example, the R404G variant has been identified in patients with nephronophthisis and other inborn genetic diseases and is classified as a Variant of Uncertain Significance (VUS). Algorithms developed to predict the effect of missense changes on protein structure and function, such as PolyPhen-2 ("Probably Damaging"), do not consistently agree on the potential impact of this missense change. This variant has not been reported in the literature in individuals affected by IQCB1-related conditions and is not present in population databases (e.g., ExAC shows no frequency for this variant). The arginine residue (R404) is highly conserved and is predicted to be positioned deep within the domain of CALM1/CALML3 (Figure 3.12 B (i)). A change of the residue at this position might affect binding affinity and specificity, disrupting critical protein-protein interactions (PPIs). Another variant of uncertain significance, N406Y (Figure 3.12 A), reported in ClinVar, falls within the same motif and could disrupt PPIs similarly (Figure 3.12 B (i)). As expected, the experimental data indicate a slight reduction in binding affinity for the interactions with both CALM1 and CALML3 (Figure 3.12 D (i)). Interestingly, the perturbing effect of these mutants was more pronounced when they were co-expressed with the motif-binding CALML3 (Figure 3.12 D (i, top right)), suggesting a differential impact on binding efficiency between the two calmodulin-like proteins. This differential impact may reflect differences in structural conformation or binding dynamics between CALM1 and CALML3, which could influence the pathophysiological consequences of these mutations.

IQCB1 is involved in several cellular processes, including cilia function and protein trafficking, which are crucial for maintaining photoreceptor cell integrity in the retina.
Disruption of the interactions with CALM1 and CALML3 by mutations such as R404G and N406Y could impair calmodulin-mediated signaling pathways, leading to defective cilia assembly or maintenance. This disruption might contribute to the pathology observed in retinal degenerative diseases such as Senior-Loken syndrome 5 and Leber congenital amaurosis 10. Moreover, altered calmodulin interactions could affect calcium homeostasis and the cellular stress response, further exacerbating disease progression in affected individuals.

Figure 3.12: The effect of variants falling into the motif of IQCB1 using interaction profiling. (A) Schematic illustration of the functional regions within IQCB1 with the location of the variants, where the color indicates VUS (gray). (B) i Predicted structure of the interface of the EF-hand domain of CALM1 in contact with the third motif. The structure shows the predicted interface and the VUS variants (gray). ii Predicted structure of the EF-hand domain of CALM1 in contact with the third motif. The structure shows the predicted interface with pathogenic (red) and VUS (gray) variants. iii Zoomed-out view of the predicted structure shown in ii. iv Predicted structure of the interface of the EF-hand domain of CALML3 in contact with the third motif. The structure shows the predicted interface and the VUS variants (gray). (C) i Schematic of the PPIs illustrating the interaction profiles of wild-type and mutated IQCB1 interactions. ii Schematic of the PPIs illustrating the interaction profiles of wild-type and mutated CALM1 and CALML3 interactions. (D) i Experimental assessment of the variants on the known DMIs of IQCB1-CALM1 (left) and IQCB1-CALML3 (right) using the BRET saturation assay.
ii Experimental assessment of the CALM1 pathogenic (right) and VUS (left) variants on the putative DMIs of IQCB1-CALM1 using the BRET saturation assay. iii Experimental assessment of the CALML3 variants on the putative DMIs of IQCB1-CALML3 using the BRET saturation assay.

Calmodulin is an essential calcium-sensing, signal-transducing protein. Three calmodulin genes, CALM1, CALM2, and CALM3, have unique nucleotide sequences but encode identical calmodulin proteins with four EF-hand calcium-binding domains. Calcium-induced activation of calmodulin regulates many calcium-dependent processes and modulates the function of cardiac ion channels. F90L is a pathogenic variant found in patients with long QT syndrome 14 and documented in ClinVar. The substitution occurs at a highly conserved residue between EF-hand domains II and III. Its pathogenicity has not been functionally studied. Consistent with the position of the residue at the hydrophobic clutch of the EF-hand domains (Figure 3.12 B (ii)), the experimental data showed that F90L disturbed binding (Figure 3.12 D (ii)), while the other pathogenic variant, E105K, located outside of the interface (Figure 3.12 C (ii)), did not have any effect (Figure 3.12 D (ii)). This variant occurred de novo in a patient submitted for whole exome sequencing, and there is no functional evidence for it. Although the expression of this mutant is very high (see Appendix, Figure 5.7 A), the substitution might partially destabilize the protein, causing the pathogenic effect. Alternatively, the variant might not be pathogenic, and the in silico analysis reported in ClinVar might be incorrect. Although E105 is outside the direct binding interface, it is part of the hydrophobic clutch that mediates the interaction between the EF-hand domains. A disruption here could impair the coordinated movement and proper orientation of these domains, reducing the ability of calmodulin to expose the hydrophobic patches needed to bind target proteins effectively.
We also tested three VUS variants, of which M110L in CALM1 was predicted to be in the domain, and D94H and R87H were predicted to be outside of the interface (Figure 3.12 D (iii, iv)). Surprisingly, M110L caused only a slight reduction, while the D94H variant, found in patients with catecholaminergic polymorphic ventricular tachycardia 4 and long QT syndrome 14, significantly affected the interaction with CALM1 (Figure 3.12 D (ii)). The VUS A89D in CALML3 also affected the interaction with IQCB1 (Figure 3.12 D (iii)).

Interaction profile for variants falling in SPOP

We also tested the effect of pathogenic variants detected in the MATH domain of SPOP on the interactions with its partners RXRB and MYD88 (Figure 3.13 B (i-ii)). As expected, the mutants perturbed the interaction with the predicted motif on RXRB (Figure 3.13 D (i)). Interestingly, the Y87C variant did not affect BRET with MYD88 (Figure 3.13 D (ii)). However, because the predicted interface is probably not correct, as shown previously in the literature and in this study, these mutants were expected not to lie on the true interface with MYD88 and therefore to have no effect on binding. In agreement with this assumption, the VUS variants also did not change the interaction with MYD88.

Figure 3.13: The effect of variants falling into the motif of SPOP using interaction profiling. (A) Schematic illustration of the functional regions within SPOP with the location of the variants on the MATH domain, where the color indicates the pathogenic (red) variants. (B) i Predicted structure of the MATH domain of SPOP in contact with the motif of RXRB. ii Negative control interaction SPOP-MYD88 with the novel interface predicted using the AF-MM fragmentation approach. The predicted model shows the MATH domain binding to the N-terminal motif on MYD88, with the pathogenic variants on the MATH domain.
iii The same structure with the VUS variant on the predicted motif in MYD88. (D) i Experimental assessment of the variants on the known DMIs of SPOP-RXRB using the BRET saturation assay. ii Experimental assessment of the effect of pathogenic variants in SPOP on the interaction with MYD88. iii Experimental assessment of the effect of VUS variants on the motif of MYD88 on the interaction with SPOP.

The effect of variants sitting on motif

In addition, we tested successfully cloned mutants on the motifs of the partners of our candidate proteins: LITAF, IKZF1, DMRTB1, FAM167A, DAZAP2, CPSF6 and TRAPPC2L. VUS variants found close to the first PPSY motif and on the second PPSY motif of LITAF (Figure 3.14 B (i-ii)) do not disrupt the interaction with WWOX (Figure 3.14 C (i-ii)). The lack of effect could be due to the nature of the substitutions, which may not be significant enough to change the binding affinity between the two proteins and therefore fail to cause a noticeable disruption of the interaction. Further experiments could help clarify the extent of the impact of these variants on different protein partners. Additionally, this interaction is maintained by two PPSY motifs and the WW-1 domain, which can compensate for the loss of a single contact point, masking the effect of certain variants. It would be interesting to test the effect of these VUS on interactions with other partners that are mediated by the same DMI but by only one interface, to determine whether the variants disrupt those interactions or leave them similarly intact. In addition, the BLI experiment showed that the Y61D mutant was localized similarly to the wild-type.

Figure 3.14: The effect of variants falling into the motif of LITAF using interaction profiling.
(A) Schematic illustration of the functional regions within LITAF with the location of the variants on the motifs, where the color indicates the VUS variants. (B) i Predicted structure of the WW-1 domain recognizing the first PPSY motif on LITAF. It also shows the VUS variants situated close to the motif. ii Predicted structure of the WW-1 domain recognizing the second PPSY motif on LITAF. It also shows the VUS variants situated close to the motif. (C) i Experimental assessment of the VUS LITAF variants on the known first motif using the BRET saturation assay. ii Experimental assessment of the VUS LITAF variants on the known second motif using the BRET saturation assay. (D) BLI experiment testing the localization of the Y61D variant compared to wild-type LITAF. The intensity of nanoLuc luciferase-tagged LITAF (wild-type and mutant) is shown inverted and in magenta, and that of mCitrine-tagged wild-type WWOX is shown inverted and in cyan. The images of both interacting proteins are merged.

The variants located in the flanking regions close to the predicted motif of IKZF1 (M31V, S41L) showed a slight effect. VUS variants found on the motifs of DMRTB1 and FAM167A, e.g. R178H (Figure 3.15 B (i-ii)) and V8M (Figure 3.15 C (i-ii)), showed lower BRET compared to the wild-type (Figure 3.15 B (iii) and C (iii)). On the other hand, VUS located away from the motifs showed BRET results similar to those of the wild-type interactions (Figure 3.16).

Figure 3.15: The effect of variants falling into the motif of IKZF1, DMRTB1 and FAM167A using interaction profiling. (A) i Schematic illustration of the functional regions within IKZF1 with the location of the VUS (gray) variants on the motifs.
ii Predicted structure of the CTBP1 domain recognizing the first PEDLS motif in IKZF1. It also shows the VUS variants situated close to the motif. iii Experimental assessment of the VUS variants on a known motif in IKZF1 using the BRET saturation assay. (B) i Schematic illustration of the functional regions within DMRTB1 with the location of the VUS (gray) variant on the motif. ii Predicted structure of the CTBP1 domain recognizing the first PLDLR motif on DMRTB1. It also shows the VUS variant situated close to the motif. iii Experimental assessment of the VUS variant in the putative motif in DMRTB1 using the BRET saturation assay. (C) i Schematic illustration of the functional regions within FAM167A with the location of the VUS (gray) variant on the motif. ii Predicted structure of the PPP3CA domain recognizing the putative motif on FAM167A. It also shows the VUS variant situated close to the motif. iii Experimental assessment of the VUS variant on a putative motif in FAM167A using the BRET saturation assay.

Here we evaluated the effect of variants on PPIs mediated by domain-motif interfaces. We showed that variants located within the motif region can disrupt interactions, potentially altering the function of these interactions and contributing to disease development. Moreover, mutations near the motif region may also slightly affect the interface, potentially disrupting the biological processes mediated by these interactions.

Figure 3.16: The effect of variants falling into the motif of DAZAP2, CPSF6 and TRAPPC2L using interaction profiling. (A) i Schematic illustration of the functional regions within DAZAP2 with the location of the VUS (gray) variant close to the motif. ii Predicted structure of the WW-1 domain recognizing the predicted motif in DAZAP2. It also shows the VUS variant situated close to the motif.
iii Experimental assessment of the VUS variant on a predicted motif in DAZAP2 using BRET saturation assay. (B) i Schematic illustration of the functional regions within CPSF6 with the location of the VUS (gray) variant close to the third motif. ii Predicted structure of the WW-1 domain recognizing the third motif on CPSF6, also showing the VUS variant situated close to the motif. iii Experimental assessment of the VUS variant in a putative motif in CPSF6 using BRET saturation assay. (C) i Schematic illustration of the functional regions within TRAPPC2L with the location of VUS (gray) variants close to the motif. ii Predicted structure of the EH domain in REPS1 recognizing the putative motif in TRAPPC2L, also showing the VUS variant situated close to the motif. iii Experimental assessment of the VUS variants on a putative motif in TRAPPC2L using BRET saturation assay.

However, not all mutations within the interface necessarily disrupt the interaction. Some residues, even when mutated, may not significantly alter the interface if the substitution does not substantially change the binding strength. Additionally, some interactions may be stabilized by multiple interfaces, which can compensate for the loss of a single contact point, masking the effect of certain variants (e.g. LITAF-WWOX). On the other hand, the disruption of the interaction might be caused by partial folding or mislocalization of the mutant, rather than direct interference with the binding interface. Therefore, additional studies are needed to confirm these possibilities and to determine whether observed disruptions are due to changes in protein structure or localization.

Overall, our findings indicate that integrating interaction disruption profiles with DMI interface information can enhance our understanding of variant effects in the context of PPIs.
This combined approach allows for a more nuanced characterization of variants, potentially leading to better identification of disease-associated mutations and providing deeper mechanistic insights into their role in disease pathology. However, considering the complexity and number of interfaces that mediate interactions, the diverse biological processes they influence, structural conformations, the specific properties of each amino acid at the contact sites, and the residue it is changed to, this strategy can be further refined to achieve more accurate and controlled results.

Chapter 4
Conclusion and future perspectives

4.1 Deciphering protein interaction interfaces using DMI predictor tool

The development of the DMI tool and its application to HuRI annotated about 3200 protein-protein interactions (PPIs) with high-confidence putative DMI interfaces (see Chapter 3, Section 3.3), providing valuable insights into the mechanistic functions of these interactions. This advancement has greatly enhanced our ability to understand how specific mutations might disrupt interactions, aiding in the characterization of variants found in patients. By analyzing how a variant perturbs PPIs, we can hypothesize its potential contribution to the development of disease symptoms or aetiology. Such hypotheses can then be tested through downstream experiments, which is crucial for the advancement of precision medicine.

Despite these advancements, there is still room for improvement in the performance of the DMI tool. One issue is its inability to distinguish between repetitive tandem domains (e.g. RR1 and RR2), which often appear sequentially within proteins and may serve different functions in mediating interactions. Incorporating domain-specific annotations and functional classifications can help differentiate between tandem repeats by considering their unique roles and sequence patterns.
Advanced pattern recognition methods and contextual analysis can refine sequence analysis.

Another limitation identified during manual analysis is that some predicted DMIs did not meet the cutoff due to low IUPred scores, despite the motifs being disordered. This issue is likely due to the window-based nature of IUPred, where regions adjacent to folded segments are often predicted as folded. Therefore, adjusting the window size or incorporating additional prediction tools, for example AF-MM, could improve the tool’s ability to detect likely true motifs.

Our findings also showed that variants found in the flanking regions of the motif can slightly affect the interactions, suggesting the potential involvement of these regions in maintaining the interface of a PPI. This insight can be integrated into the definition of potentially functional regions and into variant effect characterization, helping to refine the understanding of how flanking regions contribute to interface stability and potentially influencing the assessment of variant impacts on protein interactions (Luck et al. 2012).

Furthermore, with the recent update of the ELM database, which has enriched SLiM classes with new instances, re-running the DMI tool using this updated dataset could significantly enhance prediction accuracy and outcomes. In addition, our predicted and experimentally validated data can

4.2 The application of DDI predictor and AlphaFold to map the PPI data with interaction interfaces

There is an overwhelmingly large number of PPIs that are not mapped with any known interfaces, pointing to the fact that many interface types remain to be discovered, especially those involving motifs (Rolland et al. 2014; Tompa et al. 2014). To detect these interfaces, the AF-MM approach can be employed to identify novel interfaces, which can then be mapped onto PPIs and overlapped with mutation data, as demonstrated in Chapter 2, Article II.
All in all, using AF-MM to discover novel interfaces holds great potential as it bypasses the need for a reference list of interface types for interface searching. However, scaling this approach for higher throughput will require further development.

Another type of interface that was not mapped in this study is the domain-domain interface (DDI). Given the more stable nature of folded domains and the interactions they mediate, structural information on DDIs is more abundant than on DMIs. For example, the 3did database extensively catalogs DDIs (Geist et al. 2024). Our lab has assessed the quality of DDIs in 3did, providing us with important insights regarding features that can aid in scoring predicted DDIs for their ability to mediate PPIs (Geist et al. 2024). Incorporating these insights can improve the mapping of the PPI dataset with DDIs, which helps to interpret the effect of variants on protein function.

4.3 Enhancing Predictive Accuracy of Variant Effects and Mutation Design through Positioning on Predicted AF-MM Interface Structures

The application of AF-MM to predict novel interfaces, for which resolved structures are not available, significantly aided in understanding, through visualization, how well the putative motif fits into the binding pocket. This process allows for the detection of residues in close contact, the assessment of the structural location of mutated residues, and the design of mutants for experimental validation. While the structural information can give insight into the predicted interfaces and help in variant characterization, the manual inspection of predicted structures and the localization of variants is time-consuming. To address this limitation, my colleague is currently working on applying AF-MM to the entire set of DMIs overlapping with pathogenic variants or VUS, to analyze and implement the structural information.
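As a rough illustration of how the manual inspection step described above could be partly automated, the following Python sketch lists the motif residues that lie within a distance cutoff of the domain and labels a variant position accordingly. The 5 Å cutoff, the flank width, the label names, the function names and the toy coordinates are all assumptions made for illustration; they are not parameters or code used in this thesis, and real input would have to come from parsing a predicted structure file.

```python
import math


def residues_in_contact(motif_atoms, domain_atoms, cutoff=5.0):
    """Return sorted motif residue numbers that have at least one atom
    within `cutoff` (in angstrom, assumed value) of any domain atom.

    Each atom is a (residue_number, (x, y, z)) tuple.
    """
    contacts = set()
    for res_m, xyz_m in motif_atoms:
        for _res_d, xyz_d in domain_atoms:
            if math.dist(xyz_m, xyz_d) <= cutoff:
                contacts.add(res_m)
                break  # one close domain atom is enough for this motif atom
    return sorted(contacts)


def classify_variant(position, motif_start, motif_end, contacts, flank=3):
    """Label a variant position relative to a motif (hypothetical labels)."""
    if position in contacts:
        return "contact"
    if motif_start <= position <= motif_end:
        return "motif"
    if motif_start - flank <= position <= motif_end + flank:
        return "flank"
    return "distal"


# Toy coordinates (hypothetical, one atom per residue).
motif_atoms = [(10, (0.0, 0.0, 0.0)), (11, (3.0, 0.0, 0.0)), (12, (20.0, 0.0, 0.0))]
domain_atoms = [(100, (0.0, 3.0, 0.0)), (101, (3.0, 2.0, 0.0))]

contacts = residues_in_contact(motif_atoms, domain_atoms)
labels = {pos: classify_variant(pos, 10, 12, contacts) for pos in (11, 12, 14, 30)}
print(contacts, labels)
```

Such labels could then be compared with the measured effect of each variant in the BRET assay, keeping in mind that a distance cutoff is only a crude proxy for an energetically relevant contact.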
4.4 Improvement of the BRET assay to validate the predicted interfaces

While the medium-throughput cloning pipeline and BRET assay developed in this study have been valuable for validating predicted interaction interfaces, several steps within the pipeline could be optimized to enhance both efficiency and accuracy. Currently, the manual picking of colonies is a bottleneck in the plate-based medium-throughput pipeline, requiring substantial time and labor to select individual colonies for inoculation. Implementing automated colony pickers could address this issue by handling multiple colonies simultaneously with higher precision, thereby speeding up the workflow and reducing the risk of contamination or human error.

Using the BRET assay we detected about 50% of protein-protein interactions from HuRI. Although this detection rate matches or exceeds those of previous studies, there is still room for improvement. Initially, we fused tags exclusively at the N-terminus, based on the observation that expression is better with N-terminal tags (Trepte et al. 2018). However, Trepte et al. have demonstrated that testing protein pairs in various configurations increases detection rates while maintaining low false detection rates (Trepte et al. 2018; Trepte et al. 2021). They also showed that tagging the proteins close to the interaction interface might improve PPI detection. While cloning tags in different configurations or close to the interface could enhance the detection of PPIs, this approach might also increase the time required for cloning.

Additionally, choosing a more sensitive fusion tag can enhance the detection capability of the BRET assay regardless of the tag’s position relative to the interaction interface. For example, using tags with higher quantum yields or those that offer better resonance energy transfer efficiencies can lead to stronger and more reliable BRET signals.
For instance, using mNeonGreen as an acceptor fluorophore in BRET assays significantly increased the dynamic range and sensitivity compared to traditional GFP derivatives (Shaner et al. 2013). These more sensitive tags could improve detection sensitivity even if the tag is not in the optimal position relative to the interaction interface.

The NL and mCit fusions used in the BRET assay allowed us to monitor the expression levels of wild-type and mutant constructs, which is important to rule out loss of binding because of a destabilization of the protein. However, we cannot exclude the possibility that some expressed mutants might still be partially unfolded or mislocalized and thus, some loss of binding detected in our study could be unspecific and not the result of a specific perturbation of the predicted interface (Lacoste et al. 2023). Advanced imaging techniques could be scaled up and integrated into the workflow to assess whether mutant proteins are mislocalized. This approach would help determine if the observed binding loss is due to the mutant proteins being in the incorrect cellular compartment rather than a direct effect on the interaction interface. While I attempted to implement BRET-based bioluminescence imaging (BLI) to test whether the localization of a mutant is changed compared to the wild-type protein, we faced challenges in the setup of experimental steps that needed to be optimized for robust quantitative analysis. This optimization involves finding the optimal number of cells for seeding, the DNA ratio for more efficient transfection and the concentration of the transfection agent, as well as downstream analysis involving defining regions of interest (ROIs) for specific cellular compartments, using either manual methods or automated segmentation algorithms. This approach allows for assessing whether the mutant proteins overlap with markers of these compartments and determining any shifts in localization.
Further statistical analysis would ensure that any observed differences are significant and not due to random variation. Implementing these strategies will help confirm if localization changes contribute to the observed effects, thereby providing a more accurate interpretation of the impact of mutations on protein function.

Along with mislocalization studies, BRET-based imaging can be used for the detection of BRET within single cells (Dragulescu-Andrasi et al. 2011; Kobayashi et al. 2019). The determination of BRET per cell might enhance the precision of interaction studies by providing detailed insights into how individual cells contribute to the overall interaction dynamics. Quantifying BRET signals per cell allows for a more granular analysis of the interactions, potentially revealing variations in PPI strength and localization that may be masked in bulk measurements. Moreover, BRET-based microscopy can be applied not only in mammalian cells but also in tissues and in vivo animal models (Dragulescu-Andrasi et al. 2011). Kobayashi et al. (2019) demonstrated the use of BRET-based imaging to monitor protein interactions and subcellular localization in live animal tissues. Their study emphasized that BRET, with its enhanced dynamic window due to reduced background signals, is particularly effective for detecting subtle changes in protein interactions. They also illustrated the quantification of BRET signals, including the dissociation of protein complexes and redistribution within cellular compartments. For instance, they used manual segmentation and pixel-by-pixel analysis to quantify BRET signals from specific subcellular regions, revealing significant changes in protein interactions upon receptor activation. This quantitative approach enabled precise measurements of BRET signal changes, facilitating detailed insights into dynamic biological processes such as receptor endocytosis and protein localization in vivo.
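To make the idea of pixel-by-pixel quantification within a segmented region more concrete, the following Python sketch computes a mean acceptor/donor ratio over the pixels of one ROI. It is only a minimal sketch under stated assumptions: the images are plain nested lists, the ROI is a Boolean mask, the function name and the donor intensity threshold (`min_donor`) are hypothetical, and no background or donor-only correction is applied. It does not reproduce the analysis of the cited studies or of this thesis.

```python
def bret_ratio_in_roi(acceptor, donor, roi_mask, min_donor=1.0):
    """Mean per-pixel acceptor/donor ratio inside a region of interest.

    acceptor, donor: 2D lists of emission intensities (same shape).
    roi_mask: 2D list of booleans, True for pixels inside the ROI.
    min_donor: assumed threshold; pixels with weaker donor signal are
               skipped to avoid dividing by noise.
    Returns None if no pixel qualifies.
    """
    ratios = []
    for acc_row, don_row, mask_row in zip(acceptor, donor, roi_mask):
        for acc, don, inside in zip(acc_row, don_row, mask_row):
            if inside and don >= min_donor:
                ratios.append(acc / don)
    if not ratios:
        return None
    return sum(ratios) / len(ratios)


# Toy 2x2 images (hypothetical values).
acceptor = [[2.0, 4.0], [1.0, 0.0]]
donor = [[4.0, 8.0], [0.5, 2.0]]
roi = [[True, True], [True, False]]

roi_ratio = bret_ratio_in_roi(acceptor, donor, roi)
empty_ratio = bret_ratio_in_roi(acceptor, donor, [[False, False], [False, False]])
print(roi_ratio, empty_ratio)
```

Running the same function with one mask per cell or per compartment would give per-cell or per-compartment values that can then be compared between wild-type and mutant constructs.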
However, it was done on a small scale. The development of a scalable, plate-format BRET-based BLI pipeline remains to be addressed.

4.5 General outlook

This thesis has proposed a strategy driven by prediction and experimental validation of domain-motif interfaces and by integrating this information to interpret the effect of uncharacterized variants on protein function. In doing so, we have gained profound insights into the intricate interplay between different functional modules, such as domains and motifs in proteins, that facilitate their interactions. Moreover, using this strategy we provided experimental evidence and structural information on the effect of variants falling into DMIs mediating protein-protein interactions. This information can be explored in future studies aimed at delineating potential molecular mechanisms causing disease.

Given the useful mechanistic insights that prediction tools like the DMI predictor tool can provide, I expect the optimization and application of these tools (DDI predictor and AF-MM) in mapping PPIs with interfaces to bring us closer to a fully structurally annotated human protein interactome. Moreover, I anticipate greater inclusion of interface information in experimental workflows, where this will help generate hypotheses to guide experiments and aid in variant characterization.

Binary interaction assays like BRET have proven to be suitable tools for validating PPI interfaces, but there are still several ways to further enhance their capabilities in characterizing variant effects on PPIs (Dragulescu-Andrasi et al. 2011; Kobayashi et al. 2019). In addition to expanding the power to systematically assess the effects of variants on PPIs, it is crucial to implement systematic downstream steps (e.g., reporter assays, cell proliferation, apoptosis assays) to gain deeper insights into how these variants impact biological processes.
By integrating these additional steps, researchers can move beyond just identifying whether a variant disrupts a specific interaction and start understanding the functional consequences of these disruptions within the context of cellular pathways and networks. I anticipate seeing the advancements of this assay in this direction.

Chapter 5
Appendix

5.1 Protocols

5.1.1 The medium-throughput cloning protocol

Medium-throughput GATEWAY cloning protocol

Data organization in MySQL DB cloning_data

Every HTP cloning project should have a bioinformatician assigned to it who helps with putting the data in the tables. Everything that the experimentalist can do on his/her own should not be done by the bioinformatician.

table project_descr:
column_name          content
project_id           e.g. CL01
experimenter         e.g. Christian
bioinformatician     e.g. Eric
descr                e.g. cloning project for XL-MS project, more PRS pairs, every ORF cloned in N-ter NL and N-ter mCit vector
date_started         e.g. 2022-10-11

table orf_pairs:
column_name          content
project_id           e.g. CL01
orf_a                e.g. 49583 → if the ORFs of a pair are in no particular order, put the smaller orf_id as orf_a, otherwise describe the order of ORFs in the descr column in the project_descr table
orf_b                e.g. 98584

table entry_clone_info:
column_name          content
orf_id               e.g. 49583
orf_len_nt           e.g. 1980
entry_plate_id       e.g. GDEh81001 → plate name from ORFeome
entry_well_id        e.g. A01 → well ID from plate from ORFeome
entry_inoc_plate_id  e.g. CL01GEh_01
entry_inoc_well_id   e.g. C10
pcr_amplicon         0 or 1 → 1 if PCR product looks good on gel, 0 otherwise
comments             space to leave additional comments for an ORF if needed

table expr_clone_info:
column_name                content
orf_id                     e.g. 49583
expr_plasmid_id            e.g. KL_11
expr_plasmid_name          e.g. pcDNA3.1 cmyc-NL-GW
LR_plate_id                e.g. CL01LR_01.1
LR_plate_well              e.g. C10
colonies                   0 or 1 → 1 if there were colonies selected for picking, 0 otherwise
expr_plate_id              e.g. CL01GExh_01.1
expr_plate_well            e.g. B05
MP_elu_plate_id            this and the next 3 columns only need to be filled if rearray occurred
MP_elu_plate_well
expr_dil_plate_id
expr_dil_plate_well
DNA_conc_ng_ul             e.g. 235
theor_DNA_conc_dil         e.g. 100
seq_confirmed_bb_fw        0 or 1
seq_confirmed_bb_rv        0 or 1
seq_confirmed_full_length  0 or 1
comments                   space to leave additional comments for an ORF if needed

Material List: In separate excel sheet with calculator for amounts
- you can find this checklist on the: group drive → HTP_cloning_data → templates → CL00_checklist_consumables

Prior to start of cloning

Computational part
- 2 weeks in advance: get cloning project ID by checking on group drive in HTP_cloning_data folder what the last cloning ID was, increment by 1 count, i.e. if last cloning ID was CL01 → make CL02
- 2 weeks in advance: design and discuss with Katja and bioinformatician for cloning project plate layout for ORF inoculation plates for Day 1 and plate layout for inoculation plates of picked LR transformants for Day 4 (the way the plates should be organized, code can be written but the rearray is only at day 4 possible)
- consider to leave a well free for the water control for the PCR and if and which controls you would like to have for LR
- Design labels for your plates → to see an example for how plate labels should be designed and for which plates labels are needed → take a look at the file HTP_cloning_plate_labels_schematic.pdf → then start from the plate labels from a previous cloning project by making a copy of the plate_labels.txt file from a previous cloning project into the folder of your new cloning project on the group drive and modify accordingly → if you are not sure, PLEASE, ask → if plate labels are wrong, computational as well as experimental steps of your cloning project can go wrong
- Print the plate labels with the help of Mareen (a template for the labels can be found on group drive, HTP_cloning_data, templates, HTP_plate_labels; please also read the explanation how to print these labels)
- Get plate layouts with the help of a bioinformatician → the scripts to design the plate layouts ideally need information about the plate labels

Experimental part:
- About [?] weeks in advance prepare competent DH5alpha E. coli cells → talk to Mareen, the preparation itself takes 1 week
- 2-4 weeks in advance do maxi, midi or miniprep of empty expression vectors and plasmids for LuTHy assay (KL_01, KL_02, KL_03, KL_06, KL_07, KL_11, KL_247)
- At least 2 weeks in advance, make a copy of the excel sheet with the list of reagents and save it in your cloning folder on the group drive, calculate your amounts and check that everything is available
- 2 weeks in advance check the amount of
  - PCR plates (order no.: 781352 from Brand)
  - PCR foil
  - Reservoir for LB medium (order no.: HT69.1 from Carl Roth)
  - Costar plates (order no.: 3799 from Corning)
  - Microplate aluminum sealing tape (order no.: 6570 from Corning)
  - Adhesive gas permeable seals (order no.: AB-0718 from Thermo Scientific)
  - Combitip advanced 1 ml (order no.: 0030089.430 from Eppendorf)
  - Qtray with lid and divider (square plates for agar; order no.: MLDVX6029 from VWR international GmbH)
  - E-Gel 96 1% Agarose (GP) (check the expiry date; order no.: G700801 from Invitrogen)
  - E-Gel 96 High range DNA marker (order no.: 12352019 from Invitrogen)
  - Sterile/autoclaved 2 ml Deepwell plates, 96 round wells (order no.: E2896-2110 from Starlab)
  - Qiaprep 96 Plus MiniPrep Kit (order no.: 27291 from Qiagen)
  - QIAvac 96 (vacuum system needed for the MiniPrep; order no.: 19504)
  - 1250 µL (blue) integra grip tips for a digital multichannel pipette
  - 125 µL (yellow) integra grip tips for a digital multichannel pipette
  - 12.5 µL (pink) integra grip tips for a digital multichannel pipette
- 1 week in advance - get familiar with
  - The digital multichannel pipette
  - All other equipment you will need
  - The excel sheet to calculate the amounts
  - SQL database
  - LuTHy assay transfection template
  - The scripts for the different steps
- 1 week in advance take all needed consumables on your bench or -20°C
- 1 week in advance - check the amount of:
  - 40% glycerol (sterile)
  - Proteinase K (2 µg/µl)
  - HF PCR polymerase and buffer (from Protein Production CF)
  - dNTPs (NEB freezer at IMB)
  - 96 gel loading buffer (homemade, recipe: 10 mM Tris-HCl, 1 mM EDTA, 0.005% bromophenol blue)
  - LR clonase (from Protein Production CF, should be stored at -80°C or better -150°C)
  - SOC medium (~8 ml/plate)
  - Needed antibiotic
    - Ampicillin (100 mg/ml)
    - Kanamycin (30 mg/ml)
    - Spectinomycin (50 mg/ml)
  - LB medium
  - LB-Agar (250 ml/square plate)
  - Sterile glass plating beads
  - Sterile toothpicks
- At least 1 week in advance, order sequencing barcodes for the plates (Starseq)
- Between 1 and max up to 5 days in advance, prepare square plates with agar
- At least 1 day in advance, sort ORFeome plates in new rack
  - Do this step with one additional person as helper
  - work with blocks of 7 plates, because they fit as one block in the rack
  - Presort your ORF plates into a new rack; you will need ~1h per 10 plates (including time to let the -80 come back to temperature): Take out rack from -80 freezer, close freezer, sort out the plates needed in a box with dry ice. Put the rack back in the freezer and sort the plates into a new rack according to the order you will pick from them. Let the freezer get back to -80˚C before you go for the next batch of plates.
- fill PCR protocol for X-reactions with calculations - Stock SOC medium - Cut small pieces of Alu foil for resealing of plates - Aliquot expression vectors in PCR stripes - Dilute primer for PCR in 1,5ml eppi - Dilute and aliquot primer for sequencing in PCR stripes - after LR we are sequencing with forward and reverse primer at the same time - after running the sequencing pipeline you will see for which ORF you need to design primers for full length sequencing - Forward primer: 3 211 - For N-terminal NL-fusion: primer #44 NanoLuc-398fwd (GAACGGCAACAAAATTATCGAC) - For N-terminal mCit-fusion: primer #47 mCitrine-547fwd (AGCAGAATACGCCCATCG) - Reverse primer: - If there is no C-terminal fusion: primer #51 pEXP_rev (GGCAACTAGAAGGCACAGTC) Overview of the plates Step Plate label (example) Type of plate Inoculation plate CL01GDEh_01 Costar plate (#3799) PCR plate CL01PCR_01 PCR plate (#781352) Gel plate CL01Gel_01 PCR plate (#781352) LR plate CL01LR_01.1*, CL01LR_1.2* PCR plate (#781352) Transformation plate CL01TR_01.1*, CL01TR_1.2* PCR plate (#781352) Agar plate CL01TR_01.1a / CL01TR_01.1b, Qtray (#MLDVX6029) CL01TR_01.2a / CL01TR_01.2b Deepwell inoc plate CL01GExDW_01.1a / CL01GExDW_01.1b Deepwell plate (#E2896- CL01GExDW_01.2a / CL01GExDW_01.2b 2110) Glycerolstock plate CL01GEx_01.1, CL01GEx_01.2 Costar plate (#3799) MiniPrep elution CL01GExMP_01.1, CL01GExMP_01.2 Costar plate (#3799) DNA dilution plate CL01GExDil_01.1, CL01GExDil_01.2 PCR plate (#781352) DNA Database plate CL01GExSt_01.1, CL01GExSt_01.2 PCR plate (#781352) DNA sequencing plate CL01GExSF_01.1, CL01GExSF_1.2 PCR plate (#781352) (forward and reverse) CL01GExSR_01.1, CL01GExSR_1.2 All plates should be labeled at the left side (having A1 top left corner) *where applicable plates labelled x.1 contain NL fusion constructs, plates labelled x.2 contain mCit fusion constructs 4 212 Day 1 Picking and inoculation of ORFs (~1.5h just picking) Checklist: ● 70% EtOH and tissues - to sterilize the plates from the 
Orfeome collection ● 50 mL falcon tube - to prepare mix of LB medium with corresponding antibiotic ● Tips - for picking up the ORF from the collection plate ● Alu foil cut in small pieces (size of one well) to close the opened wells with ORF ● 50 mL serological pipette ● Pipette boy ● 100 µL pipette ● Pipette tips ● 96-well, costar plate (Corning,#3799) - for the inoculation of the ORFs ● Adhesive gas permeable seals (order no.: AB-0718 from Thermo Scientific) ● Multichannel pipette with tips ● LB medium (200 µL per well) ● Antibiotic (Kanamycin or Streptomycin - 0,2µl/well) ● Dry ice box - for keeping the plates from the ORFeome collection while picking the ORFs Do the following steps with one, better two additional people as helpers! steps: 1. Use aseptic bench working technique 2. Label the inoculation plate (CL01GDEh_01) 3. Prepare a master mix of LB medium and antibiotic in a 50ml Falcon and vortex a. 200µl LB medium/well b. Pay attention to which ORF needs which antibiotics c. 1:1000 mixing of antibiotic to LB medium (i.e. 1µL into 1000µL of LB medium) 4. Prepare the reservoir for the LB medium 5. Pour the antibiotic LB mix in the reservoir 6. Use the 300µl multichannel pipette to distribute 200µL of antibiotic LB mix to each well in the 96-well plate 7. Take out the box with dry ice and put the first 7 plates with the needed ORFs on it a. make small stacks to keep cold 8. Work as fast as possible on dry ice here (working with 3 people simplifies the process, 1. person is taking the plates out, 2. person is picking the ORFs, 3. person is controlling) a. Disinfect the alu foil of the orfeome plate b. With the tip/toothpick make hole in the selected well c. Take another tip and scratch the ORF d. Then put it into the well of the inoculation plate with LB/Antibiotic medium, stir for a few seconds and discard the tip e. Immediately close the hole with the pre-cut alu foil pieces 9. Repeat steps 8b - 8e for each well to pick from a plate 10. 
Move to the next plate until done with the first batch then go back to step 7 11. Seal the inoculation plate with the air permeable adhesive seal 12. Incubate the plate overnight at 37 ˚C, 190rpm, a. cover with a paper box (to reduce evaporation) b. Incubator in the Niehrs lab 5 213 Day 2 PCR, Glycerolstock, E-Gel and LR reaction Checklist PCR: ● 96-well skirted PCR plate - for PCR reactions ● 50 mL falcon tube - to prepare PCR master mix, 50ml because of multipette ● Alu foil - to cover the glycerolstock plate ● PCR foil to cover the PCR plate ● Multichannel pipette ● Multipette and combitips 1ml ● 100 µL (yellow) integra pipette tips special for a digital multichannel pipette ● 10 µL (pink) integra pipette tips special for a digital multichannel pipette ● 10ml reservoir - to pipette the reaction and transfer to the PCR plate ● Ice block - to keep PCR components in cold ● 40% glycerol stock ( = 10 mL) ● PCR components ● PCR plate containing inoculated ORFs - for PCR ● E-Gel Checklist for E-Gel: ● PCR plate ● Compitip advanced 2,5ml ● 96-well E-Gel ● 96 gel loading buffer (Homemade, recipe:10mM Tris-HCl, 1mM EDTA, 0,005% bromophenol blue) ● DNA marker E-Gel ● 50 µL manual multichannel pipette for loading the gel ● 200 µL tips ● BioRad Detection Machine Checklist LR reaction: ● Cold PCR block: thermomix block keep it in the cold ● 2 PCR plates for mCit and NL fusions ● PCR plate (CL01PCR_01) with PCR products ● DNA for expression vectors (KL_11 & KL_247), diluted to 200 ng/µl, aliquoted to PCR stripes with 20µl each ● Ice box ● Autoclaved water ● 12.5 µL multichannel pipette (Tick) ● 125 µL multichannel pipette (Trick) ● 100 µL (yellow) integra pipette tips special for a digital multichannel pipette ● 10 µL (pink) integra pipette tips special for a digital multichannel pipette ● Rack for eppi tubes for LR clonase (the distance of the small racks work with the digital multichannel pipette) 6 214 PCR PCR program: Temperature Time Repeat Step 98˚C 30s 1x Initial 
denaturation 98˚C 10s 30x Denaturation 55˚C 30s 30x Primer annealing 72˚C 3min 30x Extension 72˚C 5min 1x Final extension 16°C U 1x Hold Master Mix PCR components Per 1 reaction (=1 well) Per 100 reactions Primer #48 pENT-F 10µM 2.5 µL 250 µL Primer #49 pENT-R 10µM 2.5 µL 250 µL dNTPs (10mM each dNTP) 1 µL 100 µL 10x High fidelity polymerase 5 µL 500 µL buffer High fidelity DNA polymerase 0.55 µL 55 µL H2O 34.45 µL 3445 µL (= 5x 689 µL) Steps PCR: 1. Label the PCR plate (CL01PCR_01) 2. Once the PCR components start to thaw. Vortex each PCR reagent 3. Prepare a master mix of all PCR components (see table) a. In 50mL falcon tube 4. Pipette 46 µl of the master mix in each well of the PCR plate a. Using the multipette and combitip 1ml (on ice/cold block) 5. One well should be used as control (master mix without ORF) 6. Remove the airpore seal of the inoculation plate 7. Close the inoculation plate with aluminum foil 8. Vortex the inoculation plate 9. Carefully remove aluminum foil 10. Transfer 4µL of the inoculated ORF culture to the PCR plate a. With the manual 10µL multichannel pipette b. The ORF layout is the same c. Always use new tips 11. Close the PCR plate with PCR foil 12. Vortex the plate briefly 13. Centrifuge briefly 7 215 14. Run the PCR ( ~3 hours) Steps glycerolstock: 1. Check two wells how much bacteria culture is left 2. Then remove a “certain” amount to have 100µl of bacteria culture left in the inoculation plate 3. Add 100 µL of sterile 40% glycerol to each well of the inoculation plate (1:1 ratio) 4. Close the plate with alu foil 5. shake 45 sec at 800 rpm on the thermomixer 6. Store at -80°C (rack 8) Validation of the PCR product with E-gel - Info: - PCR products can be stored at 4°C for 48h, for longer time freeze PCR products - Document all wells that do not look ok on gel -> this info needs to go into MySQL table, send info to bioinformatician Steps: 1. Label the E-Gel plate (i.e. CL01Gel_01) 2. 
Pipette 25 µl of blue 96 gel loading buffer in the E-Gel plate a. Using the multipette and 2,5ml Combitip b. Can be done while PCR is running 3. Add 6 µl of PCR product to each well a. Using the 10µl multichannel pipette 4. Install 96 well E-gel to the motherbase 5. Load 20µl PCR/buffer mix to each well a. Using the 50µl multichannel pipette 6. Load 20µl of E-Gel 96 High range DNA marker 7. All empty wells must also be filled with 20µl a. With buffer or loading dye 8. Insert the plug into the socket 9. Run gel for 12 min a. Program EG 10. Take picture with GelDoc Station 11. Analyze gel picture with the E-Editor 2.0 software a. On the desktop PC in the technical room b. Realign the bands and save it in your cloning project folder c. The software is pretty self-explanatory and has a manual available under the help button. Ask Katja for help. 12. Decide if PCR was successful and whether it is worth proceeding 13. Document all wells that did not look ok a. Add this information to the MySQL table entry_clone_info LR reaction Components Per 1 reaction (1 well) H2O 5,5 µL Destination vector (200ng/µl) 1 µL PCR product 1 µL 4x LR clonase 2,5 µL 8 216 1. Label the LR plates (CL01LR_01.1, CL01LR_1.2) 2. Decide if you want to include controls for the LR reaction a. I.e. no clone (only water), no LR clonase 3. Take out the destination vectors and put it on the bench to thaw a. Prepare a PCR stripe with 8x 20µl of KL_11 (NL-GW) b. Prepare a PCR stripe with 8x 20 µl KL_247 (mCit-His3C-GW) 4. In a clean reservoir pour ~ 2 mL of autoclaved water 5. Add 5,5µl water into each well a. Use 125 µL multichannel pipette (Trick) b. Aspirate 66 µl water and distribute 12x 5.5µl c. Repeat for the second plate. 6. Add 1µl of PCR product a. Use 12.5 µL multichannel pipette (Tick) b. Aspirate 2 µL of PCR products c. load 1 µl to each PCR plate for LR reaction. 7. Add 1µl of the NL expression vector KL_11 a. Into the plates, which will contain NL fusions in the end (i.e. CL01LR_01.1) b. 
Use multichannel pipette
8. Add 1 µL of the mCit expression vector KL_247
   a. Into the plates, which will contain mCit fusions in the end (i.e. CL01LR_01.2)
   b. Use multichannel pipette
9. Take 8 tubes of 4x LR clonase out and put them on a rack
   a. Info: if the LR clonase is still very cold, it is difficult to pipette: LR clonase sticks to the outside of the tip and the resuspension step gets difficult
   b. Better: for 1 plate you will need 8 tubes of LR clonase (each tube contains 40µl). Vortex the tubes, centrifuge the LR clonase, wait until the LR clonase is easy to pipette (~2min) and then start. The leftover LR clonase should be discarded
10. Vortex each LR clonase tube twice for 2 seconds and put it back on the rack
11. Add 2.5 µL LR clonase
   a. Use 12.5 µL multichannel pipette (Tick)
   b. Program HTP_LR
   c. Resuspend and discard the tips
12. Repeat until all wells of both plates have received LR clonase
13. Cover LR plates with alu foil
14. Incubate overnight at 25°C (in PCR machine)
15. Close the plate with the PCR reaction with alu foil and store the PCR products at -20°C

Alternative:
1. Prepare all needed components for LR
2. Prepare a master mix of your expression vector and water (can be done several days before)
3. Aliquot 6.5µl of water/expression vector mix into both LR plates
   a. With the multipette
4. Add 1µl of PCR product
   a. With 10µl multichannel pipette
5. Take 8 tubes of 4x LR clonase
6. Vortex each LR clonase tube twice for 2 seconds and put it back on a rack
7. Add 2.5 µL LR clonase
   a. Use 12.5 µL multichannel pipette (Tick)
   b. Program HTP_LR
   c. Resuspend and discard the tips
8. Repeat until each well of both plates has received LR clonase.
9. Cover LR plates with alu foil
10. Incubate LR plates overnight at 25°C (PCR machine, Thermoblock)
11.
Close the plate with the PCR reaction with alu foil and store the PCR at -20°C

Stop point: LR plates could be stored at -20°C until processing with transformation

Preparing square agar plate (should be done at least the day before needed)
Check list
● LB Agar (250ml/square plate)
● Square plates and divider
● Microwave

1. Take LB-Agar (250ml) from IMB media lab
2. Use aseptic bench working technique
3. Heat Agar in the microwave (program: soften/melt, 2 = melt dark chocolate, 100 = 5.5 min; after 3x the agar is liquid)
4. Let it cool down (i.e. add a clean stirrer to the agar and place the bottle on the magnetic stirrer, adjust the temperature to 50°C and 250rpm)
5. Add antibiotic (250µl) when the agar is cooled down sufficiently and you are ready to pour the plates
6. Take out the plate from the plastic protection
7. Add agar to the plate (pop bubbles with a pipette tip or move them to the side)
8. Take out the grid from the plastic protection
9. Add the grid in the square plate with agar --> the grid does not stay down - weigh down the grid with something (i.e. a 250ml bottle)
10. Let the agar solidify
11. Store at 4°C (upside down)

Day 3 Proteinase K digestion, Transformation and plating
Check list
● Proteinase K (2µg/µl)
● Ice box for 2 PCR plates with LR reaction
● Ice box for 2 PCR plates with competent DH5α cells
● Thermoblock/PCR machine for heat shock
● Thermoblock/PCR machine for 2 plates for recovery step
● 2 racks for 2 PCR plates
● 10ml reservoir to pour SOC medium
● SOC medium (8ml/plate)
● DH5α (2 PCR plates with aliquots of 30 µL)
● Square plates with agar (48 wells, 4 plates needed for 1 inoculation plate)

Proteinase K digestion and Transformation: (~3h)
1. Take out SOC medium (for one well = 80 µL, for 1 plate = 8 mL) and let it thaw at room temperature
   a. 50ml takes a long time to thaw, could be placed at 4°C the afternoon before
2. Use aseptic bench working technique
3. Take out DH5α from -80°C
   a. Put them immediately on the ice
   b.
Let them thaw
   c. Label the plate (i.e. CL01TR_01.1, CL01TR_01.2)
4. Take out the LR plates from the incubation
5. Centrifuge the LR plates briefly
6. Add 1µl of Proteinase K into all wells
   a. Take out 8 tubes
   b. Use multichannel pipette
7. Vortex briefly
8. Centrifuge briefly
9. Incubate at 37°C for 10min
10. Transfer the plates on ice
11. Transfer 10µL of each LR reaction into the DH5α plate
   a. Use a multichannel pipette
   b. Difficult to get the whole 10µl out (~7µl)
   c. No resuspension, no vortex when adding the LR reaction into the DH5α
   d. Close the plate with alu foil
12. Incubate for 30 minutes on ice (bacteria with LR product)
13. Meanwhile: set the thermoblock to 42°C for the heat shock and set thermoblock for 2 plates to 37°C
14. Heat shock for 45sec at 42°C
   a. One plate after the other
15. Immediately move the plate on ice for 2 minutes
16. Pour SOC medium to the reservoir
17. Transfer 80 µL of SOC medium to each well
   a. Using a multichannel pipette
   b. Discard tips after each column
18. Transfer the plate to thermoblock/PCR machine set to 37°C
19. Incubate for 1 hour shaking at 300rpm (it also works without shaking)
20. Repeat the heat shock for all PCR plates with transformed cells
21. After 1 hour of incubation, proceed with plating

Plating bacteria (~ 1 h)
1. Take the agar plates out of 4°C and let them dry (latest after the heat shock)
2. Label the plates (i.e. CL01TR_01.1a / CL01TR_01.1b & CL01TR_01.2a / CL01TR_01.2b)
   a. The square plates have 48 wells → 2 square plates for 1x 96 well plate needed
3. Place the agar plate on a paper grid with numbers and letters
   a. You will know better which grid field corresponds to which plate field
4. Add the glass beads to the grid fields (between 4-12 glass beads/field is ok)
5. Add 70µl of the transformation to each field
   a. If you are slow it is better to work column by column
      i. Add glass beads, add bacteria, shake
      ii. You can use the lid as protection so that the glass beads don’t “jump” into the other column
6.
Shake the plate
   a. Hold and shake the plates with both hands
   b. Check that all beads in all wells are moving
   c. Do not shake too long
7. Press the lid on the agar plate and turn the plate over
8. Take the bottom of the agar plate away
9. Transfer the glass beads in a big glass beaker
10. Clean the lid with 70% Ethanol
11. Cover the agar plate with the lid
12. Repeat steps 4-11 for all plates / columns
13. Incubate overnight at 37°C upside down
14. Add 70% ethanol to the glass beads, wash with water, transfer into a dry glass bottle and send them for autoclaving

Day 4 Colony picking and inoculation (~ 2-3 h)
Check list
● LB medium (1.5 ml per well, 150ml per plate)
● Toothpicks for picking
● Deepwell plates (Deepwell plates that are round on top and bottom, Starlab # E2896-2110)
● 1250µl digital multichannel pipette (Track) with tips

Steps:
The steps are best done with one or two additional people checking that the right well is picked and put into the correct well in the deepwell plate
1. The experimenter takes the agar plates and uses the computer script to enter which wells have colonies (i.e. A1 - yes, A2 - no)
   a. Name of the script: script_B_picking_script.bat
   b. Can be run on lab desktop PC or via remote desktop from personal computer
   c. Takes ~ 1 hour
   d. Possible break point, leave the agar plates at 4°C over the weekend
2. Use the script that makes the rearray for your experiment to create a new plate layout
   a. Name of the script:
   b. Make sure that the rearray information is saved in the expr_clone_info MySQL DB table
3. Use aseptic bench working technique
4. Label the deepwell plates (i.e. CL01GExDW_01.1a / CL01GExDW_01.1b; CL01GExDW_01.2a / CL01GExDW_01.2b)
5. Fill 1.5 ml LB-Medium in the wells
   a. Use the 1250µl digital multichannel pipette
6. Pick one colony from the first well
   a. Using a toothpick
   b.
If you want to prepare 2 identical plates: stir in the corresponding well of the deepwell plate for a few seconds, then pick the same colony with the same toothpick into the second pick plate
   c. With the new 96 MiniPrep Kit you should get enough DNA with one deepwell plate
   d. You can leave the toothpick in the deepwell until you are done with one column
7. Continue with the next well
8. Repeat until all clones are picked
9. Cover the deepwell plate with breathable foil
10. Incubate @ 37˚C at 700rpm in the incumixer for 24h
   a. These conditions are important for a successful MiniPrep

Day 5 Glycerol stock, Miniprep (~ 2 hours per plate)
Prepare the glycerol stock before the miniprep!
Material needed:
● 40% glycerol, sterile (50µl per well, 5ml per plate)
● Costar plates for glycerol stock
● Alu foil to cover glycerol stock
● 1250µl digital multichannel pipette (Track) with tips
● Qiagen 96 well Miniprep kit
● Plate inserts for big centrifuge
● Big glass beaker
● Multipette with 5ml tips
● Alu foil for resuspension
● Costar plate for elution
● Vacuum (pump set to 300 mbar)
● Waste tray = square reservoir (can be autoclaved)

Steps:
1. Work under aseptic bench working conditions
2. Get deepwell plates from the incubator
3. Prepare the glycerol stock plates (i.e. CL01GEx_01.1 & CL01GEx_01.2)
   a. By adding 50 µl of 40% glycerol to all required wells of a new costar plate
   b. Check if the bacteria are in suspension - if not, vortex (cover with alu or plastic foil before vortexing)
   c. Add 50 µl of the incubated bacteria culture to the corresponding wells and close the plate with alu foil.
   d. Shake 30sec at 750rpm on the Thermomixer
   e. Freeze @ -80˚C
4. Centrifuge the deepwell plate @ 2100 xg for 5 min.
5. During centrifugation: Prepare the Qiavac Multiwell with Turbo filter 96 plate and S-Block QIAvac 96:
   a. Seal unused wells with additional tape
   b.
Note for those using the unused wells from a used plate: Because there are many vacuum steps in the procedure and the air flows better through previously-used wells (now empty) than through the wells that are in use now, make sure that you tape the previously-used wells so that the airflow passes through the wells that you want. Otherwise, the air will tend to flow through the previously-used wells and reduce the efficacy of vacuum suction.
6. Pour out medium into beaker, tap dry the plate surface with paper towel
   a. If you have 2 identical deepwell plates: add the content of the second deepwell plate (CL01GExDW_01.1b) to the corresponding wells of the first plate (using digital multichannel to reduce the number of pipetting steps). Centrifuge @ 2100rpm for 5 min.
   b. Pour out the medium into a beaker
   c. Tap the plate on a paper towel to empty completely
7. Add 300 µl of buffer P1 to each well
   a. Using the multipette or digital multichannel
8. Close plate with alu foil
9. Vortex to completely resuspend the bacteria
10. Remove foil
11. Add 300 µl of buffer P2 to each well
   a. Using the multipette or digital multichannel
12. Close the plate with the plastic foil from the kit
13. Invert 6-8 times
14. Incubate 5min at room temperature.
   a. Do not let the lysis take longer than 5 min
   b. Count the time from the first well having received the lysis buffer
15. Remove foil
16. Tap dry the plate top
17. Add 300 µl of buffer S3 to each well
   a. Using the multipette or multichannel
18. Close the plate with the plastic foil from the kit
19. Invert 6-8 times
20. Remove foil and tap dry the plate top
21. Transfer content of each well in the corresponding well in the Turbo Filter 96 plate
   a. Using the digital multichannel (set to 1000µl)
22. Apply vacuum
   a. Pump set to 300 mbar
   b. To suck liquid in the S-block
   c. Make sure all liquid has passed the filter plate
23. Close vacuum
24. Remove filter-plate from assembly
25. Discard the filter plate
26. Remove S-Block
   a. DNA is here
27.
Install waste tray in the assembly
28. Install Plasmid Plus 96 plate in the assembly
29. Seal and label unused wells with tape
30. Add 300 µl of buffer BB to each well in S-Block
   a. Using the multipette or digital multichannel
31. Close the S-Block with the plastic foil from the kit
32. Invert 1-3 times
33. Remove foil
34. Tap dry the S-Block on top
35. Transfer content of each well in the corresponding well in the Plasmid Plus 96 plate
   a. Using the digital multichannel (set to 1250µl)
36. Apply vacuum
   a. Pump set to 300 mbar
   b. To suck liquid in the waste tray
   c. Make sure all liquid has passed the plate
37. Close vacuum
38. Transfer 900 µl of buffer PE in each well in the Plasmid Plus plate
   a. Using the digital multichannel
39. Apply vacuum
   a. Pump set to 300 mbar
   b. To suck liquid in the waste tray
   c. Make sure all liquid has passed the plate
40. Close vacuum
41. Empty waste tray
42. Pat dry the nozzles of the Plasmid Plus plate until no liquid can be seen on the paper towel
43. Put back the waste tray and assemble
44. Apply vacuum for 10 min
   a. Pump set to 300 mbar
   b. To dry the filter
45. Close vacuum
46. Lift the top plate from the base - but not the Plasmid Plus plate from the top plate!
47. Vigorously tap the top plate on a stack of absorbent paper until no more drops come out
   a. Blot the nozzles of the Plasmid Plus plate with clean absorbent paper
48. Remove the waste tray
49. Place 2 “old” costar plates (one with lid and one without lid) in the assembly
   a. To reach the required height, the nozzles should reach the wells of the costar plate
50. Place your elution plate (i.e. CL01GExMP_01.1) in the assembly and reassemble
51. Add 70µl of water/EB-buffer to the center of each well of the Plasmid Plus plate
   a. Using a manual multichannel
52. Let stand for 3 min
53. Apply vacuum for 1 min
54. Close vacuum
55. Disassemble the Qiavac Multiwell to get your DNA

Stop point. DNA can be frozen @ -20˚C and stored.

Nanodrop measurement
1.
Using part 1 of script C, create a template for the NanoPhotometer and save it in the NanoPhotometer folder on the group drive (i.e. /imb-luckgr/NanoPhotometer/HTP_data/CL100/)
2. Thaw plates with DNA (i.e. CL01GExMP_01.1 & CL01GExMP_01.2)
3. Centrifuge 3min @ 3000g in the big centrifuge
4. Load the correct measuring template to the NanoPhotometer
   a. On the NanoPhotometer, click ‘Nucleic Acid’, then swipe right and click the top right button that looks like a barcode. Click ‘Sample’ and then click ‘Import’. Select ‘Network_Groupdrive’ to find the NanoPhotometer folder in the group drive mentioned in point 1. There you can find your measuring templates and load them into the NanoPhotometer for measurement
5. Measure the DNA concentration
6. Save the data to the group drive in the corresponding folder
   a. Save the measurement in the same folder so that you can access it through the group drive too
   b. If the folder ‘Network_Groupdrive’ does not appear on the NanoPhotometer, try restarting it
7. If needed you can concentrate your DNA:
   a. Place the plate (without lid) in the desiccator
   b. Turn on vacuum and let evaporate until the desired volume/concentration is reached
   c. ~36h for 20-25µl reduction in volume
   d. Ask Christian for help, if needed

Day 6 DNA dilution and sequencing
The first sequencing is done with both backbone primers (forward and reverse). Full coverage sequencing for inserts is done after the results come back, for those clones that need it.
1. Use part 2 of script C to calculate the dilutions needed
2. Make sure that the measured DNA concentrations are uploaded to the expr_clone_info MySQL DB table
3. Prepare the dilutions (i.e. CL01GExDil_01.1 & CL01GExDil_01.2)
   a. According to the template you created
   b. DNA concentration should be around 100 ng/µl
4. For the expression test you will need to dilute the NL plate once more
   a. Option 1. 1:10 (you take 1µl for expression test)
   b. Option 2. 1:25 (you take 2µl for expression test)

DNA stock
1.
Label PCR plates with labels for DNA stock (i.e. CL01GExSt_01.1 & CL01GExSt_01.2)
2. Pipette 10µl of the undiluted DNA (CL01GExMP_01.1 & CL01GExMP_01.2) to the stock plates
3. Close plates with alu foil and give to Mareen for storage

Sequencing
Each plate has to be submitted individually to StarSeq. You will get a zip file containing one .ab1 and one .seq file for each sample submitted in the plate. You can use the plate barcodes for plates with more than 78 samples or you can submit individual barcodes for plates with <78 samples. If you are sending a plate to StarSeq, you have to have at least 48 samples on the plate.
Steps:
1. Prepare an Excel file (one for each sequencing run) with the file names of your sequencing samples in 96-well format. Suggested file names: e.g. mCit-[ORF ID]-F for the mCit construct and forward read. The layout should correspond to what you have generated after picking the colonies (i.e. CL01GExh_01.1)
2. Label the PCR plates for sequencing (i.e. CL01GExSF_01.1 & CL01GExSF_01.2; CL01GExSR_01.1 & CL01GExSR_01.2).
3. Add 1µl of the corresponding primer to the plates
   a. primer # 44 NanoLuc-398fwd - for N-terminal NL fusion
   b. primer # 47 mCitrine-547fwd for N-terminal mCit fusion
   c. primer # 51 pEXP_rev for no C-terminal fusion
   d. Using the multipette and combitip 1ml
   e. Alternatively, one can also aliquot the primers into PCR tubes and use digital multichannel to distribute the primers into the wells
4. Add 6 µl of the diluted DNA to the sequencing plate
   a. I.e. from CL01GExDil_01.1 & CL01GExDil_01.2
   b. Using manual multichannel pipette
5. Close the sequencing plate using the alu foil
6. Order the sequencing on the StarSeq webpage
   a. Use the Excel file created in step 1 to copy paste the plate layout into their web form
7. Pack plate together with paperwork in a padded envelope
   a. To avoid the foil getting pierced
   b. Submit each plate as an individual sequencing run
   c.
When submitting multiple plates, results will likely not all come back by the next morning but over the next 24-36h
8. Process the sequencing results with the Sanger seq processing pipeline
   a. Instructions can be found in labfolder under templates
9. Make sure to update results accordingly in the expr_clone_info MySQL DB table

Day 7 Transfection Expression test
CS notes:
- I did get 6x10^6 HEK293 cells out of 1 T-25 flask lately.
- I found it more convenient to do the triplicates in separate plates.
- I did not mix DNA with Lipofectamine before, only when I put the DNA to the final incubation plate.
- While I did NL and mCit the same day, I pipetted them separately as it is very hard to handle 6 plates at the same time.
- The volumes I put here (most of the time) depend on your transfection ratio and DNA concentrations used. I did NL-constructs 4ng/µl, mCit 100ng/µl, pcDNA 200ng/µl; 2:50 ratio

Steps:
1. Prepare the layout of your plate with the controls
   a. Controls: NL-stop + pcDNA, mCit-stop + pcDNA, well with only pcDNA, well with only cells
   b. The controls you put depend on your experiment and the space you have on the plate. If you have doubts, talk to Katja
2. Prepare the DNA for your controls, PA-mCit-Stop, NL-Stop and pcDNA3.1
   a. Can be prepared in PCR strips - then you can later use the multichannel pipette
3. Prepare an additional dilution of the NL-constructs to 4ng/µl (if you haven’t already)
4. Take a PCR plate
5. Add the pcDNA (3µl) to the wells first.
   a. Using multipette or multichannel pipette
   b. Doing the pcDNA first allows you to do everything with one tip. Try to get the DNA to the bottom of the plate.
6. Add the NL-Stop (for the mCit-constructs) or mCit-Stop (for the NL-constructs)
   a. 2µl to the wells
   b. Using the multipette or multichannel pipette
   c. For multipette: using one tip is possible for this, as the only possible contamination would be with pcDNA, which can be avoided by putting the DNA at the wall of the wells away from the pcDNA
7.
Add your diluted construct DNA (mCit or NL)
   a. 2µl if you are using the DNA concentrations written on top
   b. Using the multichannel pipette
8. Add the DNA for the controls to the wells
9. Tap plate to mix all DNA in the bottom of the well
10. Add 100µl Optimem to each well
11. Prepare the Lipofectamine-Optimem mixture in a 15ml Falcon tube
   a. You do not need to do quadruples here. This saves some Lipofectamine
   b. Example:
      78 wells/plate x 0.5µl Lipo/well x 3 plates = 117µl Lipo
      78 wells/plate x 25µl Optimem/well x 3 plates = 5.85ml Optimem
      Now add some for the reservoir: => 120µl Lipofectamine + 6ml Optimem
12. Label the plates for incubation (i.e. LuXXXrXX)
13. Add Lipo-Opti mixture to a 10ml reservoir by pipetting
   a. Decanting is suboptimal as it leaves some residual mixture in the Falcon tube
14. Add 25µl Lipo-Opti mixture to each well of the incubation plates (white 96well plate for LuTHy)
   a. Using the multichannel pipette
15. Take out cells, wash and add trypsin
16. While the trypsinization is ongoing:
   a. Transfer DNA-Opti mixture in the incubation plates
   b. You can use a digital multichannel (Trick) to speed it up (aspirate 75µl, dispense 3x25µl)
   c. Predispense step is needed to get accurate amounts for the first dispense of the multi-dispense
   d. The 20min time limit starts now
17. Quench trypsin, resuspend cells, count cells, centrifuge and adjust concentration to 2.67x10^5 cells/ml in phenol-red free DMEM medium.
18. Decant the cells in a 25ml reservoir
19. Add 150µl cell suspension to the plates
   a. Using the digital multichannel
   b. Aspirate 450µl, then dispense 3x150µl, doing the triplicate without changing tips
   c. Use the program called “LUTHY CELLS” in the 1250µl digital multichannel pipette (Track). The program first resuspends the cells multiple times (called ‘Mix’ in the program), and then aspirates 450µl for the repeat dispense of 3x150µl
20. Incubate for 48h
21. Proceed with measurement as usual for LuTHy assay
22.
For the LuTHy processing scripts to be able to process your data, KL numbers have to be generated for all the constructs on your plate.
23. Make sure the KL numbers are generated and saved in the LUCK_DB.Luck_lab_plasmids table along with all available information. Make sure to update the LUCK_DB.Luck_lab_plasmids table according to new sequencing and other experimental results you obtain, i.e. enter whether the plasmid is full length sequenced or partial, add mutation information, and let Katja know if an ORF turned out to be a different ORF and which ORFs need a new ORF ID. Let Katja know about KL numbers that should be deleted because the insert could not be confirmed.

5.1.2 The medium-throughput site-directed mutagenesis

Site-directed mutagenesis (without Kit)

Day 0 Primer design
Criteria for mutagenesis primers:
- Primer length should be 32-36 nt. If it is shorter, the mutation might not be cloned properly!
- GC content of primer should be between 40 and 60%
- Difference in melting temperature between the forward and reverse primers should ideally be less than 5ºC (use NEB Tm calculator: https://tmcalculator.neb.com/#!/main and select Phusion as the product group for the melting temperature of the primer)
- The annealing temperature of the PCR reaction should be set 5ºC lower than the lowest melting temperature among the primers
- The 3' end of the primer should ideally be C or G
- The annealing temperature should be below 70°C if possible

Primer order info, if you have 24 or more primers (IDT company)
If you want to order primers in plates, price-wise: The prices for DNA oligos in tubes or in plates can be seen on our website. The prices are usually a bit lower for oligos in plates; however, one should look at the final cost of the whole order. For example, when ordering in plates there is a minimum of 24 oligos that should occupy the plate. So in the long run, plates are not always the cheaper option.
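The criteria listed under "Criteria for mutagenesis primers" can be checked for a whole primer list before ordering. The following is a minimal sketch, not part of the original protocol: the function and class names are hypothetical, and the melting temperatures are assumed to be looked up separately (for example with the NEB Tm calculator) and passed in as numbers, since the sketch does not compute Tm itself.

```python
from dataclasses import dataclass


@dataclass
class PrimerCheck:
    """Result of checking one primer against the sequence-based criteria."""
    length_ok: bool       # 32-36 nt
    gc_ok: bool           # GC content between 40 and 60 %
    three_prime_ok: bool  # 3' end is C or G
    gc_percent: float


def gc_percent(seq: str) -> float:
    """Return the GC content of a primer in percent."""
    if not seq:
        raise ValueError("empty primer sequence")
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def check_primer(seq: str) -> PrimerCheck:
    """Check length, GC content and 3' end of a single primer."""
    gc = gc_percent(seq)
    seq = seq.upper()
    return PrimerCheck(
        length_ok=32 <= len(seq) <= 36,
        gc_ok=40.0 <= gc <= 60.0,
        three_prime_ok=seq[-1] in "GC",
        gc_percent=gc,
    )


def tm_difference_ok(tm_forward: float, tm_reverse: float,
                     max_diff: float = 5.0) -> bool:
    """True if forward and reverse Tm differ by less than max_diff (in °C)."""
    return abs(tm_forward - tm_reverse) < max_diff


def annealing_temperature(tm_forward: float, tm_reverse: float) -> float:
    """Annealing temperature: 5 °C below the lowest primer Tm."""
    return min(tm_forward, tm_reverse) - 5.0


if __name__ == "__main__":
    # Made-up example sequence and Tm values, for illustration only.
    result = check_primer("ATGC" * 8 + "AG")
    print(result)
    print(tm_difference_ok(68.0, 66.5), annealing_temperature(68.0, 66.5))
```

Keeping Tm as an input rather than computing it avoids disagreeing with the NEB calculator, whose values depend on the selected polymerase and buffer.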
dry or wet primers When ordering in plates you can choose your oligos to be normalized to a certain amount. That can be either as a pellet (dry) or resuspended (wet). In that way you will avoid needing higher volume than the capacity of the well. These can be adjusted in the "Plate Specifications" button when ordering the plate oligo . I would say that there are no pros and cons for primers in pellet or in solution in terms of primer performance, stability and so on. It is more a matter of experimental needs and set up. Some researchers prefer to receive their oligos ready to use whereas others want to resuspend them in a certain buffer or in a specific dilution. When automation and robot handling is included, people prefer having plates than tubes. On the other hand, when having the oligos in plates and manual pipetting is done the chances for contamination or spillage can be higher. Design of a point mutation (non-kit way) - Design forward and reverse primers that overlap at the site of mutation. Try to locate the mutation to be at the middle of the overlapping region so that the mutation is flanked by complementary sequences. The overlapping part (that contains your mutation) should be 20-2215-22nt nt long. - Here is an example to mutate L152E in benchling. To know what codon codes for E, right click on L152 and select ‘Change amino acid’. Remember to change Organism to Homo sapiens. There you can find the codon that codes for the amino acid that you want to mutate to, and the best codon change to achieve that amino acid substitution. Do take into consideration the number of bases that need to be changed for the amino acid substitution and the frequency of codon to ensure optimal mutagenesis 228 - Design a deletion (non kit way) - The forward and reverse primers have to overlap at the overhang so that the synthesized strands can circularize after amplification. The non-overhang region should have ~20 nt and the overhang (overlap) region should have 15 nt. 
- Here is a schematic showing how the primers with overhang should be designed. (The scheme and explanation are retrieved from Takara: https://www.takarabio.com/learning-centers/cloning/applications-and-technical-notes/mutagenesis-with-in-fusion-cloning)

Design a Deletion with Q5 kit
- Design forward and reverse primers that exclude the deletion site.
- Here is the schematic.

Design sequencing primer
- Use primer design tool
- Follow the instructions of the primer designer tool
- In labfolder → “templates” → “instructions” → “How to design primers with PrimerDesigner”

Primer order from Sigma
1. Design the primer
2. Order oligos in solution (water)
   a. The price is the same and it will save you time
3. Import all needed data into the Excel file from Sigma
   a. There are 2 Excel files: 1x for single oligo order and 1x to order oligos in plate
   b. Can be found on the group drive (Primer_AG_Luck → Sigma_order_template or DNAPLATE_96well_8ch_template_sigma)
   c. The Excel sheet to order single oligos is also saved on the intranet (Administration → Purchasing → Oligo Ordering → Sigma Oligo template)
   d. The link to order oligos in plate: [DNA-Oligos in Platten (sigmaaldrich.com)](https://www.sigmaaldrich.com/DE/de/configurators/plate?product=dnaplate&activeLink=sequenceUpload)
   e. Oligos in plate:
      i. Ask for a quote for your order
      ii. After uploading the file you need to set the “scale”, “purification” and “format”
      iii. scale = 0.025; purification = desalt; format = in solution (water)
4. Upload the Excel sheet to the order manager under service agreements
   a. Oligos are ordered every Tuesday and Thursday after 2pm
   b. It might take up to 1 week to receive the oligos; oligos in plate take ~ 2 days longer than oligos in tubes
5. After receiving the oligos you have to dilute them

Primer order from IDT
1. Design the primer
2. Register with IDT
3.
Go to “Products and Service” → “DNA and RNA” → “Custom DNA oligos” → “DNA Oligos” → “Single-stranded DNA” (you can choose between tubes and plates) → press the button “order now”
4. Order primers in tubes:
   a. Enter all required information in the fields
   b. You can use the button “bulk input” if you have several oligos to order
      i. Scale: 25nmole DNA oligo
      ii. Formulation: you can choose “None” = dry or “LabReady (100µM in IDTE, pH 8.0)”
      iii. Purification: Standard desalting
      iv. Sequence:
5. Order primers in plates:
   a. Choose plates
   b. 25nmole → order
   c. Download the Excel sample ordering template (under upload plates)
   d. Fill out the Excel sheet
   e. Upload the Excel sheet
   f. Check the upload
   g. If necessary, make changes
6. Add to order
7. Order the oligos via internet, add the email of the purchase department “einkauf@imb-mainz.de” in the distribution list for order confirmations
8. Enter your IDT order in the Order Manager

Primer dilution in plates
1. Take a new PCR plate
2. Label the plate (i.e. MU01PrDilF_01 and MU01PrDilR_01)
3. Add 90µl water in each well
4. Add 10µl primer in the corresponding well
5. Close the plate
6. Mix/vortex
7. Freeze until needed

Preparation of template DNA 10ng/µl (i.e. MU01TD_01)
1. Take a new PCR plate
2. Label the plate (i.e. MU01TD_01)
3. Add 9µl water in each well
4. Add 1µl template DNA in the corresponding well
   a. Take the template DNA from your diluted MiniPrep 100ng/µl
5. Close the plate
6. Mix/vortex
7.
Freeze until needed

Day 1 PCR, DpnI digestion and E-Gel
Checklist PCR, DpnI digestion and E-Gel:
● 96-well skirted PCR plates (3x)
● 5ml tube (Axygen, # SCT-5ml-S) or 50 mL falcon tube - to prepare PCR master mix, 50ml because of multipette
● PCR foil
● multichannel pipette 10µl
● 10µl pipette tips (4 boxes)
● multichannel pipette 50µl
● 100µl tips
● multipette
● combitips 1ml (to add the PCR Master Mix)
● combitips 0.1ml (to add DpnI)
● 100µl pipette
● 100µl pipette tips
● 1000µl pipette
● 1000µl pipette tips
● ice block - to keep PCR components cold
● PCR machine or Thermomixer
● PCR components
● DpnI
● E-Gel 96 1% Agarose (GP) (invitrogen, # G700801)
● E-Gel 96 High range DNA marker

PCR program:
temperature | time | cycle | step
98˚C | 2min | 1x | initial denaturation
98˚C | 30s | 25x | denaturation
__ __˚C * | 15s | 25x | primer annealing
72˚C | 5min (1min/1kb) | 25x | extension
72˚C | 5min | 1x | final extension
16°C | ∞ | 1x |
* Temperature depends on the primer; try to keep the temperature below 70°C when designing the primer. If the melting temperature of one primer is less than 69°C, then annealing temp = 55°C; if higher, annealing temp = 63°C.

PCR reaction (50µl total)
PCR components | 1x (1 well) | x 100
primer (10µM) | 2.5 µL | 250 µL
primer (10µM) | 2.5 µL | 250 µL
template DNA (10ng) | 1 µL | 100 µL
dNTPs (10mM) | 1 µL | 100 µL
10x HF Buffer | 10 µL | 1000 µL
High fidelity DNA polymerase | 0.5 µL | 55 µL
H2O | 32.5 µL | 3250 µL (= 5x 650 µL)

Master Mix
PCR components | 1x (1 well) | x 100
dNTPs (10mM) | 1 µL | 100 µL
10x HF Buffer | 10 µL | 1000 µL
High fidelity DNA polymerase | 0.55 µL | 55 µL
H2O | 32.5 µL | 3250 µL (= 5x 650 µL)

Steps:
1. Label the PCR plate (i.e. MU01PCR_01)
2. Once the PCR components start to thaw, vortex each PCR reagent
3. Prepare the master mix
   a. In 5ml tube or 50mL falcon tube
4. Pipette 44 µl of the master mix in each well of the PCR plate (on ice/cold block)
   a. 44.5µl is not possible with multipette
   b. Using the multipette and combitip 1ml
5. Add 2.5µl of each primer to the PCR plate
   a. Using the multichannel pipette
   b. Pipette from the primer working solution plate (i.e.
MU01PrDilF_01, MU01PrDilR_01)
6. Add 1µl of purified template DNA (~10ng)
   a. Use multichannel pipette
   b. Pipette from the template DNA plate (i.e. MU01TD_01)
7. One well should be used as control (master mix without ORF)
8. Close the PCR plate with PCR foil
   a. Be sure to close every column and row using the grey plastic “card”
9. Vortex the plate briefly
10. Centrifuge briefly
11. Run the PCR (~3 hours). If the melting temperature of one primer is less than 69°C, then annealing temp = 55°C; if higher, annealing temp = 63°C

DpnI digestion (using commercial DpnI)
Steps:
1. Prepare a new PCR plate with DpnI
   a. Can be done while the PCR is running
   b. Label the plate (i.e. MU01Dpn_01)
   c. Add 2µl DpnI with the multipette to the DpnI plate
      i. The multipette must touch the PCR plate while pipetting to ensure that the 2µl of DpnI enters into each well
2. Add 50µl of PCR product to the plate with DpnI
   a. Using the multichannel pipette 50µl
3. Incubate for 1h at 37°C (PCR machine or thermomixer)
4. Incubate for 20 min at 65°C (to stop the DpnI reaction)

Validation of the PCR product with E-gel
Info:
- PCR products can be stored at 4°C for 48h; for longer storage, freeze the PCR products
- Document all wells that do not look ok on gel
Steps:
1. Label the E-Gel plate (i.e. MU01Gel_01)
2. Pipette 25 µl of blue 96 gel loading buffer in the E-Gel plate
   a. Using the multipette and 2.5ml Combitip
   b. Can be done while PCR is running
3. Add 6 µl of PCR product to each well
   a. Using the 10µl multichannel pipette
4. Install 96 well E-gel to the motherbase
5. Load 20µl PCR/buffer mix to each well
   a. Using the 50µl multichannel pipette
6. Load 20µl of E-Gel 96 High range DNA marker
7. All empty wells must also be filled with 20µl
   a. With buffer or loading dye
8. Insert the plug into the socket
9. Run gel for 12 min
   a. Program EG
10. Take picture with GelDoc Station
11. Analyze gel picture with the E-Editor 2.0 software
   a. On the desktop PC in the technical room
   b.
Realign the bands and save it in your cloning project folder c. The software is pretty self-explanatory and has a manual available under the help button. Ask Katja for help. d. explanation how to do it by john 12. Decide if PCR was successful and whether it is worth proceeding 13. Document all wells that did not look ok a. Add this information to the respective MySQL table Preparing square agar plate (should be done at least the day before needed) Check list LB Agar (250ml/square plate) Square plates and divider Microwave 1. Take LB-Agar (250ml) from IMB media lab 2. Use aseptic bench working technique 3. Heat Agar in the microwave (program: soften/melt, 2= melt dark chocolate, 100 = 5,5 min; after 3x the agar is liquid) 4. Let it cool down (i.e. add a clean stirrer to the agar and place the bottle on the magnetic stirrer, adjust the temperature to 50°C and 250rpm) 5. Add antibiotic (250µl) when the agar is cooled down sufficiently and you are ready to pour the plates 6. Take out the plate from the plastic protection 7. Add agar to the plate (pop bubbles with a pipette tip or move them to the side) 8. Take out the grid from the plastic protection 9. Add the grid in the square plate with agar --> the grid does not stay down - weigh down the grid with something (i.e. a 250ml bottle) 10. Let the agar solidify 11. Store at 4°C (upside down) Day 2 Transformation and plating Checklist Transformation and plating: 48 well square plates with agar and antibiotic (2 plates are needed for 96 well plate) SOC medium (8ml/plate) 10ml reservoir DH5a (30µl) multichannel pipette 50µl 100µl pipette tips multichannel pipette 300µl 300µl pipette tips 200µl pipette 200µl pipette tips 234 glass beads 70% Ethanol Thermomixer/PCR machine at 42°C and 37°C Ice box Transformation: 1. Take out SOC medium (for one well = 80 µL, for 1 plate = 8 mL) and let it thaw at room temperature a. 50ml takes long time to thaw, could be placed at 4°C the afternoon before 2. 
Use aseptic bench working technique 3. Take out DH5α from -80°C a. Put them immediately on the ice b. Let them thaw c. Label the plate (i.e. MU01_TR01) 4. Take the plate after Dpn1 digestion (i.e. MU01Dpn_01) 5. Transfer the plates on ice 6. Transfer of the digested PCR product into the DH5a plate a. Us3 µL e a multichannel pipette b. No resuspension, no vortex when adding the PCR product into the DH5a c. Close the plate with alu foil 7. Incubate for 30 minutes on ice (bacteria with PCR product) 8. Meanwhile: set the thermoblock to 42°C for the heat shock and set another thermoblock to 37°C 9. 45sec at 42°C (heat shock) 10. Immediately move the plate on ice for 2 minutes 11. Pour SOC medium to the reservoir 12. Transfer 80 µL of SOC medium to each well a. Using a multichannel pipette b. Discard tips after each column 13. Transfer the plate to thermoblock to 37°C 14. Incubate for 1 hour shaking at 300 rpm (no shaking is also working) 15. After 1 hour of incubation, proceed with plating Plating bacteria (~ 1 h) 1. Take the agar plates out of 4°C and let them dry (latest after the heat shock) 2. Label the plates (i.e. MU01_TR_01a, MU01TR_01b) a. The square plates have 48 wells → 2 square plates for 1x 96 well plate needed 3. Place the agar plate on a paper grid with numbers and letters a. You will know better which grid field corresponds to which plate field 4. Add the glass beads to the grid fields (between 4-12 glass beads/ field is ok) 5. Add 70µl of the transformation to each field a. 70µl needs a bit longer to dry - do not turn immediately after shaking b. If you are slow it is better to work column by column i. Add glass beads, add bacteria, shake ii. You can use the lid as protection that the glass beads don’t “jump” in the other column 6. Shake the plate a. Hold and shake the plates with both hands b. Check that all beads in all wells are moving c. Do not shake too long 7. Press the lid on the agar plate and turn the plate over 8. 
Take the bottom of the agar plate away 9. Transfer the glass beads in a big glass beaker 10. Clean the lid with 70% Ethanol 11. Cover the agar plate with the lid 12. Repeat steps 4-11 for all plates / columns 235 13. Incubate overnight at 37°C upside down 14. Add 70% ethanol to the glass beads, wash with water, transfer into a dry glass bottle and send them for autoclaving Day 3 Colony picking and inoculation (~ 2-3 h) Check list LB medium (1,5 ml per well, 150ml per plate) Toothpicks for picking Deepwell plates (Deepwell plates that are round on top and bottom, Starlab # E2896-2110) 1250µl digital multichannel pipette (Track) with tips Steps: The steps are best done with one or two additional people checking that the right well is picked and put into the correct well in the deepwell plate 1. Experimental person takes agar plates and uses computer script and enter which well has colonies (i.e. A1 - yes, A2 - no) a. Name of the script: script_B_picking_script.bat b. Can be run on lab desktop PC or via remote desktop from personal computer c. Takes ~ 1 hour d. possible break point, leave the agar plates at 4°C over the weekend 2. Use the script that makes the rearray for your experiment to create a new plate layout a. Name of the script: b. Make sure that the rearray information is saved in respective MySQL table 3. Use aseptic bench working technique 4. Label the deepwell plates (i.e.MU01DW_01) 5. Fill 1,5 ml LB-Medium in the wells a. Use the 1250µl digital multichannel pipette 6. Pick one colony from the first well a. Using a toothpick b. If you want to prepare 2 identical plates: stir in the corresponding well of the deep-well for a few seconds, then pick the same colony with the same toothpick into the second pick plate c. With the new 96 MiniPrep Kit you should get enough DNA with one deepwell plate d. You can leave the toothpick in the deepwell until you are done with one column 7. Continue with the next well 8. Repeat until all clones are picked 9. 
Cover the deepwell plate with breathable foil 10. Incubate @ 37˚C at 700rpm in the incumixer for 24h a. This conditions are important for successful MiniPrep Day 4 96 well Miniprep and Nanodrop measurement for the MiniPrep please use the protocol “Miniprep_96well_plate” in labfolder Day 5 DNA dilution and sanger sequencing 236 5.1.3 Figures Expression profiles Figure 5.1: Expression of wild-type proteins and mutants. (A) The expres- sion levels of wild-type (WT), mutants, and patient variants fused to NanoLuc were measured. The x-axis represents the names of the wild-type, mutants, and variants, while the y-axis indicates the luminescence intensity for each protein. Each protein was co-expressed with an empty mCit-control to verify expression. (B) The expres- sion levels of wild-type (WT), mutants, and patient variants fused to mCit were assessed. The x-axis represents the names of the wild-type, mutants, and variants, while the y-axis shows the fluorescence intensity for each protein. To verify expres- sion, each protein was co-expressed with an empty NL control. 237 Figure 5.2: Expression of wild-type proteins and mutants protein pairs during BRET saturation assay. (A) The bar plots indicate the luminescence intensity for NL-CTBP1 wild-type and mutants, and the fluorescence intensity for mCit fused partners wild-type and mutants as well. (B) The bar plots indicate the luminescence intensity for NL-WWOX wild-type and mutants, and the fluorescence intensity for mCit fused partners wild-type and mutants. 238 Figure 5.3: The validation of predicted interface of WWOX-SNRPC in- teraction and the variant effect on this ppi(A) Schematic representation of the WWOX-SNRPC interaction and putative interface. The protein containing the predicted interacting motif is shown in green, and the domain-containing protein is in grey. The question mark indicates that the AF-MM fragmentation approach was used to predict the potential interface. 
(B) AF-MM predicted interface structural models: (Bi) The WWOX WW domain (grey) with highlighted mutated residues (blue) for domain validation and the motif (green). (Bii) The same predicted structure illustrating pathogenic (red) and VUS (grey) variants in the WWOX WW domain. (C-D) BRET saturation assay data and expression profiles: (C) BRET saturation curves showing the effects of WW domain mutants (see legend), motif deletion, and N-terminal truncation of SNRPC on binding affinity. The effects of pathogenic (red) and VUS (grey) variants on the interaction are also shown. (D) Expression profiles of wild-type and mutant interactions, with color coding corresponding to panel (C). (E-F) Validation of the WWOX-CSNK2B interaction: (E) BRET saturation curves showing the effects of WW domain mutants (see legend) and pathogenic (red) and VUS (grey) variants on the interaction with CSNK2B. (F) Expression profiles of wild-type and mutant interactions in the WWOX-CSNK2B interaction, color-coded as in panel (E).

Figure 5.4: Expression of wild-type and mutant protein pairs during BRET saturation assay. (A) The bar plots indicate the luminescence intensity for NL-IQCB1 wild-type and mutants, and the fluorescence intensity for mCit-fused partners wild-type and mutants. (B) The bar plots indicate the luminescence intensity for NL-SPOP wild-type and mutants, and the fluorescence intensity for mCit-fused partners wild-type and mutants. (C) The bar plots indicate the luminescence intensity for NL-PPP3CA wild-type and mutants, and the fluorescence intensity for mCit-fused partners wild-type and mutants. (D) The bar plots indicate the luminescence intensity for NL-REPS1 wild-type and mutants, and the fluorescence intensity for mCit-fused partners wild-type and mutants.

Figure 5.5: Expression of wild-type and variant protein pairs during BRET saturation assay.
(A) The bar plots indicate the luminescence intensity for NL-WWOX wild-type and variants, and the fluorescence intensity for mCit-LITAF wild-type and variants. (B) The bar plots show the expression levels of WWOX-DAZAP2 wild-type and variant interactions: the luminescence intensity for NL-WWOX wild-type and mutants, and the fluorescence intensity for mCit-DAZAP2 wild-type and VUS Y46C. (C) The bar plots indicate the luminescence intensity for NL-WWOX wild-type and variants, and the fluorescence intensity for mCit-CPSF6 wild-type and VUS. (D) The bar plots indicate the luminescence intensity for NL-WWOX wild-type and variants, and the fluorescence intensity for mCit-HOXA1 wild-type. (E) The bar plots indicate the luminescence intensity for NL-WWOX wild-type and variants, and the fluorescence intensity for mCit-SNRPC wild-type. (F) The bar plots indicate the luminescence intensity for NL-WWOX wild-type and variants, and the fluorescence intensity for mCit-CSNK2B wild-type.

Figure 5.6: Expression of wild-type and variant protein pairs during BRET saturation assay. (A) The bar plots indicate the luminescence intensity for NL-IQCB1 wild-type and variants, and the fluorescence intensity for mCit-fused partners wild-type and variants. (B) The bar plots indicate the luminescence intensity for NL-CTBP1 wild-type and variants, and the fluorescence intensity for mCit-fused partners wild-type and variants.

Figure 5.7: Expression of wild-type and variant protein pairs during BRET saturation assay. The bar plots indicate the luminescence intensity for NL-REPS1 wild-type and variants of TRAPPC2L, the luminescence intensity for NL-SPOP wild-type and variants, and the fluorescence intensity for mCit-MYD88 partners wild-type and variants.

Bibliography

Adzhubei, Ivan A, Steffen Schmidt, Leonid Peshkin, Vasily E Ramensky, Anna Gerasimova, Peer Bork, Alexey S Kondrashov, and Shamil R Sunyaev (2010).
“A method and server for predicting damaging missense mutations.” In: Nature Methods 7.4, pp. 248–249. doi: 10.1038/nmeth0410-248.
Akdel, Mehmet, Douglas E V Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O Zalevsky, Bálint Mészáros, Patrick Bryant, Lydia L Good, Roman A Laskowski, Gabriele Pozzati, Aditi Shenoy, Wensi Zhu, Petras Kundrotas, Victoria Ruiz Serra, Carlos H M Rodrigues, Alistair S Dunham, David Burke, Neera Borkakoti, Sameer Velankar, Adam Frost, Jérôme Basquin, Kresten Lindorff-Larsen, Alex Bateman, Andrey V Kajava, Alfonso Valencia, Sergey Ovchinnikov, Janani Durairaj, David B Ascher, Janet M Thornton, Norman E Davey, Amelie Stein, Arne Elofsson, Tristan I Croll, and Pedro Beltrao (2022). “A structural biology community assessment of AlphaFold2 applications.” In: Nature Structural & Molecular Biology 29.11, pp. 1056–1067. issn: 1545-9993. doi: 10.1038/s41594-022-00849-w.
Alberts, Bruce, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter (2002). Molecular Biology of the Cell. Garland Science. isbn: 0-8153-3218-1, 0-8153-4072-9.
Apic, G, J Gough, and S A Teichmann (2001). “Domain combinations in archaeal, eubacterial and eukaryotic proteomes.” In: Journal of Molecular Biology 310.2, pp. 311–325. doi: 10.1006/jmbi.2001.4776.
Arimura, T, T Nakamura, S Hiroi, M Satoh, M Takahashi, N Ohbuchi, K Ueda, T Nouchi, N Yamaguchi, J Akai, A Matsumori, S Sasayama, and A Kimura (2000). “Characterization of the human nebulette gene: a polymorphism in an actin-binding motif is associated with nonfamilial idiopathic dilated cardiomyopathy.” In: Human Genetics 107.5, pp. 440–451. doi: 10.1007/s004390000389.
Babu, M Madan, Richard W Kriwacki, and Rohit V Pappu (2012). “Structural biology. Versatility from protein disorder.” In: Science 337.6101, pp. 1460–1461. doi: 10.1126/science.1228775.
Babu, M Madan, Robin van der Lee, Natalia Sanchez de Groot, and Jörg Gsponer (2011).
“Intrinsically disordered proteins: regulation and disease.” In: Current Opinion in Structural Biology 21.3, pp. 432–440. doi: 10.1016/j.sbi.2011.03.011.
Bagowski, Christoph P., Wouter Bruins, and Aartjan J. W. Te Velthuis (2010). “The nature of protein domain evolution: shaping the interaction network”. In: Current Genomics 11.5, pp. 368–376. doi: 10.2174/138920210791616725.
Berg, J M and H A Godwin (1997). “Lessons from zinc-binding peptides.” In: Annual review of biophysics and biomolecular structure 26, pp. 357–371. doi: 10.1146/annurev.biophys.26.1.357.
Björklund, Asa K, Diana Ekman, Sara Light, Johannes Frey-Skött, and Arne Elofsson (2005). “Domain rearrangements in protein evolution.” In: Journal of Molecular Biology 353.4, pp. 911–923. doi: 10.1016/j.jmb.2005.08.067.
Blake, C. C. F., D. F. Koenig, G. A. Mair, A. C. T. North, D. C. Phillips, and V. R. Sarma (1965). “Structure of Hen Egg-White Lysozyme: A Three-dimensional Fourier Synthesis at 2 Å Resolution”. In: Nature 206, pp. 757–761. doi: 10.1038/206757a0.
Braun Tasan, Murat, Matija Dreze, Miriam Barrios-Rodiles, Irma Lemmens, Haiyuan Yu, Julie M Sahalie, Ryan R Murray, Luba Roncari, Anne-Sophie de Smet, Kavitha Venkatesan, Jean-François Rual, Jean Vandenhaute, Michael E Cusick, Tony Pawson, David E Hill, Jan Tavernier, Jeffrey L Wrana, Frederick P Roth, and Marc Vidal (2009). “An experimentally derived confidence score for binary protein-protein interactions.” In: Nature Methods 6.1, pp. 91–97. doi: 10.1038/nmeth.1281.
Bulman, D E, S B Gangopadhyay, K G Bebchuck, R G Worton, and P N Ray (1991). “Point mutation in the human dystrophin gene: identification through western blot analysis.” In: Genomics 10.2, pp. 457–460. doi: 10.1016/0888-7543(91)90332-9.
Bystroff, Christopher and Anders Krogh (2008). “Hidden Markov Models for Prediction of Protein Features”. In: Methods in Molecular Biology. Vol. 413. MIMB. Humana Press, pp. 173–198. doi: 10.1007/978-1-59745-582-4_12.
Campbell, Iain Donald and Martin Baron (1991). “The structure and function of protein modules”. In: Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 332.1263, pp. 199–203. issn: 0962-8436. doi: 10.1098/rstb.1991.0045.
Chaisson Sanders, Ashley D, Xuefang Zhao, Ankit Malhotra, David Porubsky, Tobias Rausch, Eugene J Gardner, Oscar L Rodriguez, Li Guo, Ryan L Collins, Xian Fan, Jia Wen, Robert E Handsaker, Susan Fairley, Zev N Kronenberg, Xiangmeng Kong, Fereydoun Hormozdiari, Dillon Lee, Aaron M Wenger, Alex R Hastie, Danny Antaki, Thomas Anantharaman, Peter A Audano, Harrison Brand, Stuart Cantsilieris, Han Cao, Eliza Cerveira, Chong Chen, Xintong Chen, Chen-Shan Chin, Zechen Chong, Nelson T Chuang, Christine C Lambert, Deanna M Church, Laura Clarke, Andrew Farrell, Joey Flores, Timur Galeev, David U Gorkin, Madhusudan Gujral, Victor Guryev, William Haynes Heaton, Jonas Korlach, Sushant Kumar, Jee Young Kwon, Ernest T Lam, Jong Eun Lee, Joyce Lee, Wan-Ping Lee, Sau Peng Lee, Shantao Li, Patrick Marks, Karine Viaud-Martinez, Sascha Meiers, Katherine M Munson, Fabio C P Navarro, Bradley J Nelson, Conor Nodzak, Amina Noor, Sofia Kyriazopoulou-Panagiotopoulou, Andy W C Pang, Yunjiang Qiu, Gabriel Rosanio, Mallory Ryan, Adrian Stütz, Diana C J Spierings, Alistair Ward, AnneMarie E Welch, Ming Xiao, Wei Xu, Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley, Ernesto Lowy, Sergei Yakneen, Steven McCarroll, Goo Jun, Li Ding, Chong Lek Koh, Bing Ren, Paul Flicek, Ken Chen, Mark B Gerstein, Pui-Yan Kwok, Peter M Lansdorp, Gabor T Marth, Jonathan Sebat, Xinghua Shi, Ali Bashir, Kai Ye, Scott E Devine, Michael E Talkowski, Ryan E Mills, Tobias Marschall, Jan O Korbel, Evan E Eichler, and Charles Lee (2019). “Multi-platform discovery of haplotype-resolved structural variation in human genomes.” In: Nature Communications 10.1, p. 1784. issn: 2041-1723. doi: 10.1038/s41467-018-08148-z.
Chen Li, S, Y Chen, P L Chen, Z D Sharp, and W H Lee (1996). “The nuclear localization sequences of the BRCA1 protein interact with the importin-alpha subunit of the nuclear transport signal receptor.” In: The Journal of Biological Chemistry 271.51, pp. 32863–32868. doi: 10.1074/jbc.271.51.32863.
Chen, Siwei, Robert Fragoza, Lambertus Klei, Yuan Liu, Jiebiao Wang, Kathryn Roeder, Bernie Devlin, and Haiyuan Yu (2018). “An interactome perturbation framework prioritizes damaging missense mutations for developmental disorders.” In: Nature Genetics 50.7, pp. 1032–1040. issn: 1061-4036. doi: 10.1038/s41588-018-0130-z.
Cheng, Jun, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, Rosalia G Schneider, Andrew W Senior, John Jumper, Demis Hassabis, Pushmeet Kohli, and Žiga Avsec (2023). “Accurate proteome-wide missense variant effect prediction with AlphaMissense.” In: Science 381.6664, eadg7492. issn: 0036-8075. doi: 10.1126/science.adg7492.
Chien, Bartel, Sternglanz, and Fields (1991). “The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest”. In: Proceedings of the National Academy of Sciences U S A 88.21, pp. 9578–9582. doi: 10.1073/pnas.88.21.9578.
Choi Olivet, Julien, Patricia Cassonnet, Pierre-Olivier Vidalain, Katja Luck, Luke Lambourne, Kerstin Spirohn, Irma Lemmens, Mélanie Dos Santos, Caroline Demeret, Louis Jones, Sudharshan Rangarajan, Wenting Bian, Eloi P Coutant, Yves L Janin, Sylvie van der Werf, Philipp Trepte, Erich E Wanker, Javier De Las Rivas, Jan Tavernier, Jean-Claude Twizere, Tong Hao, David E Hill, Marc Vidal, Michael A Calderwood, and Yves Jacob (2019). “Maximizing binary interactome mapping with a minimal number of assays.” In: Nature Communications 10.1, p. 3907. doi: 10.1038/s41467-019-11809-2.
ClinVar Miner (2024). ClinVar Miner. Accessed: 2024-09-05.
Consortium, 1000 Genomes Project, Adam Auton, Lisa D Brooks, Richard M Durbin, Erik P Garrison, Hyun Min Kang, Jan O Korbel, Jonathan L Marchini, Shane McCarthy, Gil A McVean, and Gonçalo R Abecasis (2015). “A global reference for human genetic variation.” In: Nature 526.7571, pp. 68–74. issn: 0028-0836. doi: 10.1038/nature15393.
Copley, Richard R, Tobias Doerks, Ivica Letunic, and Peer Bork (2002). “Protein domain analysis in the era of complete genomes.” In: FEBS Letters 513.1, pp. 129–134. doi: 10.1016/s0014-5793(01)03289-6.
Davey, Norman E, M Madan Babu, Martin Blackledge, Alan Bridge, Salvador Capella-Gutierrez, Zsuzsanna Dosztanyi, Rachel Drysdale, Richard J Edwards, Arne Elofsson, Isabella C Felli, Toby J Gibson, Aleksandras Gutmanas, John M Hancock, Jen Harrow, Desmond Higgins, Cy M Jeffries, Philippe Le Mercier, Balint Mészáros, Marco Necci, Cedric Notredame, Sandra Orchard, Christos A Ouzounis, Rita Pancsa, Elena Papaleo, Roberta Pierattelli, Damiano Piovesan, Vasilis J Promponas, Patrick Ruch, Gabriella Rustici, Pedro Romero, Sirarat Sarntivijai, Gary Saunders, Benjamin Schuler, Malvika Sharan, Denis C Shields, Joel L Sussman, Jonathan A Tedds, Peter Tompa, Michael Turewicz, Jiri Vondrasek, Wim F Vranken, Bonnie Ann Wallace, Kanin Wichapong, and Silvio C E Tosatto (2019). “An intrinsically disordered proteins community for ELIXIR.” In: F1000Research 8. doi: 10.12688/f1000research.20136.1.
Davey, Norman E, Martha S Cyert, and Alan M Moses (2015). “Short linear motifs - ex nihilo evolution of protein regulation.” In: Cell Communication and Signaling 13, p. 43. doi: 10.1186/s12964-015-0120-z.
Davey, Norman E, Niall J Haslam, Denis C Shields, and Richard J Edwards (2011). “SLiMSearch 2.0: biological context for short linear motifs in proteins.” In: Nucleic Acids Research 39.Web Server issue, W56–60. doi: 10.1093/nar/gkr402.
Davey, Norman E, Kim Van Roey, Robert J Weatheritt, Grischa Toedt, Bora Uyar, Brigitte Altenberg, Aidan Budd, Francesca Diella, Holger Dinkel, and Toby J Gibson (2012). “Attributes of short linear motifs.” In: Molecular Biosystems 8.1, pp. 268–281. doi: 10.1039/c1mb05231d.
Dhanoa, Bajinder S, Tiziana Cogliati, Akhila G Satish, Elspeth A Bruford, and James S Friedman (2013). “Update on the Kelch-like (KLHL) gene family.” In: Human genomics 7, p. 13. doi: 10.1186/1479-7364-7-13.
Dill, Ken A and Justin L MacCallum (2012). “The protein-folding problem, 50 years on.” In: Science 338.6110, pp. 1042–1046. doi: 10.1126/science.1219021.
Ding Yuan, Fang, Priyadarshan K Damle, Larisa Litovchick, Ronny Drapkin, and Steven R Grossman (2020). “CtBP determines ovarian cancer cell fate through repression of death receptors.” In: Cell death & disease 11.4, p. 286. doi: 10.1038/s41419-020-2455-7.
Doolittle, Russell F. (1995). “The Multiplicity of Domains in Proteins”. In: Annual Review of Biochemistry 64, pp. 287–314. doi: 10.1146/annurev.bi.64.070195.001443.
Dosztányi, Peter Csizmok Tompa, and Simon (2005). “IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content.” In: Bioinformatics 21.16, pp. 3433–3434. doi: 10.1093/bioinformatics/bti541.
Dosztányi, Zsuzsanna (2018). “Prediction of protein disorder based on IUPred.” In: Protein Science 27.1, pp. 331–340. doi: 10.1002/pro.3334.
Dragulescu-Andrasi Chan, Carmel T, Abhijit De, Tarik F Massoud, and Sanjiv S Gambhir (2011). “Bioluminescence resonance energy transfer (BRET) imaging of protein-protein interactions within deep tissues of living subjects.” In: Proceedings of the National Academy of Sciences of the United States of America 108.29, pp. 12060–12065. issn: 1091-6490. doi: 10.1073/pnas.1100923108.
Dunker, A. Keith, Celeste J. Brown, and Zoran Obradovic (2002). “Identification and functions of usefully disordered proteins”.
In: Advances in Protein Chemistry 62, pp. 25–49. doi: 10.1016/S0065-3233(02)62004-3.
Dyson Wright, Peter E (2005). “Intrinsically unstructured proteins and their functions.” In: Nature Reviews. Molecular Cell Biology 6.3, pp. 197–208. doi: 10.1038/nrm1589.
Edwards and Nicolas Palopoli (2014). “Computational Prediction of Short Linear Motifs from Protein Sequences”. In: Computational Peptidology. Vol. 1268. Methods in Molecular Biology. Humana Press, pp. 89–141. doi: 10.1007/978-1-4939-2285-7_5.
Felli, Isabella C. and Roberta Pierattelli (2015). Intrinsically Disordered Proteins Studied by NMR Spectroscopy. Springer. isbn: 978-3-319-20197-9. doi: 10.1007/978-3-319-20198-6.
Fields and Song (1989). “A novel genetic system to detect protein–protein interactions”. In: Nature 340, pp. 245–246. doi: 10.1038/340245a0.
Filograna De Tito, Stefano, Matteo Lo Monte, Rosario Oliva, Francesca Bruzzese, Maria Serena Roca, Antonella Zannetti, Adelaide Greco, Daniela Spano, Inmaculada Ayala, Assunta Liberti, Luigi Petraccone, Nina Dathan, Giuliana Catara, Laura Schembri, Antonino Colanzi, Alfredo Budillon, Andrea Rosario Beccari, Pompea Del Vecchio, Alberto Luini, Daniela Corda, and Carmen Valente (2024). “Identification and characterization of a new potent inhibitor targeting CtBP1/BARS in melanoma cells.” In: Journal of Experimental & Clinical Cancer Research 43.1, p. 137. doi: 10.1186/s13046-024-03044-5.
Finn Mistry, Jaina, John Tate, Penny Coggill, Andreas Heger, Joanne E Pollington, O Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristoffer Forslund, Liisa Holm, Erik L L Sonnhammer, Sean R Eddy, and Alex Bateman (2010). “The Pfam protein families database.” In: Nucleic Acids Research 38.Database issue, pp. D211–22. doi: 10.1093/nar/gkp985.
Finn, Robert D, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, Erik L L Sonnhammer, John Tate, and Marco Punta (2014).
“Pfam: the protein families database.” In: Nucleic Acids Research 42.Database issue, pp. D222–30. doi: 10.1093/nar/gkt1223.
Forbes Bindal, Nidhi, Sally Bamford, Charlotte Cole, Chai Yin Kok, David Beare, Mingming Jia, Rebecca Shepherd, Kenric Leung, Andrew Menzies, Jon W Teague, Peter J Campbell, Michael R Stratton, and P Andrew Futreal (2011). “COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer.” In: Nucleic Acids Research 39.Database issue, pp. D945–50. doi: 10.1093/nar/gkq929.
Fouassier, L, C C Yun, J G Fitz, and R B Doctor (2000). “Evidence for ezrin-radixin-moesin-binding phosphoprotein 50 (EBP50) self-association through PDZ-PDZ interactions.” In: The Journal of Biological Chemistry 275.32, pp. 25039–25045. doi: 10.1074/jbc.C000092200.
Fragoza, Robert, Jishnu Das, Shayne D Wierbowski, Jin Liang, Tina N Tran, Siqi Liang, Juan F Beltran, Christen A Rivera-Erick, Kaixiong Ye, Ting-Yi Wang, Li Yao, Matthew Mort, Peter D Stenson, David N Cooper, Xiaomu Wei, Alon Keinan, John C Schimenti, Andrew G Clark, and Haiyuan Yu (2019). “Extensive disruption of protein interactions by genetic variants across the allele frequency spectrum in human populations.” In: Nature Communications 10.1, p. 4141. doi: 10.1038/s41467-019-11959-3.
Freedman, M. H. and M. Sela (1966). “Recovery of antigenic activity upon reoxidation of completely reduced polyalanyl rabbit immunoglobulin G”. In: J. Biol. Chem. 241.10, pp. 2383–2396.
Geist Lee, Chop Yan, Joelle Morgan Strom, José de Jesús Naveja, and Katja Luck (2024). “Generation of a high confidence set of domain-domain interface types to guide protein complex structure predictions by AlphaFold.” In: Bioinformatics. doi: 10.1093/bioinformatics/btae482.
Gilmore, T D (2006). “Introduction to NF-kappaB: players, pathways, perspectives.” In: Oncogene 25.51, pp. 6680–6684. doi: 10.1038/sj.onc.1209954.
Glover Williams, Lee (2004).
“Interactions between BRCT repeats and phosphoproteins: tangled up in two.” In: Trends in Biochemical Sciences 29.11, pp. 579–585. doi: 10.1016/j.tibs.2004.09.010.
gnomAD (2024). Genome Aggregation Database (gnomAD). Accessed: 2024-09-05.
Goh, Kwang-Il, Michael E Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László Barabási (2007). “The human disease network.” In: Proceedings of the National Academy of Sciences of the United States of America 104.21, pp. 8685–8690. issn: 0027-8424. doi: 10.1073/pnas.0701361104.
Gouw, Marc, Hugo Sámano-Sánchez, Kim Van Roey, Francesca Diella, Toby J Gibson, and Holger Dinkel (2017). “Exploring short linear motifs using the ELM database and tools.” In: Current Protocols in Bioinformatics 58, pp. 8.22.1–8.22.35. doi: 10.1002/cpbi.26.
Gouw, Hugo Sámano-Sánchez, Manjeet Kumar, András Zeke, Benjamin Lang, Benoit Bely, Lucía B Chemes, Norman E Davey, Ziqi Deng, Francesca Diella, Clara-Marie Gürth, Ann-Kathrin Huber, Stefan Kleinsorg, Lara S Schlegel, Nicolás Palopoli, Kim V Roey, Brigitte Altenberg, Attila Reményi, Holger Dinkel, and Toby J Gibson (2018). “The eukaryotic linear motif resource - 2018 update.” In: Nucleic Acids Research 46.D1, pp. D428–D434. issn: 0305-1048. doi: 10.1093/nar/gkx1077.
Goyet, Elise, Nathalie Bouquier, Vincent Ollendorff, and Julie Perroy (2016). “Fast and high resolution single-cell BRET imaging”. In: Scientific Reports 6, Article 28231. doi: 10.1038/srep28231.
Grozinger, C M and S L Schreiber (2000). “Regulation of histone deacetylase 4 and 5 and transcriptional activity by 14-3-3-dependent cellular localization.” In: Proceedings of the National Academy of Sciences of the United States of America 97.14, pp. 7835–7840. doi: 10.1073/pnas.140199597.
Grünberg, Raik, Julia V Burnier, Tony Ferrar, Violeta Beltran-Sastre, François Stricher, Almer M van der Sloot, Raquel Garcia-Olivas, Arrate Mallabiabarrena, Xavier Sanjuan, Timo Zimmermann, and Luis Serrano (2013).
“Engineering of weak helper interactions for high-efficiency FRET probes”. In: Nature Methods 10.10, pp. 1021–1027. doi: 10.1038/nmeth.2625.
Gupta, Vandana A and Alan H Beggs (2014). “Kelch proteins: emerging roles in skeletal muscle development and diseases.” In: Skeletal muscle [electronic resource] 4, p. 11. doi: 10.1186/2044-5040-4-11.
Hall Unch, James, Brock F Binkowski, Michael P Valley, Braeden L Butler, Monika G Wood, Paul Otto, Kristopher Zimmerman, Gediminas Vidugiris, Thomas Machleidt, Matthew B Robers, Hélène A Benink, Christopher T Eggers, Michael R Slater, Poncho L Meisenheimer, Dieter H Klaubert, Frank Fan, Lance P Encell, and Keith V Wood (2012). “Engineered luciferase reporter from a deep sea shrimp utilizing a novel imidazopyrazinone substrate.” In: ACS Chemical Biology 7.11, pp. 1848–1857. doi: 10.1021/cb3002478.
Hamosh Scott, Alan F, Joanna S Amberger, Carol A Bocchini, and Victor A McKusick (2005). “Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders.” In: Nucleic Acids Research 33.Database issue, pp. D514–7. doi: 10.1093/nar/gki033.
Han, J.-D. J., N. Bertin, T. Hao, D. S. Goldberg, G. F. Berriz, L. V. Zhang, D. Dupuy, A. J. M. Walhout, M. E. Cusick, F. P. Roth, and M. Vidal (2004). Evidence for dynamically organized modularity in the yeast protein–protein interaction network. doi: 10.1038/nature02654.
Harris, B Z and W A Lim (2001). “Mechanism and role of PDZ domains in signaling complex assembly.” In: Journal of Cell Science 114.Pt 18, pp. 3219–3231. doi: 10.1242/jcs.114.18.3219.
Hayden, Matthew S and Sankar Ghosh (2008). “Shared principles in NF-kappaB signaling.” In: Cell 132.3, pp. 344–362. doi: 10.1016/j.cell.2008.01.020.
Holmstrom, Erik D and David J Nesbitt (2016). “Biophysical Insights from Temperature-Dependent Single-Molecule Förster Resonance Energy Transfer.” In: Annual review of physical chemistry 67, pp. 441–465. doi: 10.1146/annurev-physchem-040215-112544.
Hsu, Lih-Ching (2007). “Identification and functional characterization of a PP1-binding site in BRCA1.” In: Biochemical and Biophysical Research Communications 360.2, pp. 507–512. doi: 10.1016/j.bbrc.2007.06.090.
Huttlin, Edward L., Raphael J. Bruckner, Joao A. Paulo, Joe R. Cannon, Lily Ting, Kurt Baltier, Greg Colby, Fana Gebreab, Melanie P. Gygi, Hannah Parzen, John Szpyt, Stanley Tam, Gabriela Zarraga, Laura Pontano-Vaites, Sharan Swarup, Anne E. White, Devin K. Schweppe, Ramin Rad, Brian K. Erickson, Robert A. Obar, K. G. Guruharsha, Kejie Li, Spyros Artavanis-Tsakonas, Steven P. Gygi, and J. Wade Harper (2017). “Architecture of the human interactome defines protein communities and disease networks”. In: Nature 545.7655, pp. 505–509. issn: 0028-0836. doi: 10.1038/nature22366.
Huttlin, Edward L., Richard J. Bruckner, Javier Navarrete-Perea, Jeffrey R. Cannon, Kevin Baltier, Fasil Gebreab, Martha P. Gygi, Austin Thornock, Genaro Zarraga, Shawn Tam, et al. (2021). “Dual proteome-scale networks reveal cell-specific remodeling of the human interactome”. In: Cell 184.11, 3022–3040.e28. doi: 10.1016/j.cell.2021.04.011.
Iakoucheva, Lilia M, Celeste J Brown, J David Lawson, Zoran Obradović, and A Keith Dunker (2002). “Intrinsic disorder in cell-signaling and cancer-associated proteins.” In: Journal of Molecular Biology 323.3, pp. 573–584. issn: 0022-2836. doi: 10.1016/s0022-2836(02)00969-5.
Idrees, Sobia and Keshav Raj Paudel (2024). “Proteome-wide assessment of human interactome as a source of capturing domain-motif and domain-domain interactions.” In: Journal of cell communication and signaling 18.1, e12014. doi: 10.1002/ccs3.12014.
Ingham Colwill, Karen, Caley Howard, Sabine Dettwiler, Caesar S H Lim, Joanna Yu, Kadija Hersi, Judith Raaijmakers, Gerald Gish, Geraldine Mbamalu, Lorne Taylor, Benny Yeung, Galina Vassilovski, Manish Amin, Fu Chen, Liudmila Matskova, Gösta Winberg, Ingemar Ernberg, Rune Linding, Paul O’donnell, Andrei Starostine, Walter Keller, Pavel Metalnikov, Chris Stark, and Tony Paw- son (2005). “WW domains provide a platform for the assembly of multipro- tein networks.” In: Molecular and Cellular Biology 25.16, pp. 7092–7106. doi: 10.1128/{MCB}.25.16.7092-7106.2005. Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Ž́ıdek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andrew J Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis (2021). “Highly accurate protein structure prediction with AlphaFold.” In: Nature 596.7873, pp. 583–589. issn: 0028-0836. doi: 10.1038/s41586-021-03819-2. 
Karczewski Francioli, Laurent C, Grace Tiao, Beryl B Cummings, Jessica Alföldi, Qingbo Wang, Ryan L Collins, Kristen M Laricchia, Andrea Ganna, Daniel P Birnbaum, Laura D Gauthier, Harrison Brand, Matthew Solomonson, Nicholas A Watts, Daniel Rhodes, Moriel Singer-Berk, Eleina M England, Eleanor G Seaby, Jack A Kosmicki, Raymond K Walters, Katherine Tashman, Yossi Far- joun, Eric Banks, Timothy Poterba, Arcturus Wang, Cotton Seed, Nicola Whif- fin, Jessica X Chong, Kaitlin E Samocha, Emma Pierce-Hoffman, Zachary Zap- pala, Anne H O’Donnell-Luria, Eric Vallabh Minikel, Ben Weisburd, Monkol Lek, James S Ware, Christopher Vittal, Irina M Armean, Louis Bergelson, Kris- tian Cibulskis, Kristen M Connolly, Miguel Covarrubias, Stacey Donnelly, Steven Ferriera, Stacey Gabriel, Jeff Gentry, Namrata Gupta, Thibault Jeandet, Diane Kaplan, Christopher Llanwarne, Ruchi Munshi, Sam Novod, Nikelle Petrillo, David Roazen, Valentin Ruano-Rubio, Andrea Saltzman, Molly Schleicher, Jose Soto, Kathleen Tibbetts, Charlotte Tolonen, Gordon Wade, Michael E Talkowski, Genome Aggregation Database Consortium, Benjamin M Neale, Mark J Daly, and Daniel G MacArthur (2020). “The mutational constraint spectrum quanti- fied from variation in 141,456 humans.” In: Nature 581.7809, pp. 434–443. issn: 0028-0836. doi: 10.1038/s41586-020-2308-7. 252 Kim, Jiho and Regis Grailhe (2016). “Nanoluciferase signal brightness using furi- mazine substrates opens bioluminescence resonance energy transfer to widefield microscopy”. In: Cytometry Part A. doi: 10.1002/cyto.a.22870. ￿ (2024). “Nanoluciferase signal brightness using furimazine substrates opens biolu- minescence resonance energy transfer to widefield microscopy”. In: Brief Report. Free Access. Kim, Mi-Sung, M Waseem Akhtar, Megumi Adachi, Melissa Mahgoub, Rhonda Bassel-Duby, Ege T Kavalali, Eric N Olson, and Lisa M Monteggia (2012). 
“An essential role for histone deacetylase 4 in synaptic plasticity and mem- ory formation.” In: The Journal of Neuroscience 32.32, pp. 10879–10886. doi: 10.1523/{JNEUROSCI}.2089-12.2012. Klug, Aaron (2010). “The discovery of zinc fingers and their applications in gene regulation and genome manipulation.” In: Annual Review of Biochemistry 79, pp. 213–231. doi: 10.1146/annurev-biochem-010909-095056. Kobayashi, Hiroyuki, Louis-Philippe Picard, Anne-Marie Schönegge, and Michel Bouvier (2019). “Bioluminescence resonance energy transfer-based imaging of protein-protein interactions in living cells.” In: Nature Protocols 14.4, pp. 1084– 1107. doi: 10.1038/s41596-019-0129-7. Koipally Georgopoulos, K (2000). “Ikaros interactions with CtBP reveal a repres- sion mechanism that is independent of histone deacetylase activity.” In: The Journal of Biological Chemistry 275.26, pp. 19594–19602. doi: 10.1074/jbc. M000254200. Koonin, E V (1996). “Pseudouridine synthases: four families of enzymes containing a putative uridine-binding motif also conserved in dUTPases and dCTP deami- nases.” In: Nucleic Acids Research 24.12, pp. 2411–2415. doi: 10.1093/nar/24. 12.2411. Kornau, H C, L T Schenker, M B Kennedy, and P H Seeburg (1995). “Domain inter- action between NMDA receptor subunits and the postsynaptic density protein PSD-95.” In: Science 269.5231, pp. 1737–1740. doi: 10.1126/science.7569905. Kumar, Manjeet, Sushama Michael, Jesús Alvarado-Valverde, András Zeke, Tamas Lazar, Juliana Glavina, Eszter Nagy-Kanta, Juan Mac Donagh, Zsofia E Kalman, Stefano Pascarelli, Nicolas Palopoli, László Dobson, Carmen Florencia Suarez, Kim Van Roey, Izabella Krystkowiak, Juan Esteban Griffin, Anurag Nagpal, Rajesh Bhardwaj, Francesca Diella, Bálint Mészáros, Kellie Dean, Norman E Davey, Rita Pancsa, Lućıa B Chemes, and Toby J Gibson (2024). “ELM-the Eu- karyotic Linear Motif resource-2024 update.” In: Nucleic Acids Research 52.D1, pp. D442–D455. doi: 10.1093/nar/gkad1058. 
Lacoste, Jessica, Marzieh Haghighi, Shahan Haider, Zhen-Yuan Lin, Dmitri Segal, Chloe Reno, Wesley Wei Qian, Xueting Xiong, Hamdah Shafqat-Abbasi, Pearl V Ryder, Rebecca Senft, Beth A Cimini, Frederick P Roth, Michael Calderwood, 253 David Hill, Marc Vidal, S Stephen Yi, Nidhi Sahni, Jian Peng, Anne-Claude Gin- gras, Shantanu Singh, Anne E Carpenter, and Mikko Taipale (2023). “Pervasive mislocalization of pathogenic coding variants underlying human disorders.” In: BioRxiv. doi: 10.1101/2023.09.05.556368. Landrum, Melissa J, Jennifer M Lee, Mark Benson, Garth Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, Jennifer Hart, Douglas Hoffman, Jeffrey Hoover, Wonhee Jang, Kenneth Katz, Michael Ovetsky, George Riley, Amanjeev Sethi, Ray Tully, Ricardo Villamarin-Salomon, Wendy Rubinstein, and Donna R Maglott (2016). “ClinVar: public archive of interpretations of clinically relevant variants.” In: Nucleic Acids Research 44.D1, pp. D862–8. doi: 10.1093/nar/ gkv1222. Lee (2010). “PDZ domains and their binding partners: structure, specificity, and modification.” In: Cell Communication and Signaling 8, p. 8. doi: 10.1186/ 1478-{811X}-8-8. Lee, Hubrich, Varga, Christian Schäfer, Mareen Welzel, Eric Schumbera, Milena Djo- kic, Joelle M Strom, Jonas Schönfeld, Johanna L Geist, Feyza Polat, Toby J Gib- son, Claudia Isabelle Keller Valsecchi, Manjeet Kumar, Ora Schueler-Furman, and Katja Luck (2024). “Systematic discovery of protein interaction interfaces using AlphaFold and experimental validation.” In: Molecular Systems Biology 20.2, pp. 75–97. issn: 1744-4292. doi: 10.1038/s44320-023-00005-6. Lee, Olzmann, Lih-Shen Chin, and Lian Li (2011). “Mutations associated with Charcot-Marie-Tooth disease cause SIMPLE protein mislocalization and degra- dation by the proteasome and aggresome-autophagy pathways.” In: Journal of Cell Science 124.Pt 19, pp. 3319–3331. doi: 10.1242/jcs.087114. 
Lek, Monkol, Konrad J Karczewski, Eric V Minikel, Kaitlin E Samocha, Eric Banks, Timothy Fennell, Anne H O’Donnell-Luria, James S Ware, Andrew J Hill, Beryl B Cummings, Taru Tukiainen, Daniel P Birnbaum, Jack A Kosmicki, Laramie E Duncan, Karol Estrada, Fengmei Zhao, James Zou, Emma Pierce-Hoffman, Joanne Berghout, David N Cooper, Nicole Deflaux, Mark DePristo, Ron Do, Jason Flannick, Menachem Fromer, Laura Gauthier, Jackie Goldstein, Namrata Gupta, Daniel Howrigan, Adam Kiezun, Mitja I Kurki, Ami Levy Moonshine, Pradeep Natarajan, Lorena Orozco, Gina M Peloso, Ryan Poplin, Manuel A Ri- vas, Valentin Ruano-Rubio, Samuel A Rose, Douglas M Ruderfer, Khalid Shakir, Peter D Stenson, Christine Stevens, Brett P Thomas, Grace Tiao, Maria T Tusie- Luna, Ben Weisburd, Hong-Hee Won, Dongmei Yu, David M Altshuler, Diego Ardissino, Michael Boehnke, John Danesh, Stacey Donnelly, Roberto Elosua, Jose C Florez, Stacey B Gabriel, Gad Getz, Stephen J Glatt, Christina M Hult- man, Sekar Kathiresan, Markku Laakso, Steven McCarroll, Mark I McCarthy, Dermot McGovern, Ruth McPherson, Benjamin M Neale, Aarno Palotie, Shaun M Purcell, Danish Saleheen, Jeremiah M Scharf, Pamela Sklar, Patrick F Sul- 254 livan, Jaakko Tuomilehto, Ming T Tsuang, Hugh C Watkins, James G Wil- son, Mark J Daly, Daniel G MacArthur, and Exome Aggregation Consortium (2016). “Analysis of protein-coding genetic variation in 60,706 humans.” In: Na- ture 536.7616, pp. 285–291. issn: 0028-0836. doi: 10.1038/nature19057. Letunic, Ivica, Supriya Khedkar, and Peer Bork (2021). “SMART: recent up- dates, new developments and status in 2020.” In: Nucleic Acids Research 49.D1, pp. D458–D460. doi: 10.1093/nar/gkaa937. Li Wang, Fei, Qiao Wang, Na Zhang, Jumei Zheng, Maiqing Zheng, Ranran Liu, Huanxian Cui, Jie Wen, and Guiping Zhao (2020). “SPOP promotes ubiquiti- nation and degradation of MyD88 to suppress the innate immune response.” In: PLoS Pathogens 16.5, e1008188. doi: 10.1371/journal.ppat.1008188. 
Lievens, Peelman, De Bosscher, Lemmens, and Jan Tavernier (2011). “MAPPIT: a protein interaction toolbox built on insights in cytokine receptor signaling”. In: Cytokine Growth Factor Reviews 22.5-6, pp. 321–329. doi: 10.1016/j. cytogfr.2011.11.001. Lin Smith, Edwin R, Hidehisa Takahashi, Ka Chun Lai, Skylar Martin-Brown, Lau- rence Florens, Michael P Washburn, Joan W Conaway, Ronald C Conaway, and Ali Shilatifard (2010). “AFF4, a component of the ELL/P-TEFb elongation complex and a shared subunit of MLL chimeras, can link transcription elonga- tion to leukemia.” In: Molecular Cell 37.3, pp. 429–437. issn: 1097-4164. doi: 10.1016/j.molcel.2010.01.026. Livesey, Benjamin J and Joseph A Marsh (2022). “Interpreting protein variant ef- fects with computational predictors and deep mutational scanning.” In: Disease Models & Mechanisms 15.6. doi: 10.1242/dmm.049510. Luck, Katja, Sebastian Charbonnier, and Gilles Travé (2012). “The emerging con- tribution of sequence context to the specificity of protein interactions mediated by PDZ domains”. In: FEBS Letters 586.17, pp. 2648–2661. doi: 10.1016/j. febslet.2012.03.056. 
Luck, Katja, Dae-Kyum Kim, Luke Lambourne, Kerstin Spirohn, Bridget E Begg, Wenting Bian, Ruth Brignall, Tiziana Cafarelli, Francisco J Campos-Laborie, Benoit Charloteaux, Dongsic Choi, Atina G Coté, Meaghan Daley, Steven Deim- ling, Alice Desbuleux, Amélie Dricot, Marinella Gebbia, Madeleine F Hardy, Nishka Kishore, Jennifer J Knapp, István A Kovács, Irma Lemmens, Miles W Mee, Joseph C Mellor, Carl Pollis, Carles Pons, Aaron D Richardson, Sadie Schlabach, Bridget Teeking, Anupama Yadav, Mariana Babor, Dawit Balcha, Omer Basha, Christian Bowman-Colin, Suet-Feung Chin, Soon Gang Choi, Clau- dia Colabella, Georges Coppin, Cassandra D’Amata, David De Ridder, Steffi De Rouck, Miquel Duran-Frigola, Hanane Ennajdaoui, Florian Goebels, Liana Goehring, Anjali Gopal, Ghazal Haddad, Elodie Hatchi, Mohamed Helmy, Yves Jacob, Yoseph Kassa, Serena Landini, Roujia Li, Natascha van Lieshout, An- 255 drew MacWilliams, Dylan Markey, Joseph N Paulson, Sudharshan Rangarajan, John Rasla, Ashyad Rayhan, Thomas Rolland, Adriana San-Miguel, Yun Shen, Dayag Sheykhkarimli, Gloria M Sheynkman, Eyal Simonovsky, Murat Taşan, Alexander Tejeda, Vincent Tropepe, Jean-Claude Twizere, Yang Wang, Robert J Weatheritt, Jochen Weile, Yu Xia, Xinping Yang, Esti Yeger-Lotem, Quan Zhong, Patrick Aloy, Gary D Bader, Javier De Las Rivas, Suzanne Gaudet, Tong Hao, Janusz Rak, Jan Tavernier, David E Hill, Marc Vidal, Frederick P Roth, and Michael A Calderwood (2020). “A reference map of the human binary protein interactome.” In: Nature 580.7803, pp. 402–408. issn: 0028-0836. doi: 10.1038/s41586-020-2188-x. Ludes-Meyers Kil, Hyunsuk, Andrzej K Bednarek, Jeff Drake, Mark T Bedford, and C Marcelo Aldaz (2004). “WWOX binds the specific proline-rich ligand PPXY: identification of candidate interacting proteins.” In: Oncogene 23.29, pp. 5049– 5055. doi: 10.1038/sj.onc.1207680. 
Luo Lin, Chengqi, Erin Guest, Alexander S Garrett, Nima Mohaghegh, Selene Swan- son, Stacy Marshall, Laurence Florens, Michael P Washburn, and Ali Shilatifard (2012). “The super elongation complex family of RNA polymerase II elongation factors: gene target specificity and transcriptional output.” In: Molecular and Cellular Biology 32.13, pp. 2608–2617. doi: 10.1128/{MCB}.00182-12. Luo, X., Q. He, Y. Huang, and M. S. Sheikh (2005). “Cloning and characteriza- tion of a p53 and DNA damage down-regulated gene PIQ that codes for a novel calmodulin-binding IQ motif protein and is up-regulated in gastrointestinal can- cers”. In: Cancer Research 65, pp. 10725–10733. Martino, Elisa, Sara Chiarugi, Francesco Margheriti, and Gianpiero Garau (2021). “Mapping, structure and modulation of PPI”. In: Frontiers in Chemistry 9, p. 718405. doi: 10.3389/fchem.2021.718405. Melhuish Wotton, D (2000). “The interaction of the carboxyl terminus-binding pro- tein with the Smad corepressor TGIF is disrupted by a holoprosencephaly muta- tion in TGIF.” In: The Journal of Biological Chemistry 275.50, pp. 39762–39766. doi: 10.1074/jbc.C000416200. Mészáros, Bálint, István Simon, and Zsuzsanna Dosztányi (2009). “Prediction of protein binding regions in disordered proteins.” In: PLoS Computational Biology 5.5, e1000376. doi: 10.1371/journal.pcbi.1000376. Meyer Kirchner, Marieluise, Bora Uyar, Jing-Yuan Cheng, Giulia Russo, Luis R Hernandez-Miranda, Anna Szymborska, Henrik Zauber, Ina-Maria Rudolph, Thomas E Willnow, Altuna Akalin, Volker Haucke, Holger Gerhardt, Carmen Birchmeier, Ralf Kühn, Michael Krauss, Sebastian Diecke, Juan M Pascual, and Matthias Selbach (2018). “Mutations in disordered regions can cause disease by creating dileucine motifs.” In: Cell 175.1, 239–253.e17. issn: 00928674. doi: 10.1016/j.cell.2018.08.019. 
256 Mihalič, Filip, Leandro Simonetti, Girolamo Giudice, Marie Rubin Sander, Richard Lindqvist, Marie Berit Akpiroro Peters, Caroline Benz, Eszter Kassa, Dilip Badgujar, Raviteja Inturi, Muhammad Ali, Izabella Krystkowiak, Ahmed Sayadi, Eva Andersson, Hanna Aronsson, Ola Söderberg, Doreen Dobritzsch, Evangelia Petsalaki, Anna K Överby, Per Jemth, Norman E Davey, and Ylva Ivarsson (2023). “Large-scale phage-based screening reveals extensive pan-viral mimicry of host short linear motifs.” In: Nature Communications 14.1, p. 2409. doi: 10.1038/s41467-023-38015-5. Mosca, Roberto, Arnaud Céol, Amelie Stein, Roger Olivella, and Patrick Aloy (2014). “3did: a catalog of domain-based interactions of known three-dimensional structure”. In: Nucleic Acids Research 42.Database issue, pp. D374–D379. doi: 10.1093/nar/gkt887. eprint: 2013Sep29. Nesta, Alex V, Denisse Tafur, and Christine R Beck (2021). “Hotspots of human mutation.” In: Trends in Genetics 37.8, pp. 717–729. issn: 01689525. doi: 10. 1016/j.tig.2020.10.003. Nooren Thornton, Janet M. (2003). “Diversity of protein–protein interactions”. In: The EMBO Journal 22.14, pp. 3486–3492. doi: 10.1093/emboj/cdg359. Northrop, J. H. (1930). “CRYSTALLINE PEPSIN: I. ISOLATION AND TESTS OF PURITY”. In: The Journal of General Physiology 13.6, pp. 739–766. doi: 10.1085/jgp.13.6.739. Oldfield, Christopher J and A Keith Dunker (2014). “Intrinsically disordered proteins and intrinsically disordered protein regions.” In: Annual Review of Biochemistry 83, pp. 553–584. doi: 10.1146/annurev-biochem-072711-164947. Oliver Bitoun, Emmanuelle, Joanne Clark, Emma L Jones, and Kay E Davies (2004). “Mediation of Af4 protein function in the cerebellum by Siah proteins.” In: Pro- ceedings of the National Academy of Sciences of the United States of America 101.41, pp. 14901–14906. doi: 10.1073/pnas.0406196101. Oxley Anthis, Nicholas J, Edward D Lowe, Ioannis Vakonakis, Iain D Campbell, and Kate L Wegener (2008). 
“An integrin phosphorylation switch: the effect of beta3 integrin tail phosphorylation on Dok1 and talin binding.” In: The Journal of Biological Chemistry 283.9, pp. 5420–5426. doi: 10.1074/jbc.M709435200. Paysan-Lafosse, Typhaine, Matthias Blum, Sara Chuguransky, Tiago Grego, Beatriz Lázaro Pinto, Gustavo A Salazar, Maxwell L Bileschi, Peer Bork, Alan Bridge, Lucy Colwell, Julian Gough, Daniel H Haft, Ivica Letunić, Aron Marchler-Bauer, Huaiyu Mi, Darren A Natale, Christine A Orengo, Arun P Pandurangan, Cather- ine Rivoire, Christian J A Sigrist, Ian Sillitoe, Narmada Thanki, Paul D Thomas, Silvio C E Tosatto, Cathy H Wu, and Alex Bateman (2023). “InterPro in 2022.” In: Nucleic Acids Research 51.D1, pp. D418–D427. doi: 10.1093/nar/gkac993. 257 Peng, Zhenling, Marcin J Mizianty, Bin Xue, Lukasz Kurgan, and Vladimir N Uver- sky (2012). “More than just tails: intrinsic disorder in histone proteins.” In: Molecular Biosystems 8.7, pp. 1886–1901. doi: 10.1039/c2mb25102g. Pennington, K L, T Y Chan, M P Torres, and J L Andersen (2018). “The dy- namic and stress-adaptive signaling hub of 14-3-3: emerging mechanisms of reg- ulation and context-dependent protein-protein interactions.” In: Oncogene 37.42, pp. 5587–5604. doi: 10.1038/s41388-018-0348-3. Petsalaki Stark, Alexander, Eduardo Garćıa-Urdiales, and Robert B Russell (2009). “Accurate prediction of peptide binding sites on protein surfaces.” In: PLoS Com- putational Biology 5.3, e1000335. doi: 10.1371/journal.pcbi.1000335. Pfleger Seeber, Ruth M and Karin A Eidne (2006). “Bioluminescence resonance en- ergy transfer (BRET) for the real-time detection of protein-protein interactions.” In: Nature Protocols 1.1, pp. 337–345. doi: 10.1038/nprot.2006.52. Pierce, Michael M., C. S. Raman, and Barry T. Nall (1999). “Isothermal Titration Calorimetry of Protein–Protein Interactions”. In: Methods 19.2, pp. 213–221. doi: 10.1016/S1046-2023(99)00009-0. 
Puntervoll Linding, Rune, Christine Gemünd, Sophie Chabanis-Davidson, Morten Mattingsdal, Scott Cameron, David M A Martin, Gabriele Ausiello, Barbara Brannetti, Anna Costantini, Fabrizio Ferrè, Vincenza Maselli, Allegra Via, Gi- anni Cesareni, Francesca Diella, Giulio Superti-Furga, Lucjan Wyrwicz, Chenna Ramu, Caroline McGuigan, Rambabu Gudavalli, Ivica Letunic, Peer Bork, Leszek Rychlewski, Bernhard Küster, Manuela Helmer-Citterich, William N Hunter, Rein Aasland, and Toby J Gibson (2003). “ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins.” In: Nu- cleic Acids Research 31.13, pp. 3625–3630. doi: 10.1093/nar/gkg545. Ramirez-Martinez, Andres, Bercin Kutluk Cenik, Svetlana Bezprozvannaya, Beibei Chen, Rhonda Bassel-Duby, Ning Liu, and Eric N Olson (2017). “KLHL41 sta- bilizes skeletal muscle sarcomeres by nonproteolytic ubiquitination.” In: eLife 6. doi: 10.7554/{eLife}.26439. Rodŕıguez, J A and B R Henderson (2000). “Identification of a functional nuclear export sequence in BRCA1.” In: The Journal of Biological Chemistry 275.49, pp. 38589–38596. doi: 10.1074/jbc.M003851200. 
Rolland, Thomas, Murat Taşan, Benoit Charloteaux, Samuel J Pevzner, Quan Zhong, Nidhi Sahni, Song Yi, Irma Lemmens, Celia Fontanillo, Roberto Mosca, Atanas Kamburov, Susan D Ghiassian, Xinping Yang, Lila Ghamsari, Dawit Balcha, Bridget E Begg, Pascal Braun, Marc Brehme, Martin P Broly, Anne- Ruxandra Carvunis, Dan Convery-Zupan, Roser Corominas, Jasmin Coulombe- Huntington, Elizabeth Dann, Matija Dreze, Amélie Dricot, Changyu Fan, Eric Franzosa, Fana Gebreab, Bryan J Gutierrez, Madeleine F Hardy, Mike Jin, Shuli Kang, Ruth Kiros, Guan Ning Lin, Katja Luck, Andrew MacWilliams, Jörg 258 Menche, Ryan R Murray, Alexandre Palagi, Matthew M Poulin, Xavier Ram- bout, John Rasla, Patrick Reichert, Viviana Romero, Elien Ruyssinck, Julie M Sahalie, Annemarie Scholz, Akash A Shah, Amitabh Sharma, Yun Shen, Kerstin Spirohn, Stanley Tam, Alexander O Tejeda, Shelly A Wanamaker, Jean-Claude Twizere, Kerwin Vega, Jennifer Walsh, Michael E Cusick, Yu Xia, Albert-László Barabási, Lilia M Iakoucheva, Patrick Aloy, Javier De Las Rivas, Jan Tavernier, Michael A Calderwood, David E Hill, Tong Hao, Frederick P Roth, and Marc Vidal (2014). “A proteome-scale map of the human interactome network.” In: Cell 159.5, pp. 1212–1226. doi: 10.1016/j.cell.2014.10.050. 
Sahni, Nidhi, Song Yi, Mikko Taipale, Juan I Fuxman Bass, Jasmin Coulombe- Huntington, Fan Yang, Jian Peng, Jochen Weile, Georgios I Karras, Yang Wang, István A Kovács, Atanas Kamburov, Irina Krykbaeva, Mandy H Lam, George Tucker, Vikram Khurana, Amitabh Sharma, Yang-Yu Liu, Nozomu Yachie, Quan Zhong, Yun Shen, Alexandre Palagi, Adriana San-Miguel, Changyu Fan, Dawit Balcha, Amelie Dricot, Daniel M Jordan, Jennifer M Walsh, Akash A Shah, Xin- ping Yang, Ani K Stoyanova, Alex Leighton, Michael A Calderwood, Yves Jacob, Michael E Cusick, Kourosh Salehi-Ashtiani, Luke J Whitesell, Shamil Sunyaev, Bonnie Berger, Albert-László Barabási, Benoit Charloteaux, David E Hill, Tong Hao, Frederick P Roth, Yu Xia, Albertha J M Walhout, Susan Lindquist, and Marc Vidal (2015). “Widespread macromolecular interaction perturbations in human genetic disorders.” In: Cell 161.3, pp. 647–660. doi: 10.1016/j.cell. 2015.04.013. Sahni, Nidhi, Song Yi, Quan Zhong, Noor Jailkhani, Benoit Charloteaux, Michael E Cusick, and Marc Vidal (2013). “Edgotype: a fundamental link between genotype and phenotype.” In: Current Opinion in Genetics & Development 23.6, pp. 649– 657. doi: 10.1016/j.gde.2013.11.002. Santelli Leone, Marilisa, Chenlong Li, Toru Fukushima, Nicholas E Preece, Arthur J Olson, Kathryn R Ely, John C Reed, Maurizio Pellecchia, Robert C Liddington, and Shu-ichi Matsuzawa (2005). “Structural analysis of Siah1-Siah-interacting protein interactions and insights into the assembly of an E3 ligase multiprotein complex.” In: The Journal of Biological Chemistry 280.40, pp. 34278–34287. doi: 10.1074/jbc.M506707200. Schreiber, G, G Haran, and H-X Zhou (2009). “Fundamental aspects of protein- protein association kinetics.” In: Chemical Reviews 109.3, pp. 839–860. doi: 10. 1021/cr800373w. Schultz, J., F. Milpetz, P. Bork, and C.P. Ponting (1998). “SMART, a simple mod- ular architecture research tool: identification of signaling domains”. In: Proceed- ings of the National Academy of Sciences U.S.A. 
95.11, pp. 5857–5864. doi: 10.1073/pnas.95.11.5857. 259 Sekar, Rajesh Babu and Ammasi Periasamy (2003). “Fluorescence resonance energy transfer (FRET) microscopy imaging of live cell protein localizations”. In: Journal of Cell Biology 160.5, pp. 629–633. doi: 10.1083/jcb.200210140. Shaner Lambert, Gerard G., Andrew Chammas, Yuhui Ni, Paula J. Cranfill, Michelle A. Baird, Brittney R. Sell, John R. Allen, Richard N. Day, Maria Israelsson, Michael W. Davidson, and Jiwu Wang (2013). “A bright monomeric green flu- orescent protein derived from Branchiostoma lanceolatum”. In: Nature Methods 10, pp. 407–409. doi: 10.1038/nmeth.2413. Starita, Lea M, Muhtadi M Islam, Tapahsama Banerjee, Aleksandra I Adamovich, Justin Gullingsrud, Stanley Fields, Jay Shendure, and Jeffrey D Parvin (2018). “A Multiplex Homology-Directed DNA Repair Assay Reveals the Impact of More Than 1,000 BRCA1 Missense Substitution Variants on Protein Function.” In: American Journal of Human Genetics 103.4, pp. 498–508. issn: 00029297. doi: 10.1016/j.ajhg.2018.07.016. Stogios, Peter J and Gilbert G Privé (2004). “The BACK domain in BTB-kelch proteins.” In: Trends in Biochemical Sciences 29.12, pp. 634–637. doi: 10.1016/ j.tibs.2004.10.003. Sunyaev, S R, F Eisenhaber, I V Rodchenkov, B Eisenhaber, V G Tumanyan, and E N Kuznetsov (1999). “PSIC: profile extraction from sequence alignments with position-specific counts of independent observations.” In: Protein Engineering 12.5, pp. 387–394. doi: 10.1093/protein/12.5.387. Tadokoro Shattil, Sanford J, Koji Eto, Vera Tai, Robert C Liddington, Jose M de Pereda, Mark H Ginsberg, and David A Calderwood (2003). “Talin binding to integrin beta tails: a final common step in integrin activation.” In: Science 302.5642, pp. 103–106. issn: 1095-9203. doi: 10.1126/science.1086652. Taniguchi, Koji and Michael Karin (2018). “NF-, inflammation, immunity and can- cer: coming of age.” In: Nature Reviews. Immunology 18.5, pp. 309–324. doi: 10.1038/nri.2017.142. 
Tompa, Peter (2002). “Intrinsically unstructured proteins.” In: Trends in Biochem- ical Sciences 27.10, pp. 527–533. doi: 10.1016/s0968-0004(02)02169-2. ￿ (2011). “Unstructural biology coming of age.” In: Current Opinion in Structural Biology 21.3, pp. 419–425. doi: 10.1016/j.sbi.2011.03.012. ￿ (2012). “Intrinsically disordered proteins: a 10-year recap”. In: Trends in Bio- chemical Sciences 37.12. Available at: ptompa@vub.ac.be, pp. 509–516. doi: 10.1016/j.tibs.2012.08.009. Tompa, Peter, Norman E Davey, Toby J Gibson, and M Madan Babu (2014). “A mil- lion peptide motifs for the molecular biologist.” In: Molecular Cell 55.2, pp. 161– 169. doi: 10.1016/j.molcel.2014.05.032. Trepte Secker, Christopher, Soon Gang Choi, Julien Olivet, Eduardo Silva Ramos, Patricia Cassonnet, Sabrina Golusik, Martina Zenkner, Stephanie Beetz, Marcel 260 Sperling, Yang Wang, Tong Hao, Kerstin Spirohn, Jean-Claude Twizere, Michael A. Calderwood, David E. Hill, Yves Jacob, Marc Vidal, and Erich E. Wanker (2021). “A quantitative mapping approach to identify direct interactions within complexomes”. In: BioRxiv. doi: 10.1101/2021.08.25.457734. Trepte, Kruse, Kostova, Hoffmann, Buntru, Tempelmeier, Secker, Diez, Schulz, Klockmeier, Zenkner, Golusik, Rau, Schnoegl, Garner, and Erich Wanker (2018). “LuTHy: a double-readout bioluminescence-based two-hybrid technology for quantitative mapping of protein-protein interactions in mammalian cells.” In: Molecular Systems Biology 14.7, e8071. doi: 10.15252/msb.20178071. Uversky (2014). Intrinsically Disordered Proteins. Switzerland: Springer Interna- tional Publishing, pp. XV, 61. isbn: 978-3-319-08920-1. Uversky, Christopher J Oldfield, and A Keith Dunker (2005). “Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling.” In: Journal of Molecular Recognition 18.5, pp. 343–384. doi: 10.1002/jmr.747. Uyar, Bora, Robert J Weatheritt, Holger Dinkel, Norman E Davey, and Toby J Gib- son (2014). 
“Proteome-wide analysis of human disease mutations in short linear motifs: neglected players in cancer?” In: Molecular Biosystems 10.10, pp. 2626– 2642. doi: 10.1039/c4mb00290c. Valente Luini, Alberto and Daniela Corda (2013). “Components of the CtBP1/BARS-dependent fission machinery.” In: Histochemistry and Cell Biology 140.4, pp. 407–421. doi: 10.1007/s00418-013-1138-1. Van Roey, Kim, Toby J Gibson, and Norman E Davey (2012). “Motif switches: decision-making in cell regulation.” In: Current Opinion in Structural Biology 22.3, pp. 378–385. doi: 10.1016/j.sbi.2012.03.004. Van Roey, Kim, Bora Uyar, Robert J Weatheritt, Holger Dinkel, Markus Seiler, Aidan Budd, Toby J Gibson, and Norman E Davey (2014). “Short linear motifs: ubiquitous and functionally diverse protein interaction modules directing cell reg- ulation.” In: Chemical Reviews 114.13, pp. 6733–6778. doi: 10.1021/cr400585q. Velthuis, Aartjan J W te, Philippe A Sakalis, Donald A Fowler, and Christoph P Bagowski (2011). “Genome-wide analysis of PDZ domain binding reveals inherent functional overlap within the PDZ interaction network.” In: Plos One 6.1, e16047. doi: 10.1371/journal.pone.0016047. Vidal, Marc, Michael E Cusick, and Albert-László Barabási (2011). “Interactome networks and human disease.” In: Cell 144.6, pp. 986–998. issn: 1097-4172. doi: 10.1016/j.cell.2011.02.016. Visscher, Peter M, Matthew A Brown, Mark I McCarthy, and Jian Yang (2012). “Five years of GWAS discovery.” In: American Journal of Human Genetics 90.1, pp. 7–24. doi: 10.1016/j.ajhg.2011.11.029. 261 Vogel, Christine, Carlo Berzuini, Matthew Bashton, Julian Gough, and Sarah A. Teichmann (Year). “Supra-domains: Evolutionary units larger than single protein domains”. In: Journal Name Volume.Issue, pages. doi: 10.XXXX/XXXXXX. Vogel, Steven S, Christopher Thaler, and Srinagesh V Koushik (2006). “Fanciful FRET”. In: Science’s STKE 2006.331, re2. doi: 10.1126/stke.3312006re2. 
Wakeling, Emma, Meriel McEntagart, Michael Bruccoleri, Charles Shaw-Smith, Karen L Stals, Matthew Wakeling, Angela Barnicoat, Clare Beesley, DDD Study, Andrea K Hanson-Kahn, Mary Kukolich, David A Stevenson, Philippe M Campeau, Sian Ellard, Sarah H Elsea, Xiang-Jiao Yang, and Richard C Caswell (2021). “Missense substitutions at a conserved 14-3-3 binding site in HDAC4 cause a novel intellectual disability syndrome.” In: HGG advances 2.1, p. 100015. doi: 10.1016/j.xhgg.2020.100015. Wang, Jia Chen, and Mingjie Zhang (2010). “Extensions of PDZ domains as im- portant structural and functional elements.” In: Protein & cell 1.8, pp. 737–751. doi: 10.1007/s13238-010-0099-6. Wang, Jiyao, Farideh Chitsaz, Myra K Derbyshire, Noreen R Gonzales, Marc Gwadz, Shennan Lu, Gabriele H Marchler, James S Song, Narmada Thanki, Roxanne A Yamashita, Mingzhang Yang, Dachuan Zhang, Chanjuan Zheng, Christopher J Lanczycki, and Aron Marchler-Bauer (2023). “The conserved domain database in 2023.” In: Nucleic Acids Research 51.D1, pp. D384–D388. doi: 10.1093/nar/ gkac1096. Wang Kruhlak, M J, J Wu, N R Bertos, M Vezmar, B I Posner, D P Bazett-Jones, and X J Yang (2000). “Regulation of histone deacetylase 4 by binding of 14- 3-3 proteins.” In: Molecular and Cellular Biology 20.18, pp. 6904–6912. doi: 10.1128/{MCB}.20.18.6904-6912.2000. Weatheritt, Robert J and Toby J Gibson (2012). “Linear motifs: lost in (pre)translation.” In: Trends in Biochemical Sciences 37.8, pp. 333–341. doi: 10.1016/j.tibs.2012.05.001. Wegener Partridge, Anthony W, Jaewon Han, Andrew R Pickford, Robert C Lid- dington, Mark H Ginsberg, and Iain D Campbell (2007). “Structural basis of integrin activation by talin.” In: Cell 128.1, pp. 171–182. issn: 0092-8674. doi: 10.1016/j.cell.2006.10.048. Wierbowski, Shayne D., Robert Fragoza, Siqi Liang, and Haiyuan Yu (2018). “Ex- tracting complementary insights from molecular phenotypes for prioritization of disease-associated mutations”. 
In: Current Opinion in Systems Biology 11, pp. 107–116. issn: 24523100. doi: 10.1016/j.coisb.2018.09.006. Williams, R S, R Green, and J N Glover (2001). “Crystal structure of the BRCT repeat region from the breast cancer-associated protein BRCA1.” In: Nature Structural Biology 8.10, pp. 838–842. issn: 1072-8368. doi: 10.1038/nsb1001- 838. 262 Wilson, Carter J, Wing-Yiu Choy, and Mikko Karttunen (2022). “Alphafold2: A role for disordered protein/region prediction?” In: International Journal of Molecular Sciences 23.9, p. 4591. doi: 10.3390/ijms23094591. Wright, Dyson (1999). “Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm.” In: Journal of Molecular Biology 293.2, pp. 321– 331. doi: 10.1006/jmbi.1999.3110. ￿ (2015). “Intrinsically disordered proteins in cellular signalling and regulation.” In: Nature Reviews. Molecular Cell Biology 16.1, pp. 18–29. doi: 10.1038/nrm3920. Xu Piston, D W and C H Johnson (1999). “A bioluminescence resonance energy transfer (BRET) system: application to interacting circadian clock proteins.” In: Proceedings of the National Academy of Sciences of the United States of America 96.1, pp. 151–156. doi: 10.1073/pnas.96.1.151. Yuen, Michaela and Coen A C Ottenheijm (2020). “Nebulin: big protein with big responsibilities.” In: Journal of muscle research and cell motility 41.1, pp. 103– 124. doi: 10.1007/s10974-019-09565-3. Zhang, Yingnan, Brent A Appleton, Christian Wiesmann, Ted Lau, Mike Costa, Rami N Hannoush, and Sachdev S Sidhu (2009). “Inhibition of Wnt signaling by Dishevelled PDZ peptides.” In: Nature Chemical Biology 5.4, pp. 217–219. doi: 10.1038/nchembio.152. 
Zhong, Quan, Nicolas Simonis, Qian-Ru Li, Benoit Charloteaux, Fabien Heuze, Niels Klitgord, Stanley Tam, Haiyuan Yu, Kavitha Venkatesan, Danny Mou, Venus Swearingen, Muhammed A Yildirim, Han Yan, Amélie Dricot, David Szeto, Chenwei Lin, Tong Hao, Changyu Fan, Stuart Milstein, Denis Dupuy, Robert Brasseur, David E Hill, Michael E Cusick, and Marc Vidal (2009). “Edgetic per- turbation models of human inherited disorders.” In: Molecular Systems Biology 5, p. 321. doi: 10.1038/msb.2009.80. Zhou, Huan-Xiang (2012). “Intrinsic disorder: signaling via highly specific but short- lived association.” In: Trends in Biochemical Sciences 37.2, pp. 43–48. doi: 10. 1016/j.tibs.2011.11.002. 263 Dalmira Hubrich +49 157 34517760 / dalmiramer@gmail.com / Mainz, Germany / LinkedIn / GitHub Profile As a Systems Biologist with a focus on protein networks and interfaces, I gained a solid background in systematic experimental biology during my PhD, where I also began learning computational skills, including Python and SQL. While my primary expertise lies in experimental techniques, I am now expanding into bioinformatics and computational biology, aiming to work more extensively with biological data. I also have a growing interest in artificial intelligence and its applications in biological research, with a focus on enhancing my computational skills. Professional Experience Researcher December 2020-present Institute of Molecular Biology, Germany ● Established and adapted several techniques (e.g., cloning, site-directed mutagenesis, BRET assay, bioluminescent imaging) in the lab. ● Curated, extracted, and visualized diverse biological datasets required for my study. ● Provided experimental data and visualization support to PhD students and colleagues. ● Delivered results and contributed to collaborations across multiple projects. Junior Data Scientist (Part-time) April 2023 - August 2023 Be Factory UG ● Curated, cleaned, and processed large datasets. 
● Developed a tool for feature extraction and trained a machine learning model to evaluate products based on score results.
● Built an entity-relationship model in SQL to manage data more effectively.
● Automated data management processes to streamline workflows.

Education
● Doctor of Philosophy in Life Sciences | Johannes Gutenberg University, Germany | December 2020 - present
● Master in Protein Engineering and Biochemistry | Okinawa Institute of Sci & Tech | August 2017-2019
● Bachelor of Biological Sciences | Nazarbayev University | September 2011-August 2016

Skills
● Proficient in conducting systematic experimental assays to detect protein-protein interactions (PPIs), including BRET and ITC.
● Experienced with high-content screening equipment, such as Opera Phenix, for live-cell imaging, and with software like Harmony for comprehensive image analysis.
● Trained in Python OOP and relevant packages, including pandas, scipy, and numpy for data management and analysis; matplotlib and seaborn for data plotting and visualization; and scikit-learn for machine learning models.
● Proven track record with over four years of experience successfully overseeing various projects and collaborating with interdisciplinary teams.
● Competent in presenting complex concepts to diverse audiences.
● Organized, adaptable, and always eager to learn and deliver results.
● Experienced with software tools like Git, Bash, Visual Studio Code, PyCharm, SciWheel, Microsoft 365, Miro, Notion, and Adobe Illustrator.
● Languages: English (fluent), Russian & Kazakh (native speaker), German (A2, ongoing)