Systematic interaction interface and
variant characterization using
protein interaction profiling
Dissertation
Submitted for the degree of
Doctor of Natural Sciences
at the Faculty of Biology
of the Johannes Gutenberg-Universität Mainz
Dalmira Hubrich
born on 28.07.1993 in Qostanay, Kazakhstan
Mainz, October 2024
Dean: Prof. Dr. Eckhard Thines
1st reviewer: Prof. Dr. Brian Luke
2nd reviewer: Dr. Anton Khmelinskii
Date of oral examination: 21.10.2024
"The important thing is to never stop questioning." – Albert Einstein
Acknowledgements
First of all, I would like to express my deepest gratitude to my thesis supervisor,
Katja Luck, for her contagious enthusiasm, inspiration, and guidance throughout
my journey into the fascinating world of protein interactions. Her constant support,
patience, and willingness to engage in discussions on any subject at any time have
been invaluable. Besides keeping me motivated, she also taught me what it means to
be a true scientist: how to embrace doubt, maintain a healthy sense of uncertainty,
and always stay critical and specific in my work. I am especially thankful for her
significant contributions to my publications, her immense effort in mentoring and
training, and her unwavering dedication to the development and completion of this
thesis.
Moreover, I am deeply appreciative of the opportunities she provided to discuss
and collaborate with other scientists, which allowed me to feel part of a larger
scientific community. This sense of belonging and engagement has been instrumental
in my development as a researcher.
I am immensely thankful to the Luck research group for their constant support,
both in the lab setting and with their computational efforts. Their professionalism,
collaboration, and willingness to contribute to this project have been invaluable. I
am particularly grateful to my colleagues for helping establish protocols, sharing
knowledge, and always being there to lend a hand. I am also grateful to our lab
manager, Mareen Welzel, for her unwavering support and for keeping our spirits
high with a steady supply of chocolate. Her sweet contributions not only made our
workdays brighter but also helped us power through many challenging experiments!
Special thanks to Dr. Chop Yan Lee, my partner in crime, for our successful team-
work and collaboration. Beyond the lab, he has also become a dear friend, and I am
grateful for his presence and support throughout my PhD journey.
I would also like to extend my sincere thanks to my TAC committee, Prof. Dr.
Brian Luke and Dr. Sandra Schick, for their invaluable contributions to my work.
Their advice and shared experiences helped me grow as a scientist, enhanced my
learning curve, and contributed greatly to my progress throughout this journey.
I would also like to extend my sincere thanks to Dr. Julian Konig and his research
group, particularly Stefanie Ebersberger and Dr. Miriam Murloz, as well as Prof.
Dr. Michael Sattler and his group, especially Dr. Klara Hipp, Dr. Hyun-Seo Kang,
and Dr. Santiago Martinez-Lumbreras. Being part of such a fruitful collaboration
was a valuable experience, and I am deeply grateful for the opportunity to engage
in meaningful discussions, share ideas, and learn from each of them.
I would also like to extend my gratitude to the Protein Production and Mi-
croscopy facilities and the Media Lab at the IMB Institute for their exceptional
support. Their assistance with producing efficient reagents, their expert consulting,
and the provision of cutting-edge equipment were crucial in addressing the scientific
questions in my study.
I would like to thank the Emmy Noether funding, which provided me with the
opportunity to pursue my PhD. This support has been instrumental in addressing
significant scientific questions, contributing to new knowledge, and applying current
insights to better benefit human society. I would like to express my gratitude to the
PhD program and the IMB community for the invaluable experience of pursuing
my PhD. The chance to meet and collaborate with esteemed scientists, exchange
knowledge, and learn from recognized experts has been a profound learning experi-
ence. This opportunity has not only deepened my understanding of science but also
helped me appreciate what it means to be a scientist.
Finally, my heartfelt thanks go to my family, especially my dearest husband and
best friend, Yannik Hubrich. His unwavering support, patience, and belief in me
throughout my PhD journey have been invaluable. I cannot imagine reaching this
point without his constant encouragement and understanding. His sacrifices and
steadfast presence have been a cornerstone of my success. Yannik has been a true
partner in every sense, sharing in the highs and lows, and his love and support have
been a source of strength and inspiration. I am deeply grateful for his belief in me
and for being my rock throughout this demanding journey. I am also grateful to
my dog, Sushi, who has been a calm and patient companion during the demanding
times when I had to fully immerse myself in science. His quiet presence and unspoken
understanding have been a source of comfort and joy.
I am also grateful to my parents and my sister for their support and
their belief in me. Even though we are far apart, they constantly
support my drive to learn, explore, and build a career. Their
unconditional love always warms my heart and motivates me to become
better for them. Thank you for instilling in me a love of knowledge!
Contents
1 Introduction
1.1 The Complexity of Human Genetic Variation
1.1.1 Factors contributing to the complexity of variant interpretation
1.1.2 Comparative PPI profiling as the strategy to interpret variant effect
1.2 Modular architecture of proteins
1.2.1 Folded domains
1.2.2 Intrinsically disordered regions
1.2.3 Short linear motifs
1.3 Domain-motif interfaces
1.4 Predicting the known occurrence of DMIs in protein interactions using sequence-based approaches
1.5 Systematic experimental validation of putative DMIs
1.6 Aims of the thesis
2 The development of the medium-throughput cloning and the BRET assay pipeline for the experimental validation of predicted DMIs
2.1 Preparation of the wild-type human ORFeome collection
2.2 The assessment of the sensitivity of BRET assay
2.3 Article I: FUBP1 is a general splicing factor facilitating 3’ splice site recognition and splicing of long introns
2.3.1 Supplementary material
2.4 Article II: Systematic discovery of protein interaction interfaces using AlphaFold and experimental validation
2.4.1 Supplementary material
3 Systematic domain-motif interaction interface and variant characterization using protein interaction profiling
3.1 Development of domain-motif interface predictor tool
3.1.1 The workflow of the DMI predictor
3.1.2 The application of the tool on HuRI PPI dataset
3.2 Integrating ClinVar mutation data with putative DMIs mapped on HuRI
3.3 The data-driven approach to select disease-associated proteins and PPIs suitable for the experimental validation of DMIs
3.3.1 Retestement of PPIs using BRET assay
3.3.2 Testing the localization of the wild-type proteins and mutants using Bioluminescence Imaging
3.3.3 Validation of DMI predictions
3.4 The application of the strategy of the variant effect on PPIs
4 Conclusion and future perspectives
4.1 Deciphering protein interaction interfaces using DMI predictor tool
4.2 The application of DDI predictor and AlphaFold to map the PPI data with interaction interfaces
4.3 Enhancing Predictive Accuracy of Variant Effects and Mutation Design through Positioning on Predicted AF-MM Interface Structures
4.4 Improvement of the BRET assay to validate the predicted interfaces
4.5 General outlook
Appendix
5 Appendix
5.1 Protocols
5.1.1 The medium-throughput cloning protocol
5.1.2 The medium-throughput site-directed mutagenesis
5.1.3 Figures
Bibliography
Chapter 1
Introduction
1.1 The Complexity of Human Genetic Variation
Genetic variation is a primary factor in evolution, driving the
appearance of new phenotypes with various degrees of adaptability to
environmental factors. Human genetic variation is defined as the
diversity in DNA sequences and genetic characteristics among the
individuals within populations (Alberts et al. 2002).
These variations arise from replication errors or spontaneous
nucleotide alterations that occur during DNA replication in cell
division. In addition to these endogenous factors, exogenous influences
such as radiation or chemicals can also cause changes in the genome.
Genetic variation occurs at different scales, from structural variants
to single-point mutations. Structural variants usually span up to 1 Mb
and occur at the chromosome level (e.g. fragile sites) (Chaisson et al.
2019). In contrast, small variants range from duplications, deletions,
insertions and inversions to single nucleotide polymorphisms or SNPs
(Nesta et al. 2021; Figure 1.1).
Figure 1.1: Small and structural variants. On the left are the small variants
like single nucleotide variants (SNVs), insertion and deletion (indel). On the right,
examples of up to 1Mb changes like inversion, deletion, insertion, duplication and
translocation that constitute structural variants are shown. In each example, the top
chromosome is the reference and the variation is highlighted and displayed below.
Small variants are far more abundant than structural variants: about
85 million SNPs compared to 69,000 structural variants have been found
in the human genome (Consortium et al. 2015). These mutations can
affect coding and non-coding regions or splice sites of the
genome. They can be inherited or occur de novo in the germline. Many
of these mutations are linked to various diseases, as they can alter pro-
tein structure or function, disrupt cellular processes, and contribute to
disease phenotypes (Visscher et al. 2012).
Over the last decade, the genomics field has advanced rapidly. The
development and application of large-scale next-generation sequencing
(NGS), such as whole genome sequencing (WGS) and whole exome sequencing
(WES), has significantly expanded the capacity for comprehensive
analysis of genetic variation. WGS sequences the entire genome of an
organism, including coding and non-coding regions, and captures all
genetic variations, whereas WES focuses solely on the exome, i.e. the
coding regions of the genome. Both methods have advantages and
limitations that should be taken into account. WES is more
cost-effective than WGS and useful for identifying mutations that
affect protein function. On the other hand, it misses variations in
non-coding regions that can also impair gene regulation and cause
disease. Additionally, it is less effective than WGS at finding
structural variants. To further expand our understanding of genetic
variation, large-scale initiatives like the Exome Aggregation
Consortium (ExAC) have emerged. ExAC contains the exomes of 60,706
individuals, providing an extensive catalog of genetic variants across
whole exomes (Lek et al. 2016). With the integration of genome data,
ExAC evolved into the Genome Aggregation Database (gnomAD). gnomAD is
the largest public open-access human allele frequency reference
database, containing exome sequencing data from 730,947 individuals and
genome sequencing data from 76,215 individuals.
Each individual’s data is annotated with population information.
gnomAD includes data from various populations such as African, Latino,
East Asian, South Asian, European, and others. The collected sequencing
data is computationally processed to identify variants like SNPs,
insertions, deletions, and other types of genetic variation. To date,
it houses 786,500,648 single nucleotide variants, 122,583,462 indels
and over 1.2 million genome-level structural variants from 807,162
individuals (gnomAD 2024). For each variant, the allele frequency is
calculated as the number of times the variant appears divided by the
total number of alleles observed at that position in the population.
Researchers use this database to derive threshold levels of variant
frequency within and across different populations. Knowledge of these
thresholds helps in understanding whether a variant is common or rare
globally or only within specific populations (Karczewski et al. 2020).
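The allele-frequency calculation described above can be sketched in a few lines. This is a minimal illustration and not gnomAD's actual pipeline: the function names, the 1 % rarity threshold, the population labels and the counts are assumptions chosen for the example.

```python
def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Return AF = AC / AN for one variant in one population.

    allele_count  (AC): number of times the variant allele was observed
    allele_number (AN): total number of alleles observed at that position
    """
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    if not 0 <= allele_count <= allele_number:
        raise ValueError("allele_count must lie between 0 and allele_number")
    return allele_count / allele_number


def is_rare(af: float, threshold: float = 0.01) -> bool:
    """Flag a variant as rare; the 1 % cut-off is an illustrative assumption."""
    return af < threshold


# Hypothetical (AC, AN) counts for one variant in two populations
counts = {"population_A": (12, 40_000), "population_B": (900, 30_000)}
frequencies = {pop: allele_frequency(ac, an) for pop, (ac, an) in counts.items()}

# Populations in which the variant would count as rare under the assumed threshold
rare_in = sorted(pop for pop, af in frequencies.items() if is_rare(af))
```

Computing the frequency per population, rather than only globally, is what allows a variant to be recognized as rare in one population and common in another.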
While gnomAD provides valuable information on the frequency of
genetic variants, it is not sufficient to determine which variants might
be disease-causing. This is because population data alone lacks clinical
and phenotypic information. For this purpose, patient data is essential.
Sequenced patient data allows researchers to prioritize genetic variants
with potential associations with specific diseases. To prioritize
variants, patient data is compared with population data (control) such
as that provided by gnomAD. This comparison involves a frequency
analysis to examine whether a variant is more frequent in patients with
the disease than in healthy controls. Variants that are rare or absent
from the general population but found in affected individuals may be
prioritized for further study. Next, provided that a sufficient amount
of patient data is available, statistical analysis is applied to assess
whether the frequency of a variant is significantly higher in patients
with the disease, which helps identify variants that are statistically
associated with the disease. The patient information is also used in
studying the inheritance patterns within families to see if a variant co-
segregates with the disease, which helps confirm its potential causative
role. Finally, these variants found through the comparative analysis
might undergo downstream functional studies.
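The case-control comparison described above can be illustrated with a one-sided Fisher's exact test. The sketch below is a hand-written implementation based on the hypergeometric distribution, using only the Python standard library; the function name and the carrier counts are hypothetical and do not come from any study cited here.

```python
from math import comb


def fisher_exact_greater(case_carriers: int, case_total: int,
                         control_carriers: int, control_total: int) -> float:
    """One-sided Fisher's exact test for enrichment of carriers among cases.

    Returns P(X >= case_carriers), where X is the number of carriers among
    the cases under the hypergeometric null model with fixed table margins.
    """
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    denominator = comb(total, case_total)
    p_value = 0.0
    for k in range(case_carriers, min(carriers, case_total) + 1):
        # k carriers among the cases, remaining case slots taken by non-carriers
        p_value += comb(carriers, k) * comb(total - carriers, case_total - k) / denominator
    return p_value


# Hypothetical example: 8 carriers among 1000 patients, 1 among 1000 controls
p_example = fisher_exact_greater(8, 1000, 1, 1000)
enriched_in_patients = p_example < 0.05
```

A small p-value only indicates statistical association; as the text notes, segregation analysis and functional studies are still needed to support a causative role.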
The need for comprehensive clinical data to better understand genetic
variants led to the development of patient databases. They are archives
of variants reported in patients and submitted by clinical testing
laboratories, research laboratories, locus-specific databases, expert
panels, and other groups. The largest patient variant database
currently known is ClinVar (Landrum et al. 2016). It is maintained at
the National Center for Biotechnology Information (NCBI) within the
National Library of Medicine at the National Institutes of Health.
Submissions to ClinVar must include a description of the variant(s),
the interpreted condition, the clinical significance, an optional mode
of inheritance, and supporting evidence.
Variants in ClinVar are classified based on the available evidence,
including genetic studies, population frequency data, computational
predictions, functional assays, and clinical observations. Pathogenic
variants are supported by a statistical association with disease in
large studies, evidence of segregation with disease in families,
functional studies showing deleterious effects on gene or protein
function, and consistent clinical observations in affected individuals.
Benign variants are those that have been found frequently in healthy
individuals. Variants that fall in between, due to factors such as low
frequency, insufficient population data, lack of functional studies or
conflicting evidence, are classified as variants of uncertain
significance or VUS. Currently, ClinVar stores over 4 million submitted
records and 2,966,675 genetic variants (Landrum et al. 2016; ClinVar
Miner 2024). Of these, 1,527,893 variants, roughly half, are VUS
(Figure 1.2).
Figure 1.2: The variant distribution in ClinVar database. The bar plot on
the left displays the total number of all variants (black) and the number of VUS
(Variants of Uncertain Significance) (gray). On the right, the bar plot illustrates
the distribution of various types of variants present in ClinVar for both the 2022 and
2023 versions.
Despite the remarkable advancements in sequencing technologies and
increased data availability for research, the main challenge persists:
the vast majority of variants remain poorly characterized. Their number
continues to grow exponentially each year, posing significant
challenges for research and clinical practice. The accumulation of
uncharacterized variants without corresponding progress in variant
interpretation limits our understanding of disease mechanisms.
Consequently, clinicians may struggle to interpret genetic test
results, leading to potential delays in diagnosis and in the
development of precision medicine, where the treatment is tailored to
each patient. For patients, this uncertainty can result in anxiety,
unnecessary treatments and missed opportunities for early intervention.
Why, then, is it so challenging to characterize variant effects?
To address this question, it is important to understand the underlying
reasons behind the complexity of variant interpretation. I discuss them
in the next subsection.
1.1.1 Factors contributing to the complexity of variant inter-
pretation
The complexity of variant interpretation can be attributed to three pri-
mary reasons. First, the architecture of most diseases is highly complex,
involving multiple genes. This means that a single variant may not have
a straightforward impact on disease risk. Instead, its effect might be
modulated by other factors like the interactions between gene products,
the presence of other genetic variants or environmental factors. Even in
Mendelian disorders, where one gene is the primary cause of the disease,
the severity, onset and progression of a disease can still be influenced by
additional genetic factors.
Second, many uncharacterized variants occur infrequently in the human
population. These so-called rare variants, which are carried by only a
small number of individuals, present a significant challenge.
Additionally, every healthy individual carries on average about 60 de
novo mutations (DNMs) that arise spontaneously and are not inherited
from either parent. These so-called ultra-rare variants can be
extremely difficult to interpret (Figure 1.3). The low frequency of
rare and ultra-rare variants results in small sample sizes, reducing
the statistical power of tests. Statistical power refers to the ability
of a test to detect a true effect when it exists. For instance, if a
rare variant is present in only 0.1 % of the population, only about 1
person in a study of 1000 participants is expected to carry that
variant. The statistical power will be too low because, with so few
carriers, a true association can hardly be distinguished from random
fluctuations in the data. With too few individuals carrying a rare
variant, it becomes nearly impossible to apply standard statistical
tests (e.g. Chi-square, Fisher’s exact tests) effectively.
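The numbers in the example above can be made concrete with a short calculation. The sketch assumes that 0.1 % refers to the fraction of individuals carrying the variant and that the 1000 participants are split evenly into 500 patients and 500 controls; both the split and the function names are assumptions for illustration.

```python
def expected_carriers(carrier_frequency: float, participants: int) -> float:
    """Expected number of carriers in a study of the given size."""
    return carrier_frequency * participants


def prob_no_carrier(carrier_frequency: float, participants: int) -> float:
    """Probability that the study contains no carrier at all."""
    return (1.0 - carrier_frequency) ** participants


def best_case_p_single_carrier(cases: int, controls: int) -> float:
    """Smallest one-sided Fisher p-value when the study has exactly one carrier.

    With a single carrier overall, the chance that this carrier is a case
    under the null hypothesis is simply cases / (cases + controls).
    """
    return cases / (cases + controls)


expected = expected_carriers(0.001, 1000)        # carriers expected in the study
p_none = prob_no_carrier(0.001, 1000)            # chance of sampling zero carriers
p_best = best_case_p_single_carrier(500, 500)    # assumed even case/control split
```

Under these assumptions, even the most favourable outcome (the single carrier being a patient) cannot produce a p-value below the conventional 0.05 threshold, which is the loss of statistical power the text describes.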
Figure 1.3: Schematic representation of the challenge in statistically
associating de novo variants with disease. Every healthy individual on
average carries about 60 new coding mutations, most of them being
ultra-rare in the human population. The low occurrence of these
mutations makes it hard to perform statistical association and to
discriminate pathogenic from benign variants.
Third, the impact of genetic variants on protein function can vary
widely, spanning from benign alterations with no distinct effect to
mutations that cause severe dysfunction of the protein or disease
(Figure 1.4). The traditional view, in which a mutation destabilizes
the protein and leads to a loss of function (LoF), has evolved.
Consistent with that view, nonsense mutations introduce a premature
stop codon that leads to a truncated version of the protein, which is
often misfolded or destabilized. Such a mutation might cause a severe
phenotype. For example, a nonsense mutation in the DMD gene results in
the dysfunctional protein mutant Glu1157TER, causing muscular
degeneration and Duchenne Muscular Dystrophy (DMD) (Bulman et al.
1991). Likewise, frameshift mutations, which result from deletions or
insertions and alter the gene’s reading frame, can cause a total LoF
due to extensive missense sequences followed by premature termination.
Figure 1.4: Overview of the various effects of mutation on PPIs. Mu-
tations can have various effects: they may have no impact, destabilize and unfold
the protein, cause a gain-of-function effect or partially affect protein function by
disrupting PPI.
Although these types of mutations are known to be the most detrimental
and to cause disease, the reality is more complex, as in the case of
different mutations in the same gene causing distinct clinical outcomes
(Zhong et al. 2009). This complexity arises because genes and gene
products do not function in isolation but in constant interaction with
each other, building interactome networks (Vidal et al. 2011). Goh et
al. (2007) proposed that some mutations cause a partial loss of
function by perturbing the interactions within this complex network.
Studies suggest that about 30-60 % of all pathogenic missense variants
are destabilizing, whereas up to 30 % of pathogenic missense mutations
disrupt protein-protein interactions (PPIs), affecting some while
leaving others intact (Sahni et al. 2013). Given this information, how
can we use it efficiently in variant characterization and learn about
the potential mechanism of the disease?
1.1.2 Comparative PPI profiling as the strategy to interpret
variant effect
Previous studies highlight the potential of using protein-protein
interaction (PPI) data to interpret variant effects (Zhang et al. 2009;
Vidal et al. 2011; Sahni et al. 2013). These interactions are
represented as graphical networks, where proteins are displayed as
"nodes" and the interactions between them as "edges". One promising
experimental strategy that leverages this PPI network-based data is
edgotyping, which aims to systematically characterize variants and
reveal molecular mechanisms potentially underlying disease (Sahni et
al. 2015; Siwei Chen et al. 2018; Wierbowski et al. 2018; Starita et
al. 2018; Fragoza et al. 2019). The idea is to test the effect of
benign, pathogenic, and uncharacterized mutations within a protein on
the protein’s interactions with its binding partners and to compare it
with the wild-type interactions. This comparison helps to identify
"edgetic" effects, where a mutation disrupts some interactions while
leaving others intact. By comparing the resulting PPI profiles of
benign, pathogenic, and uncharacterized variants, we can predict the
pathogenicity of an uncharacterized variant based on whether the
interactions it perturbs are similar to those perturbed by the
pathogenic variant. Moreover, the perturbation data might be
informative about potential mechanistic causes of the disease. This
makes the strategy powerful for elucidating functional consequences of
variants on PPIs and provides insight into variant contributions to
disease that goes beyond traditional sequence-based studies.
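The profile comparison described above can be sketched as follows. The example represents each variant by the set of wild-type interactions it disrupts and compares an uncharacterized variant with reference variants using Jaccard similarity. The similarity measure, the function names and all partner and variant labels are assumptions made for illustration; they are not the scoring scheme of the cited edgotyping studies.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets of disrupted interaction partners."""
    if not a and not b:
        return 1.0  # two profiles without any disruption are identical
    return len(a & b) / len(a | b)


def closest_reference(vus_profile: set, references: dict) -> str:
    """Return the label of the most similar reference profile.

    Returns "unresolved" when several references are equally similar.
    """
    scores = {label: jaccard(vus_profile, profile)
              for label, profile in references.items()}
    best = max(scores.values())
    winners = [label for label, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else "unresolved"


# Hypothetical reference profiles: partners whose interaction is lost
references = {
    "pathogenic_variant": {"partner_A", "partner_B"},
    "benign_variant": set(),
}

calls = {
    "vus_1": closest_reference({"partner_A", "partner_B"}, references),
    "vus_2": closest_reference(set(), references),
    "vus_3": closest_reference({"partner_C"}, references),
}
```

In this toy setting, a variant that loses the same partners as the pathogenic reference is grouped with it, while a variant disrupting an unrelated partner remains unresolved and would need further evidence.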
Available human PPI datasets
To perform edgotyping, access to comprehensive and reliable protein
interaction datasets is crucial. Over the past 15 years, high-quality
human PPIs have been generated using large-scale approaches such as
yeast two-hybrid (Y2H) and affinity purification coupled with mass
spectrometry (AP-MS). These approaches have become particularly
instrumental in mapping PPIs and generating large-scale reference
protein interactome datasets (Rolland et al. 2014; Luck et al. 2020;
Huttlin et al. 2021). Rolland et al. (2014) made a significant
contribution to this field by presenting a broadened version of the
human interactome, HI-II-14, consisting of about 14,000 distinct
protein interaction pairs determined and confirmed by three binary PPI
assays. This dataset has been instrumental in research focused on
variant characterization. The authors further investigated the overall
biological relevance of this PPI dataset by comparing mutations
associated with human disorders with common variants, which showed no
functional consequences on biophysical interactions. They showed that
more than 55 % of the 107 tested PPIs were perturbed by at least one
disease-associated variant. For example, the A129T mutation in the
AANAT protein, which is known to be associated with delayed sleep phase
syndrome, specifically disrupted the interaction with BHLHE40, a
protein involved in the regulation of circadian rhythm. Another study
overlapped HI-II-14 with nearly 2000 mutations and identified 298
disruptive variants affecting almost 700 human protein interactions
(Fragoza et al. 2019).
In 2020, the human reference interactome (HuRI) project unveiled the
largest binary interaction map of human proteins. The project employed
the Y2H technique, in which two proteins are co-expressed in yeast
cells and a reporter gene is activated if they physically interact, and
performed a total of 3 billion individual tests. This monumental effort
generated a dataset of over 50,000 high-confidence binary interactions
between approximately 17,000 human proteins. The depth and scale of
this study significantly enhanced our understanding of the human
interactome and provided valuable datasets for elucidating the
functional impact of patient variants on protein-protein interactions
(Luck et al. 2020). Luck et al. (2020) also showcased the application
of HuRI in elucidating the mechanistic effect of missense variants on
PPIs within specific disease contexts. Mutations in PNKP have been
associated with microcephaly, seizures and developmental delay. The
authors showed that the pathogenic mutation Glu326Lys in PNKP disrupted
the interaction with TRIM37, which is predominantly expressed in the
brain. Together, these studies demonstrate that systematically
generated human interactome maps may significantly help in variant
characterization.
In parallel, the BioPlex project generated a reference human
interactome using the AP-MS technique. This study involved the
systematic purification of proteins together with their bound potential
binding partners from cells, followed by mass spectrometric analysis of
the protein complexes. BioPlex mapped about 120,000 direct and indirect
protein interactions (Huttlin et al. 2017; Huttlin et al. 2021). It
was also employed for variant characterization, identifying how
mutations affect not only direct interactions but also complexes
relevant to diseases. As a result, it offers a broader view of the
functional impact of variants on protein complexes than Y2H. However, a
Y2H-derived interactome might be more useful for studying variant
effects, as it provides the binary protein interactions essential for
comparative PPI profiling. This approach tests how a mutated protein
affects each specific interaction, enabling the creation of mutation
profiles and their comparison with those of the wild-type protein and
its partners (Idrees et al. 2024).
Application of edgotyping strategy
Several successful attempts have been made to apply this approach
(Sahni et al. 2015; Siwei Chen et al. 2018; Wierbowski et al. 2018;
Starita et al. 2018; Fragoza et al. 2019). For example, Sahni et al.
(2015) generated interaction profiles for 460 mutant proteins and their
220 wild-type counterparts using the yeast two-hybrid (Y2H) interaction
assay and found 521 perturbed interactions out of 1,316 PPIs. This
huge experimental effort led to the characterization of 197 mutations,
of which 26 % caused a complete loss of interactions, 31 % were edgetic
and 43 % caused no change in PPIs. Later, Fragoza et al. (2019)
employed the same assay and identified 298 of 1,676 tested missense
population variants that disrupted 669 human PPIs. They also used
follow-up experiments to further elucidate the effect of mutations on
protein function. Taken together, these attempts showcase how shared
disruption profiles can be used to prioritize candidate
disease-associated mutations.
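The three outcome classes mentioned above (complete loss of interactions, edgetic, no change) can be expressed as a simple rule on interaction profiles. The sketch below is an illustrative simplification: it only considers lost interactions and ignores gained ones, and the function name, partner labels and variant labels are hypothetical rather than taken from the cited studies.

```python
from collections import Counter


def classify_profile(wild_type_partners: set, mutant_partners: set) -> str:
    """Classify a mutant by the wild-type interactions it has lost."""
    if not wild_type_partners:
        raise ValueError("the wild-type protein needs at least one partner")
    lost = wild_type_partners - mutant_partners  # gained partners are ignored
    if not lost:
        return "no change"
    if lost == wild_type_partners:
        return "complete loss"
    return "edgetic"


# Hypothetical wild-type profile and three mutant profiles
wild_type = {"partner_1", "partner_2", "partner_3"}
mutants = {
    "variant_a": {"partner_1", "partner_2", "partner_3"},
    "variant_b": {"partner_1"},
    "variant_c": set(),
}

labels = {name: classify_profile(wild_type, partners)
          for name, partners in mutants.items()}
summary = Counter(labels.values())  # number of variants per class
```

Tabulating the classes across many variants gives the kind of breakdown reported in the text, for example the share of edgetic variants among all tested mutations.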
Current challenges in edgotyping
While this approach holds significant potential for addressing the
issue of uncharacterized variants, it is still too expensive and
laborious if it is based entirely on experiments, given the number of
VUS that need to be characterized. Current tools like PolyPhen-2 and
MutPred2 predict variant pathogenicity primarily from metrics such as
conservation scores or sequence-based features related to protein
structure and function. They fail to capture the effect of mutations
occurring in less conserved yet functional regions, or of rare and
ultra-rare variants with low conservation scores (Sunyaev et al. 1999;
Adzhubei et al. 2010; Livesey et al. 2022). The recently developed
AlphaMissense tool excels in performance but is also less effective for
variants in these regions (Cheng et al. 2023). Given these
limitations, comparing the edgetic profiles of benign and pathogenic
variants with those of VUS might not always be sufficient to identify
functional variants potentially contributing to the disease. How can
PPI profiling be improved for more effective variant characterization?
To predict the variant effect on PPIs, one ideally needs to know the
exact residues that constitute the protein interaction interfaces (see
sections 1.2 and 1.3). Access to this information would be extremely
useful, as it helps pinpoint exactly where and how a mutation might
disrupt an interaction and elucidate the mechanistic effect and
potential impact of a variant on disease development. This assumption
is supported by a study that reported a significant enrichment of
disease mutations on PPI interfaces (Wang et al. 2012). Although
protein interaction datasets are available, they carry only binary
information, while information on PPI interfaces is currently missing.
Various experimental approaches such as X-ray crystallography, nuclear
magnetic resonance (NMR) spectroscopy, cryo-electron microscopy
(cryo-EM) and protein fragmentation exist to detect PPI interfaces at
different resolutions (Martino et al. 2021). However, experimental
methods are labor-intensive and time-consuming. Indeed, only a small
fraction of interactions, about 4 % in the HuRI dataset, have solved
structures (Luck et al. 2020). Given the limitations of experimental
studies, computational methods to predict PPI interfaces have gained
traction in recent years. The idea is to increase the predictive power
and to annotate PPI data with putative PPI interfaces, which in turn
accelerates the experimental validation of these interfaces (see
section 1.4). Finally, this information will be used for the PPI
profiling described earlier.
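Once interface residues are known or predicted, checking whether a missense variant falls on an interface is straightforward. The sketch below parses variants written with three-letter amino acid codes (such as Glu326Lys) and tests the position against a set of interface residue positions. The function names, the example variants and the interface positions are hypothetical and do not describe any real protein.

```python
import re

# Three-letter missense notation with an optional "p." prefix, e.g. "p.Arg42Gly"
_MISSENSE = re.compile(r"^(?:p\.)?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def variant_position(variant: str) -> int:
    """Extract the residue position from a three-letter missense variant."""
    match = _MISSENSE.match(variant)
    if match is None:
        raise ValueError(f"not a three-letter missense variant: {variant!r}")
    return int(match.group(2))


def hits_interface(variant: str, interface_residues: set) -> bool:
    """True if the mutated residue is part of the given interface."""
    return variant_position(variant) in interface_residues


# Hypothetical interface residue positions for a hypothetical protein
interface = {40, 41, 42, 45}

on_interface = hits_interface("Arg42Gly", interface)
off_interface = hits_interface("p.Leu100Pro", interface)
```

Variants located on an interface are natural candidates for disrupting the corresponding interaction, although residues outside the interface can still affect binding indirectly, for example by destabilizing the fold.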
1.2 Modular architecture of proteins
The prediction of PPI interfaces requires an understanding of protein
architecture to identify functional sites, along with databases of known
functional sites to enhance the accuracy of these predictions. Both are
discussed in this section.
Proteins are complex molecules that play a crucial role in cellular
biological processes. Since the advent of molecular biology, we have
learnt that proteins do not function in isolation, but in constant
interaction with one another or with other molecules (e.g. RNA, DNA),
forming complex networks. These interactions are mediated by PPI
interfaces formed by specific regions within protein sequences, widely
known as functional modules (Campbell et al. 1991). These modules can
broadly be categorized into structured and unstructured regions. The
structured regions, commonly known as globular domains, are regions in
protein sequences that often fold independently into a stable tertiary
structure (Copley et al. 2002; Björklund et al. 2005). Regions
that lack a defined structure are termed intrinsically disordered
regions (IDRs), where short linear motifs (SLiMs) are typically found
(Tompa et al. 2014; Davey et al. 2011; Davey et al. 2012). These
two types of functional modules are explained further in the following
sections.
The modularity of proteins is a crucial aspect of protein evolution
and functionality (C. Vogel et al. Year; Han et al. 2004). This
modularity allows different modules to be combined into proteins with
multiple properties and functions, facilitating the emergence of new
traits and adaptation to environmental changes (Apic et al. 2001). Around
65-70 % of the proteins in eukaryotic proteomes are composed of multiple
modules (Han et al. 2004). One prominent example is the
well-characterized nuclear factor NF-kappa-B p105 subunit (NFKB1), a
multifunctional hub protein and transcription factor involved in various
cellular processes, such as transcriptional regulation, immune response,
cell proliferation and survival (Gilmore 2006; Hayden et al. 2008).
It has been implicated in a broad range of cancers,
neurodegenerative diseases, and inflammatory and autoimmune diseases
(Gilmore 2006; Hayden et al. 2008; Taniguchi et al. 2018). The
968-amino-acid protein starts at its N-terminus with the Rel homology
domain (RHD), followed by ankyrin repeats and a death domain (DD)
at the C-terminus (Williams et al. 2001; Glover 2004; J. Wang
et al. 2023). The disordered parts of the protein harbor many known
motifs such as nuclear export and nuclear localization signals, docking
motifs and kinase modification motifs (Koonin 1996; Chen et al. 1996;
Rodríguez et al. 2000; Hsu 2007). Thus, the modularity of NFKB1
enables it to interact with many different partners and to function in
various cellular processes, and it exemplifies the complexity of the
phenotypes that can arise from the interplay of functional modules
(Figure 1.5).
1.2.1 Folded domains
Biological role of domains
The foundational understanding of protein domains began with the work
of structural biologists Linus Pauling and Robert Corey in the 1950s.
Their research identified alpha helices and beta sheets as secondary
structures within proteins. Wetlaufer and Ristow (1973) introduced the
concept of protein domains or functional modules in their review of X-ray
crystallography studies of proteins such as lysozyme and immunoglobulins
(Blake et al. 1965; Freedman et al. 1966). They associated domains
with regions, typically 50 to 350 amino acids in length, that are
capable of folding autonomously. This understanding was facilitated
by the development of experimental methods like crystallography
and NMR, which accelerated the identification and classification of these
protein modules, including domains commonly found across the proteome
such as WD40, SH2, SH3, ANK, RING, PH and PDZ (Copley et al. 2002).
Figure 1.5: Modularity of Nuclear factor NF-kappa-B p105 subunit.
Domains and motifs as functional modules schematically illustrated in
NFKB1. The top panel shows the modularity of NFKB1. The numbers above
and below the boxes denote the boundaries of domains. The bottom panel
displays the full-length structure of NFKB1 predicted by AlphaFold.
Domains and motifs in the structure are colored according to their
colors in the top panel. RHD stands for Rel Homology Domain, Ank stands
for Ankyrin repeats, and DD stands for death domain.
Domains fold into three-dimensional (3D) structures to achieve
thermodynamic stability, positioning hydrophobic residues in the protein
core while exposing hydrophilic residues on the surface (Dill et
al. 2012). This ensures that the native conformation is the most
energetically stable one for the domain. Domains form the functional units
of a protein, enabling it to interact with partners and perform cellular
functions. For instance, the protein KLHL41 is involved in the ubiquitin-
proteasome system, which regulates protein turnover and degradation
and thereby supports biological processes like muscle development and
function (Ramirez-Martinez et al. 2017; Yuen et al. 2020).
KLHL41 consists of three main domains: the Broad-complex, Tramtrack,
Bric-a-brac (BTB) domain, the BTB and C-terminal Kelch (BACK) domain,
and the Kelch repeats (Figure 1.6). The Kelch repeats of KLHL41 form
a beta-propeller structure that recognizes substrates such as nebulin
(NEB), a giant muscle protein that acts as a molecular ruler for filament
length and regulates actin-myosin cross-bridge cycling during skeletal
muscle contraction (Yuen et al. 2020). Upon binding, KLHL41 forms
a complex with NEB, while the BTB domain of KLHL41 directly binds
cullin 3 (CUL3), a scaffold protein in the Cullin-RING ubiquitin
ligase (CRL) complex, and can dimerize with itself to provide more
stability to the complex. Additionally, the BACK domain, located between
the BTB domain and the Kelch repeats, supports and stabilizes the
complex (Stogios et al. 2004; Dhanoa et al. 2013; Gupta et al. 2014).
Once the complex is formed, ubiquitin molecules are transferred to the
substrate, marking it for degradation by the proteasomal system. This
case illustrates how linking different domains together in one
polypeptide chain allows KLHL41 to maintain protein homeostasis.
Figure 1.6: Domain architecture of Kelch-like protein 41 (KLHL41).
The top panel shows the schematic domain organization of KLHL41 (not drawn to
scale). The numbers above and below the boxes denote the boundaries of domains.
The bottom panel shows the putative full-length structure of KLHL41 as predicted
by AlphaFold. Domains and motifs in the structure are colored according to their
colors in the top panel. BTB stands for the Broad-complex, Tramtrack,
Bric-a-brac domain and BACK for the BTB and C-terminal Kelch domain.
While some domains can achieve stability independently or through
dimerization, others require assistance from zinc or other metal ions or
from disulfide bridges. For instance, the zinc finger domain maintains
its conformation by binding zinc ions. These zinc ions typically interact
with cysteine and histidine residues, acting as an anchor that reduces
the flexibility of the protein chain and supports the stable 3D structure
(Berg et al. 1997; Klug 2010). This stabilization is important for
protein functions such as DNA binding and the regulation of gene
expression. A good example is the IKZF1 protein, also known as Ikaros, a
zinc finger protein and transcription regulator that plays a crucial role
in lymphocyte differentiation and function. It contains four C2H2-type
zinc finger domains at the N-terminus that bind zinc ions. The stabilized
structure of the protein interacts with DNA sequences in the promoter
regions of target genes and regulates their transcription. Different
combinations of protein domains exist widely across proteomes due to
natural selection acting on these modular units to create diverse
molecular machinery (Doolittle 1995).
Gene duplication and shuffling by recombination are likely to be the
driving forces of protein evolution and of proteome complexity.
While gene duplication leads to the emergence of similar domains
in otherwise unrelated proteins, recombination enhances versatility and
allows proteins to specialize in specific cellular functions tailored to an
organism’s needs (Bagowski et al. 2010). For example, the PDZ domain
is a structurally conserved module of 90-100 residues, found in a
vast array of proteins involved in diverse signaling pathways and cell
polarity (Harris et al. 2001; Lee 2010). About 270 PDZ domains
are distributed over 150 proteins (Wang et al. 2010; Velthuis et al.
2011). Despite their conserved structural fold, these domains exhibit
sequence divergence that contributes to functional specificity. For
instance, the PDZ domains of the protein PSD-95 recognize C-terminal
motifs on its target proteins, whereas the PDZ domain of the cell polarity
protein PAR6 was shown to interact with internal ligands, and other PDZ
domains can form homodimers (Kornau et al. 1995; Zhang et al. 2009;
Fouassier et al. 2000).
Protein domain databases
Sequencing projects made a major contribution to the discovery of
protein domains by helping to identify regions that are conserved
across different proteins. With the power of bioinformatics,
domains became identifiable using hidden Markov models (HMMs).
An HMM is a statistical model used to classify protein families based on
multiple sequence alignments (MSAs) and to detect sequence homology for
the identification of conserved regions within proteins (Bystroff et al.
2008). Thus, HMMs became the main approach to collecting data
and generating domain databases.
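The idea of capturing conserved positions from an alignment can be
illustrated with a minimal sketch. It builds a position-specific
log-odds profile from a toy alignment and slides it along a sequence.
This is a deliberately simplified stand-in for a profile HMM (match-state
emissions only, no insert or delete states) and not the procedure used
by Pfam or SMART. The four-residue toy alignment, the uniform background,
the pseudocount and all function names are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log2-odds scores against a uniform background (gaps are ignored)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile: List[Dict[str, float]], segment: str) -> float:
    """Sum of column scores for a segment of the same length as the profile."""
    if len(segment) != len(profile):
        raise ValueError("segment and profile must have the same length")
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_window(profile: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Best-scoring window as (score, 0-based start); standard residues only."""
    width = len(profile)
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    # Toy alignment (hypothetical), loosely inspired by a conserved loop.
    toy_profile = build_profile(["GLGF", "GLGF", "GYGF", "GIGF"])
    print(best_window(toy_profile, "AAAGLGFAAA"))
```

Real profile HMMs additionally model insertions and deletions and use
sequence weighting and calibrated background frequencies, which this
sketch leaves out.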
For example, the Protein families database (Pfam) and the Simple Modular
Architecture Research Tool (SMART) use HMMs to build
protein and domain families based on sequence similarity (R. D.
Finn et al. 2014; Schultz et al. 1998; Letunic et al. 2021).
While the SMART database relies on manually curated sequence alignments,
which help to define domain boundaries more precisely, Pfam employs
an automated approach and covers a broader range of domains
(Paysan-Lafosse et al. 2023).
1.2.2 Intrinsically disordered regions
Biological roles
In the mid-20th century, protein research primarily focused on folded
and ordered proteins. Studies on enzymes, in which denatured proteins
lose their catalytic activity, demonstrated the relationship between
protein structure and function (Northrop 1930). It was assumed that a
protein requires a native folded structure to perform biological
functions. Thus, the protein structure-function paradigm was established,
while the abundance and functional role of disordered regions in
eukaryotic proteins remained unrecognized. However, unexpected behavior
of proteins, such as missing electron density in X-ray crystallography
studies, increased sensitivity in in vitro proteolysis experiments and
solubility issues during protein purification, led to a reassessment of
the structure-function paradigm. Pioneering work by Dunker (2002) and
Uversky (2005) revealed that disordered regions are common in eukaryotic
proteins. Further studies challenged the long-standing belief that
protein functionality is strictly dependent on a well-defined, folded
protein structure (Tompa 2002; Dunker et al. 2002; Wright 1999;
Iakoucheva et al. 2002; Uversky et al. 2005; Uversky 2014).
It was shown that disordered regions of many regulatory and signaling
proteins can undergo disorder-to-order transitions upon binding to their
targets, which adds a layer of regulatory control and allows for complex
interactions (Uversky 2014). As a result of these findings, the scientific
community began to recognize the importance of protein disorder, leading
to a significant shift in the understanding of protein biology.
Consequently, the field moved towards the "disorder-function paradigm".
Intrinsically disordered regions (IDRs) lack a persistent 3D structure
under physiological conditions, continuously adopting a wide range
of dynamic conformations and forming transient secondary structures
(Wright 1999; Tompa 2011; Davey et al. 2019). These regions are
abundant in eukaryotic proteins, with predictions indicating that they
cover 30-40 % of the residues in eukaryotic proteomes (Tompa 2012;
Van Roey et al. 2012). IDRs also significantly contribute to the
diversity and versatility observed in organism evolution (Davey et al.
2015; Babu et al. 2012; Weatheritt et al. 2012). In addition, they are
often found to overlap with post-translational modification (PTM) sites,
contributing to functional versatility (Tompa 2012; Tompa et al. 2014).
These modifications can alter the conformation, stability and
interactions mediated by IDRs (Uversky 2014). Due to their dynamic
behavior, IDRs are commonly involved in transient interactions regulating
signal transduction processes (Dyson 2005; Davey et al. 2019). A crucial
finding was that IDRs are enriched in functional interaction modules,
such as short linear motifs (SLiMs) mediating different multivalent
interactions, which will be discussed in the next section.
As IDRs play a significant role in signaling and cell regulation, they
are tightly controlled, and mutations in disordered regions have been
associated with human diseases, including cancer, diabetes, and
cardiovascular and neurodegenerative disorders (Iakoucheva et al. 2002;
Babu et al. 2011). Vacic et al. (2012) investigated disease-causing
missense mutations in ordered and disordered regions and compared them to
neutral variants observed in healthy individuals.
They found that over 20 % of pathogenic variants reside
in intrinsically disordered regions and interfere with their functions. In
addition, the study by Peng et al. (2012) emphasized the importance
of understanding the context-dependent behavior of IDRs. They
highlighted that the functional outcome of missense variants in these
regions can vary depending on the cellular environment and interaction
partners (Peng et al. 2012).
Despite the biological relevance of IDRs, only a small fraction of IDRs
have been characterized (M. Gouw et al. 2017; Davey et al. 2019).
Experimentally, defining disordered regions remains challenging. Due to
the dynamic structures of IDRs, the use of sophisticated methods such
as NMR, small-angle X-ray scattering (SAXS), circular dichroism (CD)
or Förster resonance energy transfer (FRET) is required (Felli et al.
2015; Holmstrom et al. 2016). Moreover, these regions function
in a context-dependent manner based on the cellular milieu including
pH, PTMs, and the presence of other proteins (Oldfield et al. 2014;
Wright 2015). These challenges necessitate integrative approaches
that combine experimental data with computational predictions. As
a result, various computational approaches have been developed to
predict IDRs in proteins, leading to the generation of several databases
containing putative and experimentally verified IDRs.
Databases of disordered proteins and tools to predict disorder
DisProt is a comprehensive, manually curated repository of experimentally
verified proteins or regions within proteins that lack a stable
three-dimensional structure under physiological conditions. The DisProt
database annotates the disorder and
molecular functions curated from experimental studies. More than 2,000
eukaryotic intrinsically disordered proteins (IDPs) and 6,000 IDRs are
documented in this database.
IDRs possess distinctive characteristics that set them apart from
structured regions. One notable characteristic is an enrichment in polar,
hydrophilic residues coupled with a depletion of hydrophobic amino
acids, which helps IDRs stay soluble and flexible in the disordered state
but leaves them incapable of forming sufficient inter-residue
interactions to fold. To discriminate between ordered and disordered
sequences, the Intrinsically Unstructured Protein Predictor tool (IUPred)
estimates the likelihood of interaction formation using a
statistical interaction potential (Z. Dosztányi 2018). This potential
is used to estimate an energy for each residue in the protein sequence.
Residues estimated to have the most favorable energies are predicted to
be ordered, while those with
unfavorable energies are predicted to be disordered (Mészáros et al.
2009).
Recently, a powerful new tool, AlphaFold2 (AF2), has emerged that
predicts protein structures with an accuracy comparable to that of
experimental structures (Jumper et al. 2021). AF2 predicts a full-length
protein structure together with a confidence score termed the predicted
Local Distance Difference Test (pLDDT). The pLDDT score is calculated for
each residue in the protein structure and ranges from 0 to 100. A high
score means greater confidence in the accuracy of the predicted position
of a residue. Interestingly, low pLDDT was found to correlate with the
tendency for disorder, so that pLDDT can be used as a feature to predict
disorder (Wilson et al. 2022). Another feature, the solvent-accessible
surface area (SASA) of each residue, also correlates with the disorder
propensity of residues. One study combined pLDDT and SASA, smoothed
over a 20-residue window, and outperformed IUPred2A, the version of the
IUPred predictor evaluated in that study (Akdel et al. 2022).
Other applications of AF are described in
section 1.4.
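The smoothing idea can be illustrated with a minimal sketch: a
per-residue score such as pLDDT is averaged over a sliding window, and
residues whose smoothed value falls below a cutoff are flagged as
putatively disordered. The cutoff of 50, the truncation of the window at
the sequence ends and the function names are assumptions made for
illustration. SASA is omitted, and this is not the implementation of
Akdel et al. (2022).

```python
from typing import List


def smooth(values: List[float], window: int = 20) -> List[float]:
    """Moving average over `window` residues, truncated at the sequence ends."""
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = i - half
        end = start + window          # exclusive upper bound of the window
        lo = max(0, start)
        hi = min(len(values), end)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def predict_disorder(plddt: List[float], window: int = 20,
                     cutoff: float = 50.0) -> List[bool]:
    """Flag residues whose smoothed pLDDT is below `cutoff` (hypothetical value)."""
    return [value < cutoff for value in smooth(plddt, window)]


if __name__ == "__main__":
    # Hypothetical per-residue pLDDT values: a confident domain followed
    # by a low-confidence tail.
    toy_plddt = [90.0] * 30 + [30.0] * 30
    flags = predict_disorder(toy_plddt)
    print(sum(flags), "of", len(flags), "residues flagged as disordered")
```

Smoothing avoids flagging isolated low-confidence residues inside an
otherwise well-predicted domain, at the cost of blurring the exact
boundary between ordered and disordered segments.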
1.2.3 Short linear motifs
Biological roles
Short linear motifs (SLiMs) are short functional sequence elements,
ranging from 3 to 23 amino acids in length. On average, four residues
are conserved in the motif consensus sequence, while the remaining
positions are variable (Davey et al. 2012). Motifs typically lie in IDRs
or, more rarely, in disordered loops of structured regions and carry
regulatory functions such as directing ligand binding, providing docking
sites for enzymes and targeting proteins to specific subcellular locations
(Davey et al. 2012; Van Roey et al. 2014).
The concept of SLiMs appeared in the late 20th century. In 1980
Aaron Ciechanover, Avram Hershko and Irwin Rose identified degrada-
tion motifs or degrons that direct the target proteins to the ubiquitin-
proteasome system for degradation. Their groundbreaking work earned
them a Nobel Prize in 2004 and laid the foundation for discovering new
motifs. In 1990, Tim Hunt identified targeting signals such as the KDEL
endoplasmic reticulum retention motif and positively charged nuclear
targeting sequences, while Pawson et al. (1986) discovered that Src
homology domains recognize motifs within partner proteins and that these
interactions regulate signaling pathways. These studies highlighted the
importance of motifs in protein function and regulation, opening avenues
for further exploration and discovery in molecular biology and cellular
physiology.
SLiMs have been discovered and validated with various experimental
methods, from traditional low-throughput X-ray crystallography
and NMR to high-throughput systematic approaches such
as peptide microarrays and phage display. Along with experimental
discoveries, computational approaches have also significantly advanced
motif research. The motif detection techniques will be discussed in more
detail in sections 1.4 and 1.5.
It is estimated that more than 100,000 binding motifs exist in the human
proteome, many of them uncharacterized (Tompa et al. 2014).
The discovered motifs are categorized into six functional types based on
their biological roles: ligand-binding sites, modification sites,
targeting signals, degrons, docking sites and cleavage sites.
Modification motifs include PTM sites such as phosphorylation sites.
Targeting signals such as nuclear localization signals
(NLS) are involved in protein trafficking to specific cellular
compartments. Ligand-binding motifs interact with binding partners to form
transient signaling complexes. Docking motifs facilitate substrate
recognition by enzymes at a site distinct from the active site of these
enzymes. Cleavage motifs are recognized by proteases that cleave the
protein at the cleavage site (Van Roey et al. 2014). Degrons, such as the
one found in the protein AFF4 (Figure 1.7), are important for the
regulation of protein levels. Specific ubiquitin
ligases like SIAH1 recognize these motifs and tag the target proteins
with ubiquitin molecules. This tagging process, known as ubiquitination,
marks the protein for degradation by the proteasome system (Oliver et
al. 2004).
Figure 1.7: Degron motif on AFF4.
The top panel shows the schematic domain organization of AFF4 (not drawn to
scale). The numbers above and below the boxes denote the boundaries of domains.
AFF4 contains a degron motif that is recognized by ubiquitin ligase. The bottom
panel shows the putative full-length structure of AFF4 as predicted by AlphaFold
2. Domains and motifs in the structure are colored according to their colors in the
top panel. CHD stands for C-terminal homology domain.
Moreover, SLiMs mediate transient regulatory and signaling interactions
involved in biological processes like cell signaling, protein
homeostasis and the cell cycle. For instance, the 14-3-3 binding motif
facilitates the interaction of diverse proteins with 14-3-3 domains,
thereby regulating the subcellular localization and activity of these
proteins. Another example is the SH3-binding motif (PXXP), found in
numerous signaling proteins, which mediates interactions with SH3 domains
of other proteins, facilitating the assembly of signaling complexes in
response to extracellular stimuli (Davey et al. 2012; Van Roey et al.
2014).
Additionally, SLiM mimicry can be used by viruses to interfere with
the host cellular machinery and thereby repurpose the host cell for
pathogen reproduction (Davey et al. 2011; Uyar et al. 2014). For
example, the Nsp3 protein of Eastern equine encephalitis virus (EEEV)
contains the motif LITFD, which mimics the classical clathrin box motif.
This mimicry allows Nsp3 to interact with the beta-propeller repeat of
the N-terminal domain of clathrin (CLTC). This interaction disrupts
clathrin-mediated receptor trafficking and interferes with signaling
processes, potentially suppressing antiviral signaling or altering cellular
functions to create a more favorable environment for viral replication
(Mihalič et al. 2023).
As opposed to globular domains, SLiMs are short functional peptides
that take up very little sequence space. Consequently, IDRs can be
densely packed with multiple SLiMs, which can sometimes overlap and
act as regulatory switches. There are different switch mechanisms. One
of them switches the specificity of a motif for its binding partners
through modification-dependent modulation of the intrinsic affinity
of the motif. For example, integrin beta 3 is a cell surface receptor
involved in cell adhesion and cell signaling (Tadokoro et al. 2003).
The NPxY motif in the disordered tail of the integrin beta 3 subunit
preferentially interacts with the PTB domain of talin, which additionally
engages the membrane-proximal region of the integrin tail, an interaction
necessary for integrin activation (Wegener et al. 2007). However,
phosphorylation of the motif at position Tyr747 switches the specificity
to the PTB domain of Dok1. Dok1 binds exclusively to the central motif
and does not interact with the membrane-proximal region of the integrin
tail necessary for activation. This mechanism therefore provides control
over integrin-mediated cellular processes (Oxley et al. 2008).
Computational and experimental studies have shown that pathogenic
mutations in disordered regions often affect SLiMs. Uyar and colleagues
(2014) performed a proteome-wide analysis of disease-associated
mutations with a focus on SLiMs. They used mutation data from patients
and healthy individuals reported in databases such as the Catalog of
Somatic Mutations In Cancer (COSMIC), Online Mendelian Inheritance in
Man (OMIM) and the 1000 Genomes Project (Consortium et al. 2015;
Forbes et al. 2011; Hamosh et al. 2005). Next, they mapped these
mutations onto experimentally derived SLiMs and onto putative SLiMs
identified with the help of the IUPred tool and compared the distribution
of pathogenic and neutral mutations. The analysis revealed that
disease-related mutations are significantly enriched in SLiMs within
intrinsically disordered regions (Uyar et al. 2014). Additionally,
mutations within SLiMs can disrupt motifs or create new ones. One study
showed experimentally that pathogenic mutations created dileucine motifs
that often led to clathrin binding, which underlies disease aetiology
(Meyer et al. 2018).
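A comparison of pathogenic and neutral mutations inside and outside
motifs, as described above, amounts to testing a 2x2 contingency table
for enrichment. The sketch below shows one common way to do this, a
one-sided Fisher's exact test computed from the hypergeometric
distribution. It is not necessarily the test used by Uyar et al. (2014),
and the counts in the usage example are hypothetical placeholders, not
results from any study.

```python
from math import comb


def enrichment_p_value(path_in: int, path_out: int,
                       neutral_in: int, neutral_out: int) -> float:
    """One-sided Fisher's exact test.

    Returns the probability of observing at least `path_in` pathogenic
    mutations inside motifs if pathogenic and neutral mutations were
    distributed identically.
    """
    inside = path_in + neutral_in            # all mutations inside motifs
    pathogenic = path_in + path_out          # all pathogenic mutations
    total = inside + path_out + neutral_out  # all mutations
    upper = min(inside, pathogenic)
    favourable = sum(
        comb(inside, k) * comb(total - inside, pathogenic - k)
        for k in range(path_in, upper + 1)
    )
    return favourable / comb(total, pathogenic)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    p = enrichment_p_value(path_in=30, path_out=70, neutral_in=10, neutral_out=90)
    print(f"one-sided p-value: {p:.3g}")
```

Exact integer arithmetic via `math.comb` keeps the sketch dependency-free;
for large tables a statistics library would be the more practical choice.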
This accumulated evidence highlights the importance of SLiMs as
a key aid to understanding the molecular mechanisms in diseases and
underscores the need to integrate SLiM analysis into variant character-
ization studies.
Motif databases
While Pfam and SMART are valuable for predicting domain-involving
interfaces, motif databases can help identify potential SLiM-mediated
interfaces. The Eukaryotic Linear Motif (ELM) resource is a comprehensive
database developed by Toby Gibson and colleagues in the early 2000s.
The ELM database provides researchers with a catalog of manually
curated, experimentally validated SLiMs and with tools for motif
prediction, with the main focus on the annotation and detection of SLiMs
(Puntervoll et al. 2003). Each record provides extensive information
on the motif sequence pattern, functional role, interaction partners,
the biological processes it influences and the experimental evidence.
Additionally, the database has a search interface that allows users to
query motifs by sequence pattern, protein identifier and species.
The ELM database categorizes motifs into functional types, classes
and instances. There are six functional types of SLiMs: ligand-binding
(LIG, e.g. WW1 binding motif), modification (MOD, e.g. CK1
phosphorylation site), targeting (TRG, e.g. NLS classical nuclear
localization signal), docking (DOC, e.g. USP7-binding motif), degradation
(DEG, e.g. Siah binding motif) and cleavage (CLV, e.g. NRD cleavage
site). These types are subdivided into 356 ELM classes based on the
binding domain of the partner, specific sequence characteristics, targeted
subcellular localization and other functional properties (Kumar et al.
2024). These classes comprise 4283 individual ELM instances manually
curated from 4274 scientific publications, along with 2749 motif-partner
interactions (Kumar et al. 2024). Each instance is annotated with
details on the evidence, such as the experimental method used to determine
and characterize the discovered motif (M. Gouw et al. 2017). ELM
curators systematically describe each ELM class using a regular
expression (RegEx) that defines the key residues important for the binding
affinity and specificity of the motif (Davey et al. 2011). These regular
expressions also capture the conservation pattern of different motif
types and, therefore, can be used in the prediction of putative motifs.
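Because ELM classes are written as regular expressions, scanning a
sequence for putative motif instances reduces to a pattern search. The
sketch below is a minimal illustration using Python's `re` module. The
pattern `P.A.V.P` is a literal transcription of the PxAxVxP degron
consensus quoted in section 1.3 rather than the curated ELM expression,
and the peptide sequence and function names are hypothetical.

```python
import re
from typing import List, Tuple

Hit = Tuple[int, int, str]  # (start, end, matched sequence), 1-based inclusive


def find_motifs(sequence: str, pattern: str) -> List[Hit]:
    """Return all matches of `pattern`, including overlapping ones."""
    # Wrapping the pattern in a lookahead lets the scan restart at every
    # position, so overlapping motif instances are not skipped.
    regex = re.compile(f"(?=({pattern}))")
    hits = []
    for match in regex.finditer(sequence):
        start = match.start(1)
        matched = match.group(1)
        hits.append((start + 1, start + len(matched), matched))
    return hits


def variant_in_motif(position: int, hits: List[Hit]) -> bool:
    """True if a 1-based residue position lies inside any motif match."""
    return any(start <= position <= end for start, end, _ in hits)


if __name__ == "__main__":
    toy_sequence = "MKTPLAMVSPGGS"           # hypothetical peptide
    hits = find_motifs(toy_sequence, "P.A.V.P")
    print(hits)
    print(variant_in_motif(7, hits), variant_in_motif(2, hits))
```

A match only indicates that the sequence fits the pattern. Whether it is
a functional motif additionally depends on context such as disorder,
accessibility and conservation, which a plain pattern search ignores.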
1.3 Domain-motif interfaces
Current understanding of protein-protein interaction interfaces
Protein interaction interfaces are formed through the interaction of
protein modules, mainly globular domains and motifs. For example, the
binding between two globular domains is termed a domain-domain interface
(DDI). DDIs involve multiple contacts and are characterized by a
high binding affinity, which contributes to the stability of protein
interactions (Nooren 2003). DDIs aid in stabilizing protein complexes
and are often involved in enzymatic activity, cell
signaling, cell adhesion and other cellular events.
Later, researchers found that in addition to DDIs, protein domains
can recognize SLiMs, forming a domain-motif interface or DMI (Dyson
2005; Babu et al. 2012; Tompa 2012; Davey et al. 2012). DMI-mediated
interactions are weaker and more transient, playing a role in
major biological processes such as signal transduction, protein
targeting to cellular compartments and protein homeostasis (Schreiber et
al. 2009; Zhou 2012). Maintaining these interactions is therefore
crucial, as their disruption can potentially lead to disease
(Arimura et al. 2000; Uyar et al. 2014). Despite the importance
of DMIs, they are significantly underrepresented. Due to the transient
nature of these interactions, they are hard to detect using the
traditional experimental approaches described in section 1.5. Tompa et
al. (2014) estimated the number of motifs to be in the hundreds of
thousands or even millions. Therefore, the last two decades have seen a
tremendous rise of interest in SLiM-mediated PPIs in different research
fields like structural biology, systems biology and bioinformatics.
In my thesis, I will focus only on the systematic prediction of DMIs
followed by experimental validation, and I will use this information in
comparative PPI profiling as a strategy for efficient variant
characterization.
Functional significance of Domain-Motif interfaces in cellular
processes and disease
In this section, I will describe several examples highlighting the
functional role of DMIs in biological processes and their implications
for disease.
A notable example is the degron motif with the pattern PxAxVxP
(where x represents any amino acid), which is found in the target protein
AFF4. This protein plays a critical role in transcription regulation and
chromatin remodeling and is a core component of the super elongation
complex (SEC). The SEC facilitates the efficient synthesis of mRNA
transcripts by RNA polymerase II (RNAPII) during transcription elongation
(Lin et al. 2010; C. Luo L. et al. 2012). AFF4 also helps
to recruit RNAPII to gene promoters and to overcome transcriptional
pausing. This activity is crucial for ensuring proper gene expression
profiles and supporting cellular function (Lin et al. 2010). The degron
motif of AFF4 is recognized by the substrate-binding domain (SBD) of
the E3 ubiquitin ligase SIAH1.
SIAH1 is the central component of a multiprotein E3 ubiquitin ligase
complex and is essential for the regulation of protein levels within the
cell. It has been implicated in the regulation of programmed cell death.
In some studies, it has been identified as a tumor suppressor, as it can
mediate the degradation of oncogenic proteins, which helps to prevent
tumor formation and progression. The recognition of AFF4 by SIAH1 has
been previously functionally annotated (Oliver et al. 2004). Upon
binding, the motif forms a beta strand parallel to the beta-sandwich fold
of the SBD of SIAH1, a binding mode known as beta augmentation. When the
SBD of SIAH1 contacts the
degron of AFF4, it facilitates the ubiquitination of AFF4. The
tagged protein is then degraded by the proteasome complex (Figure 1.8).
This biological process is important for maintaining homeostasis in the
cell by removing damaged and misfolded proteins and regulating protein
levels within the cell (Santelli et al. 2005).
While the mechanism of the interface between these proteins has been
annotated, the exact mechanism underlying the development of the
associated disorders is poorly understood, and many mutations found at
these interfaces remain uncharacterized. One example is the Met260Thr
variant, in which methionine is mutated to threonine within the motif of
the previously mentioned AFF4. The mutation was found in a patient with
a rare NDD called CHOPS syndrome and is reported in ClinVar as a VUS.
However, the diagnosis of CHOPS syndrome caused by this rare mutation
is complicated. The limited number of documented cases makes
establishing diagnostic criteria and developing personalized treatment
difficult. Using our approach, we find that the mutation lies within
the motif of AFF4 and might perturb the interaction with SIAH1. The
disruption of this interaction may lead to the stabilization and
accumulation of AFF4 and cause the developmental abnormalities that
characterize the disease.
Figure 1.8: The mechanism of interaction between SIAH1 and its target
partner AFF4.
The substrate-binding domain (SBD) binds to the degron motif on AFF4, which leads to
the ubiquitination of AFF4. The tagged protein is further degraded by the proteasomal
system (Oliver et al., 2004). CHD stands for C-terminal homology domain. The
structure of the interface is shown as predicted by AF2.
Another example of a domain-motif mediated interaction is the
interaction between 14-3-3 domain proteins and phosphorylated ligand
motifs on the target proteins (Figure 1.9) (Grozinger et al. 2000;
M. J. Wang K. et al. 2000). YWHAG (14-3-3 protein gamma) is one
of the proteins possessing a 14-3-3 domain, which recognizes
phosphorylated serine residues within the RAQSSP, RTQSAP and RKTASEP
consensus motifs of histone deacetylase 4 (HDAC4). This interaction is
known, and motif binding to 14-3-3 proteins was first described in
1997 by Yaffe et al. YWHAG is an adaptor protein localized in the
cytoplasm. It belongs to the 14-3-3 protein family involved in signal
transduction, protein localization, cell apoptosis and cell cycle. This protein
plays a crucial role in signaling pathways by binding to the phosphory-
lated motifs of its interacting partners. One of these proteins is HDAC4,
a transcriptional regulator, which deacetylates lysines at the N-terminal
region of the core histones H2A, H2B, H3, and H4 in the nucleus.
Previous studies described the mechanism of interaction and regulation
of HDAC4 and HDAC5 by YWHAG. In the inactive state, phosphorylated
deacetylases are located in the cytoplasm, where they bind to
the 14-3-3 domain of YWHAG via three phosphorylated sites. These
interactions lead to the sequestration of HDAC4/5 to the cytoplasm
(Grozinger et al. 2000; M. J. Wang K. et al. 2000). This keeps
HDAC4 from entering the nucleus and repressing the transcription of
genes important for different functions like neuron development (M.-S.
Kim et al. 2012; Pennington et al. 2018). YWHAG is linked to
a type of developmental and epileptic encephalopathy that is character-
ized by neurodevelopmental impairment and the onset of seizures lead-
ing to delays in cognitive and motor development, whereas mutations
in HDAC4 are found in patients with neurodevelopmental disorder with
central hypotonia and dysmorphic facies (NEDSHF), brachydactyly and
intellectual disability. To illustrate how understanding the interaction
mechanism can be informative about the variant impact and the potential
cause of the disease, consider the Glu247Gly mutation within the
RKTASEP motif of HDAC4, which is associated with NEDSHF (Wakeling
et al. 2021). This mutation is documented as a pathogenic missense
variant in the ClinVar database. It is not reported in gnomAD and has
been determined to be a de novo mutation. In a functional study,
immunoprecipitation of HDAC4 carrying the Glu247Gly mutation
in HEK293 cells demonstrated reduced binding to another
14-3-3 protein, YWHAB (Wakeling et al. 2021). As the PPI interface
is the same as with the YWHAG protein, we can assume that this mutation
might also disrupt the interaction with other 14-3-3 domain proteins such as YWHAG.
By knowing the mechanism of interaction we can hypothesize that the
resulting reduced binding or loss of interaction with YWHAG may lead
to the impaired nuclear export of HDAC4, causing abnormal expression
of genes and contributing to the disorder.
Figure 1.9: Model of activity of HDAC4 through the interaction with
14-3-3 domain protein.
Upon phosphorylation of HDAC4, the phosphorylated ligand motif is recognized
by the 14-3-3 domain. This domain-motif interaction leads to the sequestration of
HDAC4 and HDAC5 to the cytoplasm, preventing them from downregulating gene
transcription (Grozinger et al., 2000).
1.4 Predicting the known occurrence of DMIs in
protein interactions using sequence-based
approaches
The most efficient way to characterize DMIs would involve sequence-
based analysis and structural modeling. This combined approach in-
cludes two steps: 1) using sequence-based predictions to identify poten-
tial contact residues between proteins, and 2) structural modeling to
visualize and pinpoint inter-atomic interactions at the interface. Fur-
thermore, the predicted structural model of the putative interface can
aid in the experimental validation by designing the mutations assumed
to perturb the binding between the interacting regions. I will discuss
this part in more detail in Section 2, Article II.
One way to predict DMIs is by identifying the instances of known
DMI types. Databases like ELM contain a catalog of high-quality DMI
types manually curated based on experimental evidence. As ELM employs
regular expression patterns for motifs (see section 1.2.3) and HMMs of
the corresponding binding domains, it can help find known occurrences
of similar domains and motifs in the protein interactome (Weatheritt
et al. 2012; Edwards et al. 2014; Gouw et al. 2018).
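As a minimal sketch of how a regular-expression motif definition can be scanned against a protein sequence, consider the following Python example. The pattern `R.{2,3}[ST].P` is a simplified, illustrative assumption chosen only because it matches the three HDAC4 motif strings mentioned earlier (RAQSSP, RTQSAP, RKTASEP); it is not the curated ELM expression, and the toy sequence is made up.

```python
import re

# Illustrative pattern only (an assumption, not the curated ELM regex):
# Arg, 2-3 arbitrary residues, Ser/Thr, any residue, Pro.
MOTIF_PATTERN = r"R.{2,3}[ST].P"


def find_motif_instances(sequence, pattern=MOTIF_PATTERN):
    """Return (start, end, match) tuples with 1-based inclusive coordinates.

    A lookahead is used so that candidate matches starting at different
    positions are all reported, even if they overlap.
    """
    hits = []
    for m in re.finditer(f"(?=({pattern}))", sequence):
        start = m.start(1) + 1  # convert 0-based index to 1-based residue number
        end = m.end(1)          # exclusive 0-based end equals inclusive 1-based end
        hits.append((start, end, m.group(1)))
    return hits


# Made-up toy sequence embedding the three motif strings from the text.
toy_sequence = "MGGRAQSSPLLKKRTQSAPGGDDRKTASEPAA"
hits = find_motif_instances(toy_sequence)
for start, end, motif in hits:
    print(f"{start}-{end}: {motif}")
```

A pattern match alone only marks a candidate site; resources such as ELM pair the pattern with curated evidence, which is why a match should be read as a hypothesis rather than a confirmed motif.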
The interactions of Eukaryotic Linear Motifs (iELM) web server
employs the annotated motifs from ELM and PPI data to identify
putative SLiM-mediated interactions in PPIs extracted from the STRING
database (Weatheritt et al. 2012). iELM first checks for domain-domain
interfaces using 3did, a DDI database (Mosca et al.
2014). If a DDI is found, the search stops. If no such interface is
found, it predicts motifs by employing ELM regular expressions
and aligning the sequence of the queried protein with its orthologs.
Predictions are scored using the SLiMSearch algorithm based on motif
conservation (Davey et al. 2011). Next, putative motifs and their
flanking regions are evaluated for intrinsic disorder propensity with the
IUPred tool (Dosztányi et al. 2005). Concurrently, motif-binding
domains are detected via HMMSearch, optionally using Pfam
HMMs (J. Finn M. et al. 2010). The E-value derived from the HMM
match and the conservation and disorder scores of identified motifs are used to
train a support vector machine to evaluate putative DMIs. If templates
for the putative DMIs are available, structural modeling is performed by
PepSite, which scores the biophysical feasibility of modeled DMIs
(Petsalaki et al. 2009). In benchmarking, iELM achieved a sensitivity of
84.8 % and a specificity of 86.5 % on its test set (Weatheritt et al.
2012).
Despite its good performance, iELM was evaluated on an
imbalanced dataset, in which negative data points outnumber
positive data points by almost 30-fold. Also, iELM halts the search
for potential DMIs if any domain-domain interface type is found. Since
DMIs and DDIs are not mutually exclusive and can act synergistically
in interactions, this approach may overlook potential DMIs. Moreover,
iELM builds HMMs tailored to specific motif-binding domains using
hand-curated sets of known sequences. This approach carries the risk of
overfitting, as HMMs can become too specialized for a narrow domain
data set. Additionally, iELM has not been updated and is no longer in use.
These limitations motivated my former colleague to develop a DMI
predictor tool, which I applied to obtain putative DMIs that I then
validated experimentally.
The workflow of the tool and its application will be covered in Chapter
3.
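To illustrate why the class imbalance matters, the following back-of-the-envelope sketch converts the reported sensitivity and specificity into an expected precision. The exact 30:1 ratio of negatives to positives is an assumption (the text says "almost 30-fold"), and the resulting number is my own illustration, not a figure reported by the iELM authors.

```python
def precision_from_rates(sensitivity, specificity, negatives_per_positive):
    """Expected precision (positive predictive value) per one positive example.

    true positives  = sensitivity * 1
    false positives = (1 - specificity) * negatives_per_positive
    """
    true_pos = sensitivity
    false_pos = (1.0 - specificity) * negatives_per_positive
    return true_pos / (true_pos + false_pos)


# Sensitivity and specificity as quoted in the text; the 30:1 ratio is assumed.
imbalanced = precision_from_rates(0.848, 0.865, negatives_per_positive=30)
balanced = precision_from_rates(0.848, 0.865, negatives_per_positive=1)
print(f"precision at 30:1 negatives:positives = {imbalanced:.2f}")
print(f"precision at 1:1  negatives:positives = {balanced:.2f}")
```

Under these assumptions, fewer than one in five positive predictions would be correct at a 30:1 ratio, even though sensitivity and specificity both look high.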
While DMI interface predictions can be made, they need to be validated
systematically by experiment. Below, I will discuss various large-scale
methods and suggest a suitable assay for the proposed strategy.
1.5 Systematic experimental validation of putative
DMIs
Today, various high-throughput methods for the systematic discovery of
PPIs have been developed.
Putative interfaces can be validated using PPI assays that quantify the
effects of mutations on PPIs. When mutations, designed to validate
predicted interfaces or identified in patients, reduce or eliminate
binding compared to the wild-type, it suggests that the interface is
involved in the interaction.
However, a mutation can also disrupt an interaction for other reasons,
such as partial misfolding or complete unfolding, leading to the
destabilization of the protein or its degradation.
Alternatively, it can cause mislocalization of the protein to another
subcellular compartment and/or further lead to protein degradation.
Therefore, it is essential to use a method that allows monitoring of
protein expression levels and provides a quantitative score indicating
the binding strength of interactions.
In this section, I describe different in vitro and cell-based methods ca-
pable of identifying PPIs, and potential assays suitable for experimental
validation of putative DMIs.
PPI methods are broadly classified into binary methods and co-complex
methods (Table 1, 1-2). For example, AP-MS is known for
its scalability in systematic interaction mapping (see section 1.1).
However, because the method is designed to detect protein associations
rather than direct PPIs, it would not be effective for interface
validation. Moreover, transient or weak interactions may be lost
during the lysis and washing steps.
On the other hand, in vitro methods like ITC, SPR, FP and
MST (see Table 1, 3-6) detect likely direct interactions and provide
real-time information on the binding affinity of these PPIs
(Ward2001; Stahelin2013; Pierce et al. 1999). While these methods
are quantitative and can assess the effect of mutations on interactions,
they require purified proteins, which can be time-consuming to produce,
and expensive equipment, making these assays less scalable. Due to
complications in the purification step, only potentially binding protein
fragments are used, making it unclear how the interaction occurs in a
full-length context. Additionally, since these assays operate outside
the native cellular context, the validity of domain-motif interactions
(DMIs) in cells remains uncertain.
Another method, cross-linking mass spectrometry (XL-MS), is performed
both in vitro and in cell-based systems (see Table 1, 7). It is valuable
for discovering new interfaces, as it captures contact residues in close
proximity and provides structural insights. However, it is less suited
for interface validation. For instance, DMI-driven PPIs may be lost
during the washing step, and inefficient cross-linkers may capture
intra-protein contacts, complicating the analysis. Designing mutations
for validation can be challenging, as the cross-linkers target specific
residues. In addition, this method does not allow the binding affinity of
mutant proteins to be compared with that of the wild-type proteins.
Thus, while useful for discovering new interfaces, XL-MS is not suitable
for validating them.
| Assay name | Type | Assay is based on… | Assay detects… | Scalable? | Able to test potential effect of mutation on specific PPI? | Able to study the effect of mutation on binding affinity of specific PPI? | Able to measure protein expression levels? | Able to check protein localization? | Comments |
|---|---|---|---|---|---|---|---|---|---|
| Affinity Purification (AP)-Mass Spectrometry (MS) | In vitro | Affinity purification* of a bait protein along with its prey (associated) partners, followed by MS | Protein complexes | Yes | No | No | No | No | *Protein purification might be time-consuming and large sample amounts are required |
| Co-immunoprecipitation (Co-IP)-Mass Spectrometry (MS) | In vitro | The use of specific antibodies* to pull down a target protein along with its prey (associated) partners, followed by MS | Protein complexes | No | No | No | No | No | *Expensive (e.g. due to the need for specific antibodies) |
| Isothermal Titration Calorimetry (ITC) | In vitro | Measuring heat changes if two proteins interact | Likely direct PPI | No | Yes | Yes | No | No | |
| Surface Plasmon Resonance (SPR) | In vitro | Measuring changes in refractive index to quantify binding | Likely direct PPI | Yes | Yes | Yes | No | No | |
| Microscale Thermophoresis (MST) | In vitro | Measuring the thermophoretic movement of molecules in a temperature gradient to quantify binding | Likely direct PPI | No | Yes | Yes | No | No | |
| Fluorescence Polarization (FP) | In vitro | Measuring changes in the polarization of fluorescent light emitted by a fluorophore | Likely direct PPI | No | Yes | Yes | No | No | |
| Cross-linking Mass Spectrometry (XL-MS) | In vitro / Cell-based | Using chemical cross-linkers to capture protein-protein interactions, followed by mass spectrometry to identify cross-linked peptides | Likely direct PPI | Yes | No* | No | No** | No** | *Not suitable for this (or the mutation design is quite complicated, as cross-linkers recognise specific residues, therefore the mutation has to be done or occur on them). **Can be if it is a cell-based assay, where proteins are tagged followed by measurement of the tag signal (e.g. fluorescence) |
| Yeast Two-Hybrid (Y2H) | Cell-based* | DNA-binding and activation domains fused to interacting proteins | Likely direct PPI | Yes | Yes | No | No** | No** | *Proteins are forced to be in the nucleus of the yeast. **If the proteins are tagged prior to transformation and checked by flow cytometry and microscopy |
| Protein Fragment Complementation Assay (PCA) | In vitro / Cell-based | The reconstitution of a transcriptional activator when two proteins of interest interact | Likely direct PPI | Yes | Yes | Yes | No | No | |
| Proximity Ligation Assay (PLA) | Cell-based | Using proximity-dependent ligation of oligonucleotide-conjugated antibodies to create a signal that is amplified and quantified if the proteins are within close proximity | Likely direct PPI | No | Yes | No | No | No | |
| Fluorescence Resonance Energy Transfer (FRET) | In vitro / Cell-based | Measures energy transfer between two fluorophores | Likely direct PPI | Yes | Yes | Yes | Yes | No* | *The localisation can be checked if combined with imaging |
| Bioluminescence Resonance Energy Transfer (BRET) | In vitro / Cell-based | Detecting the energy transfer between a bioluminescent donor and a fluorescent acceptor when they are in close proximity | Likely direct PPI | Yes | Yes | Yes | Yes | No* | *The localisation can be checked if combined with imaging |
| MAPPIT (Mammalian Protein-Protein Interaction Trap) | Cell-based | Reconstituting the JAK/STAT signaling pathway through the interaction of bait and prey proteins, leading to reporter gene activation | Likely direct PPI | Yes | Yes | Yes | No | No | |
| Luminescence-based two-hybrid assay (LuTHy) | Cell-based followed by cell-free | BRET-based assay followed by Co-IP | Likely direct PPI | Yes | Yes | Yes | Yes | No* | *The localisation can be checked if combined with imaging |

Table 1: Overview of different in vitro and cell-based methods to detect PPIs.
In parallel, cell-based methods have been developed. Cell-based bi-
nary methods detect PPIs mostly based on co-expression of genetically
tagged proteins. If these proteins interact, their tags come into proxim-
ity, producing various readouts to indicate a PPI. For example, common
read-outs include the reconstitution, activation or expression of reporter
proteins. A well-known example is the Yeast Two-Hybrid (Y2H) assay,
where the DNA-binding domain is fused to a bait protein, and the tran-
scription activation domain is fused to a prey protein (Chien et al.
1991; Fields et al. 1989). When the bait and prey interact, the
transcription factor is reconstituted, activating the reporter gene. The
presence of interaction is indicated by the activation of the reporter gene
and the growth of the yeast.
While Y2H has been used for interaction profiling
and for studying the effects of mutations on PPIs (see section 1.1), it
cannot directly indicate whether reduced yeast growth is due to
partial misfolding or unfolding of the proteins, as Y2H does not allow
protein expression to be monitored in real time. Additional techniques
like western blotting are needed for validation. Alternatively,
fluorescent tagging of proteins before transformation followed by
flow cytometry can be used to check protein levels, but this requires
additional steps, costs and expertise. Overall, while Y2H is a simple and
useful assay for detecting PPIs, its limitations hinder its ability to fully
characterize interaction interfaces.
Following principles similar to Y2H, many binary methods have
subsequently been developed to mitigate its shortcomings. Examples
include the Protein Fragment Complementation Assay
(PCA) and the Mammalian Protein-protein Interaction Trap (MAPPIT)
assay (see Table 1, 9-10). In PCA, a reporter protein (e.g. GFP)
is split into two non-functional fragments. These fragments are
genetically fused to the proteins of interest, one fragment to each
protein. When the two proteins interact, the fragments come into
proximity, allowing the reporter protein to reassemble and regain its
functional state, which serves as a readout for the interaction. The
advantage of this assay is the detection of likely direct PPIs in living
mammalian cells, which provides a more suitable cellular context for
testing interactions between human proteins. Similarly, MAPPIT is based
on the reconstitution of the JAK-STAT signaling pathway, a key pathway
involved in cytokine-mediated signal transduction. In MAPPIT, a mutated
cytokine receptor fused to a bait protein recruits JAK upon interaction
with a prey protein, leading to STAT activation and reporter gene
transcription (Lievens et al. 2011). Due to the involvement of this
pathway, the assay is limited to PPIs that occur near the plasma
membrane. Additionally, steric hindrance might interfere with potential
interactions. Both methods provide a mammalian context for the tested
interactions, but they do not allow protein expression to be monitored,
which is essential for the characterization of putative DMIs.
As an alternative to methods that rely on the reconstitution of a
reporter protein, methods like Förster resonance energy transfer (FRET)
and bioluminescence resonance energy transfer (BRET) offer more
direct readouts based on physical proximity. These assays detect PPIs
through non-radiative energy transfer between a donor and acceptor
molecule, which occurs only when they are in close proximity. In FRET,
proteins are fused with donor and acceptor fluorophores, and upon inter-
action, energy is transferred from the donor to the acceptor, generating
a detectable fluorescent signal (Sekar et al. 2003; S. S. Vogel et al.
2006; Grünberg et al. 2013).
In BRET, a luciferase is used as the donor and a fluorescent protein acts
as the acceptor. The donor is not excited with monochromatic light at
its specific excitation wavelength. Instead, the luciferase donor is
activated by a chemical substrate, such as coelenterazine-h. This substrate
is oxidized by the luciferase enzyme, leading to the emission
of light (Xu et al. 1999). For example, the NanoLuc luciferase tag
emits light with a maximum wavelength of 460 nm when coelenterazine-h
is used (Hall et al. 2012). Upon the addition and oxidation of the
substrate, when the proteins are in proximity, energy is transferred
from the donor to the acceptor. The emitted luminescence is commonly
detected at the short wavelength of the donor and the long wavelength
of the acceptor. The ratio of acceptor emission to donor emission is the
BRET ratio, which indicates a potential interaction (Pfleger et al. 2006).
Both methods allow transient PPIs to be studied in real time in living
mammalian cells. Monitoring protein expression levels is crucial for
characterizing interaction interfaces and understanding potential
interaction failures. To quantify the binding affinity, saturation
experiments can also be performed in both methods, where the quantity of
one interaction partner is kept constant while the amount of the other
protein is increased (Pfleger et al. 2006; Trepte et al. 2018).
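The two quantities described above can be sketched in a few lines of Python. The BRET ratio follows the definition in the text (acceptor emission over donor emission). The hyperbolic one-site curve used for the saturation experiment is an assumption on my part, as are the parameter names `bret_max` and `bret50` and all numerical values, which are made up for illustration.

```python
def bret_ratio(acceptor_emission, donor_emission):
    """BRET ratio: long-wavelength (acceptor) over short-wavelength (donor) signal."""
    if donor_emission <= 0:
        raise ValueError("donor emission must be positive")
    return acceptor_emission / donor_emission


def saturation_model(acc_don_ratio, bret_max, bret50):
    """Hyperbolic one-site saturation curve (an assumed model, see lead-in).

    acc_don_ratio: acceptor-to-donor expression ratio
    bret_max:      plateau BRET signal at saturating acceptor levels
    bret50:        acceptor-to-donor ratio giving half of bret_max
    """
    return bret_max * acc_don_ratio / (bret50 + acc_don_ratio)


# Made-up plate-reader counts for a single well.
ratio = bret_ratio(acceptor_emission=12_000, donor_emission=30_000)
print(f"BRET ratio: {ratio:.2f}")

# Made-up model parameters: the signal rises and then levels off.
for x in (0.5, 2.0, 8.0):
    print(x, round(saturation_model(x, bret_max=0.6, bret50=2.0), 3))
```

In a saturation experiment, a specific interaction is expected to approach a plateau as the acceptor-to-donor ratio increases, whereas a signal that keeps rising roughly linearly is more consistent with random collisions.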
Along with monitoring the expression levels of proteins, it is possible
to check the localization of mutated proteins relative to their wild-type
counterparts. For example, this can be achieved with bioluminescence
imaging (BLI) using high-content screening (HCS) microscopy (J. Kim et al. 2024).
In a high-content screening system, a plate with cells co-expressing the
tagged proteins is imaged. The HCS system is equipped with high-sensitivity
cameras and appropriate filters to detect the specific wavelengths of
light emitted by the tags. First, the fluorescence of the expressed proteins
is captured, and then, upon addition of the substrate, luminescence is
measured.
The main limitations lie in the sensitivity of the assay and the
orientation of the tags. As the assay relies on proximity, it may detect
indirect interactions between proteins that are part of the same complex.
Conversely, a real PPI might not be detected due to steric hindrance of
the tags, leading to false negatives.
In contrast to FRET, BRET offers several advantages:
• the use of luminescence and a substrate in BRET excludes acceptor
cross-excitation and donor photobleaching, which simplifies
data analysis
• reduced auto-fluorescence
• luciferase provides high sensitivity due to increased
signal-to-background ratios
• lower amounts of DNA are sufficient due to the high sensitivity
Hence, these advantages make BRET a suitable approach for the
validation of putative DMI interfaces. In a recent study, Wanker and
colleagues combined BRET and Co-IP with a luminescence-based readout
in one method (Trepte et al. 2018). This method, named the
luminescence-based two-hybrid assay (LuTHy), provides a double
readout for PPI detection, which enhances the confidence in identified
PPIs in a high-throughput setting.
Overall, given these advantages, the BRET assay might be the optimal
choice to incorporate into the proposed strategy to tackle the questions
addressed in my study. The Wanker lab kindly provided us with the
necessary donor and acceptor vectors, as well as the controls for our
study described in Chapters II and III.
1.6 Aims of the thesis
Despite advances in sequencing technologies, most genetic variants re-
main poorly understood, hindering our grasp of disease mechanisms and
complicating clinical diagnosis and treatment (see subsection 1.1.2).
Edgotyping has been proposed as a strategy to address this challenge
(see subsection 1.1.2). While several studies have attempted to apply
this strategy using Y2H, doing so entirely experimentally is laborious and
expensive and, therefore, inefficient. To address this problem, the goal
of my study is to propose a systematic approach that enables the
characterization of variant effects by predicting DMIs and using this
information for PPI profiling. This approach will include both
computational and experimental methods.
First, I aimed to build an experimental pipeline to validate putative
DMI interfaces. To achieve this, I need binary PPI data that can
serve as a resource for discovering new interfaces. HuRI is
the largest dataset of binary protein-protein interactions (Luck et al.
2020) and is described in section 1.1.2. In our lab, we have full access
to the HuRI open reading frame (ORF) collection. However, the ORFeome
collection currently exists in a single copy, whereas multiple
copies are essential for use and storage. Since cloning procedures and
site-directed mutagenesis are necessary for mutating proteins of interest
and testing PPIs, I will first test the cloning in tube format and then
adapt it to the plate format. Moreover, as the BRET assay has not yet been
established in our lab, I will also assess its sensitivity.
The second aim is to employ a data-driven approach to select PPIs
with predicted DMIs suitable for experimental validation. First, DMIs
need to be predicted. My colleague developed a DMI tool, used this tool
to generate predictions and mapped putative DMIs onto the HuRI PPI
dataset. To obtain mutations that may fall into predicted DMIs, another
colleague processed the mutations from the largest patient database,
ClinVar, and overlapped the ClinVar mutations with the mapped interface
predictions. To further select PPIs that are mapped with DMIs, overlap
with mutation data and are suitable for experimental validation, I need to
know which ORFs and which isoforms are available in the ORFeome
collection, and how many of those are cloneable and present in a full-length
context. Moreover, predicted DMIs will be manually inspected for
biological relevance. To do this, proteins will be annotated with
experimental and biological information. Furthermore, for the experimental
validation of selected PPIs, known DMI-mediated PPIs and PPIs mediated by
different interfaces such as DDIs will be chosen to serve as
positive and negative controls and included in the
study.
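The overlap between patient mutations and predicted motifs described in the second aim can be sketched as a simple interval check. This is a minimal illustration, not the pipeline used by my colleagues: the function names are hypothetical, the motif coordinates are placeholder assumptions, and the variant `Ala10Val` is made up to show a non-overlapping case.

```python
import re

# Three-letter protein-level notation such as "Met260Thr".
VARIANT_RE = re.compile(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def variant_position(variant):
    """Extract the residue position from a three-letter missense notation."""
    match = VARIANT_RE.match(variant)
    if match is None:
        raise ValueError(f"unsupported variant notation: {variant}")
    return int(match.group(2))


def variants_in_motifs(variants, motifs):
    """Return (protein, variant, motif_name) for variants inside a motif.

    variants: list of (protein, variant) tuples
    motifs:   list of (protein, start, end, motif_name), 1-based inclusive
    """
    hits = []
    for protein, variant in variants:
        position = variant_position(variant)
        for motif_protein, start, end, name in motifs:
            if protein == motif_protein and start <= position <= end:
                hits.append((protein, variant, name))
    return hits


# Placeholder coordinates (assumptions for illustration only).
predicted_motifs = [
    ("AFF4", 255, 262, "degron"),
    ("HDAC4", 242, 248, "14-3-3 ligand"),
]
# The first two variants are mentioned in the text; the third is made up.
patient_variants = [("AFF4", "Met260Thr"), ("HDAC4", "Glu247Gly"), ("HDAC4", "Ala10Val")]

overlaps = variants_in_motifs(patient_variants, predicted_motifs)
for protein, variant, name in overlaps:
    print(f"{protein} {variant} falls into predicted motif: {name}")
```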
Chapter 2
The development of the
medium-throughput cloning and the
BRET assay pipeline for the
experimental validation of predicted
DMIs
2.1 Preparation of the wild-type human ORFeome
collection
As stated in Aim 1, the availability of a comprehensive ORFeome col-
lection is essential for my project. This collection provides access to
GATEWAY-compatible clones for most wild-type proteins from the
HuRI dataset, which are necessary for cloning into LuTHy expression
vectors that will be used in interaction profiling.
My supervisor Katja Luck brought one copy of a human ORFeome
collection from her postdoctoral lab, comprising ORFs for around 17,500
human protein-coding genes. These ORFs are stored as GATEWAY-compatible
clones, allowing them to be transferred to the destination
vectors carrying the fluorescence and luminescence reporter tags needed
for the BRET assay. As this collection only came in one copy, for
maintenance and safety reasons, together with my colleague, I adapted and
optimized a protocol for making three copies of the ORFeome using
the Rainin Liquidator 96 manual pipetting system, kindly provided by
the Khmelinskii group (Figure 2.1). The first copy serves as a working
collection, the second copy serves as a backup and the final copy is
to be given to the media lab for the IMB community as an open
resource. Overall, the 2-day protocol enables handling 16 96-well plates
for making three copies. Original plates are thawed, and fresh plates
with media are inoculated and incubated overnight. The
incubation can be challenging due to substantial evaporation, which leads
to a loss of the volume needed to make three copies on the second day.
Therefore, to optimize this step, we tested different
incubators, materials to seal plates, boxes to cover plates in the
incubator, and well volumes. In two months, we
successfully copied 238 plates.
Figure 2.1: The scheme of cloning pipeline followed by BRET assay. The
ORFeome collection was copied, where one was a backup copy, the second was made
for the IMB community and the third served as a working copy. The ORFs selected
for cloning were incubated in 96 deep-well plates overnight, and DNA on the next
day was verified by sequencing. Next, the clones were shuffled from the donor
GATEWAY vector to the destination vector using the LR reaction. The mutant
constructs are generated by site-directed mutagenesis and sequence verified. Upon
cloning, the BRET assay is performed.
2.2 The assessment of the sensitivity of BRET assay
To evaluate the sensitivity of the assay chosen for the experimental
validation, I needed to adapt the cloning of GATEWAY vectors from
tube to plate format and adapt the mutagenesis protocol to a
medium-throughput pipeline. To do this, I used the open reading frames (ORFs)
coding for proteins, mutations and PPIs as well as controls from my
collaborative project with the Koenig (IMB) and Sattler (Institute of
Structural Biology, Helmholtz German Research Center for Environmental
Health) groups.
The Koenig lab has recently established Far Upstream Element Bind-
ing Protein 1 (FUBP1) as a novel regulator in mRNA splicing. Our aim
in this project is to aid in the identification of PPIs between FUBP1
and known protein components of the 5’ and 3’ splice sites as well as of
the branchpoint on the mRNA and to delineate the corresponding in-
teraction interfaces. For this project, I generated 64 different constructs
using the cloning technique followed by BRET assay (Figure 2.1). To
test the sensitivity of the assay, I transfected different ratios of ORFs
in donor-acceptor constructs (1:10 ng, 1:20 ng, 1:50 ng, 1:100 ng, 1:200
ng) into HEK293 mammalian cells for co-expression.
Along with these pairs, I also included the standard controls. The Wanker
group, developers of the LuTHy method, kindly provided us with standard
controls, including empty vector controls to rule out background
effects from the vector, donor-only (NanoLuc) and acceptor-only (mCitrine)
constructs to ensure that interactions require both constructs, and
non-interacting protein pairs to check for false positives. A positive control
pair with the known protein-protein interaction BAD-BCL2L1 was included
to validate the system’s functionality.
Additionally, I used random protein pair controls for each tested pair,
consisting of proteins not expected to interact, such as those with
different cellular localizations (e.g., nuclear proteins paired with
cytoplasmic proteins). As the proteins of interest are localized in the
nucleus and the proteins of the positive control pair are found in the
cytoplasm, we paired each tested protein of interest with one protein
from the positive control pair.
I tested all interactions together with controls and quantified BRET.
The corrected BRET (cBRET) ratio is calculated by subtracting
the BRET ratio of either control (the donor-only (i.e. NanoLuc) or the
acceptor-only (i.e. mCitrine) construct) from the BRET ratio of the studied
interaction of interest. Our findings showed that cBRET values for the
weak interactions were close to the cBRET values of the random pairs.
Based on this information, I learned that a high amount of transfected
cDNA might lead to the generation of false-positive data. Therefore,
we evaluated assay specificity by testing
a range of different DNA ratios for the previously tested interactions. I
found that 1:50 ng appears to be a good ratio for discriminating
significant from non-significant cBRET signals (Figure 2.2).
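The correction described above can be written as a short helper. This is a hedged sketch: the text states that the ratio of either control is subtracted, and I assume here that the larger of the two control ratios is used, which is the more conservative choice. The function name and all numerical values are made up for illustration.

```python
def corrected_bret(bret_interaction, bret_donor_only, bret_acceptor_only):
    """cBRET = BRET ratio of the tested pair minus a control BRET ratio.

    Assumption: the larger of the two control ratios (donor-only or
    acceptor-only) is subtracted, giving the more conservative cBRET.
    """
    control = max(bret_donor_only, bret_acceptor_only)
    return bret_interaction - control


# Made-up BRET ratios for a tested pair and its two controls.
cbret_tested = corrected_bret(0.62, bret_donor_only=0.40, bret_acceptor_only=0.45)
# Made-up values for a random pair, expected to stay close to zero.
cbret_random = corrected_bret(0.46, bret_donor_only=0.40, bret_acceptor_only=0.45)

print(f"tested pair cBRET: {cbret_tested:.2f}")
print(f"random pair cBRET: {cbret_random:.2f}")
```

Comparing the cBRET of a tested pair with that of random pairs measured at the same DNA ratio is what allows weak interactions to be separated from background.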
Figure 2.2: The evaluation of BRET’s sensitivity with 1:50 ng of donor:
acceptor DNA ratio. The plot represents calculated cBRET ratios for tested
FUBP1 (orange), U2AF2 (red) interactions, positive controls (green) and random
pairs (gray) as a function of acceptor to donor (acc/don) protein expression ratio.
All values are the mean +/- s.d. from two technical replicates.
In summary, copying the ORFeome collection and testing BRET assay
sensitivity enabled quick access to the ORFs and helped define the
ratio of tested constructs needed for the transfection to avoid false
positives. This allowed us to apply the BRET assay in interaction
profiling to further explore protein-protein interactions (PPIs)
involved in mRNA splicing.
2.3 Article I: FUBP1 is a general splicing factor
facilitating 3’ splice site recognition and splicing
of long introns
Summary
The splicing of pre-mRNA plays a crucial role in gene regulation and
the expansion of the proteome in eukaryotes. However, how splice sites
are recognized and paired during spliceosome assembly
is not understood in detail. This project focused on understanding the
role of FUBP1 in RNA splicing, particularly its function in the recogni-
tion and processing of 3’ splice sites and splicing of long introns. Using
in vivo iCLIP analysis we found that FUBP1 binds to 91.3 % of 3’
splice sites in a similar pattern as core splicing factors like SF1, U2AF2
and SF3B1. Further investigation showed that FUBP1 recognizes a
cis-regulatory RNA motif located upstream of the branch point (BP) in
41
pre-mRNA. Through EMSA and ITC experiments, we demonstrated
that FUBP1 binds GU-rich sequences. This was further validated by
NMR and in vivo iCLIP data showing that KH domains of FUBP1
independently recognize these motifs. Moreover, kinetic modeling and
transcriptional profiling demonstrated that FUBP1 is required for effi-
cient splicing of long introns, which represent 80 % of human introns.
Next, we explored the interactions of FUBP1 with other splicing factors.
First, we studied FUBP1 interactions with components of spliceosome
complexes. Here, NMR analysis provided insights into the interaction
between FUBP1 and U2AF2, a key component of the 3’ splice site complex.
The preliminary structure from the NMR study suggests that the second
RRM domain of U2AF2 and the N-terminal N-box of FUBP1 represent the
minimal binding regions. Furthermore, using recombinant fragments of
FUBP1 and U2AF2, we found that the amino acid change from alanine to
aspartate at residue 38 (A38D), located in the N-box of FUBP1, disrupts
the interaction. These data were supported in a full-length context by
BRET experiments in mammalian cells (Article I, Figure 3, C & J). The
BRET data indicate that the mutation significantly increased the distance
between the proteins but did not completely disrupt the interaction. We
hypothesize that mutated FUBP1 and U2AF2 remain in proximity because
both bind the same mRNA, but that their contact is much weaker because
the direct interface is lost. With BRET, we also confirmed the known
interaction interface between FUBP1 and SF1, a branch point-binding
protein at the 3’ splice site (Article I, Figure 3, C & D). Furthermore,
we tested the interactions of FUBP1 with U1 snRNP-associated proteins,
including SNRPA, SNRPC, TIAL1, PRPF40B and SNRPB, as well as TCERG1 and
KHDRBS1, and confirmed these interactions with BRET and/or NMR.
Interestingly, NMR analysis suggested that FUBP1’s A/B boxes interact
with proline-rich regions of SNRPB. The corresponding NMR spectral
changes were less pronounced with SNRPA and PRPF40B, which contain
similar proline-rich stretches (Article I, Figure 6, G).
Overall, this study provided a comprehensive analysis demonstrating
the global role of FUBP1 in pre-mRNA splicing. Our key findings
suggest that FUBP1 acts as a general splicing regulator at the 3’ splice
site. Moreover, many of the tested interactions mediated by domain-motif
interfaces (DMIs) could be detected by BRET, and disruption of these
interfaces by point mutations established this assay as a valuable system
to validate predicted DMIs.
Article
FUBP1 is a general splicing factor facilitating 3′
splice site recognition and splicing of long introns
Authors
Stefanie Ebersberger, Clara Hipp,
Miriam M. Mulorz, ..., Katja Luck,
Michael Sattler, Julian König
Correspondence
k.luck@imb-mainz.de (K.L.),
michael.sattler@helmholtz-munich.de
(M.S.),
j.koenig@imb-mainz.de (J.K.)
In brief
Ebersberger et al. identify the RNA-
binding protein FUBP1 as a key splicing
factor that binds to a hitherto unknown
cis-regulatory motif at 3′ splice sites.
Multivalent interactions of FUBP1 with
splice site components support
spliceosome assembly at multiple stages
and ensure efficient splicing of long
introns.
Highlights
• FUBP1 recognizes a ubiquitous cis-regulatory RNA motif
upstream of the branch point
• Multivalent interactions in disordered FUBP1 regions support
spliceosome assembly
• FUBP1 affects long introns, which are prevalent in humans
and altered in cancer
• Kinetic modeling and protein interactions implicate FUBP1 in
splice site bridging
Ebersberger et al., 2023, Molecular Cell 83, 2653–2672
August 3, 2023 © 2023 The Author(s). Published by Elsevier Inc.
https://doi.org/10.1016/j.molcel.2023.07.002
FUBP1 is a general splicing factor
facilitating 3′ splice site recognition
and splicing of long introns
Stefanie Ebersberger,1,12 Clara Hipp,2,3,12 Miriam M. Mulorz,1,12 Andreas Buchbender,1 Dalmira Hubrich,1
Hyun-Seo Kang,2,3 Santiago Martínez-Lumbreras,2,3 Panajot Kristofori,4 F.X. Reymond Sutandy,1
Lidia Llacsahuanga Allcca,1,13 Jonas Schönfeld,1 Cem Bakisoglu,5 Anke Busch,1 Heike Hänel,1 Kerstin Tretow,1
Mareen Welzel,1 Antonella Di Liddo,1 Martin M. Möckel,1 Kathi Zarnack,5,6 Ingo Ebersberger,7,8,9 Stefan Legewie,10,11
Katja Luck,1,* Michael Sattler,2,3,* and Julian König1,14,*
1Institute of Molecular Biology (IMB) gGmbH, 55128 Mainz, Germany
2Institute of Structural Biology, Helmholtz Center Munich, 85764 Neuherberg, Germany
3Bavarian NMR Center, Department of Bioscience, School of Natural Sciences, Technical University of Munich, 85747 Garching, Germany
4Department of Systems Biology, Institute for Biomedical Genetics (IBMG), University of Stuttgart, 70569 Stuttgart, Germany
5Buchmann Institute for Molecular Life Sciences & Institute of Molecular Biosciences, Goethe University Frankfurt, 60438 Frankfurt am Main,
Germany
6CardioPulmonary Institute (CPI), 35392 Gießen, Germany
7Applied Bioinformatics Group, Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, 60438 Frankfurt am Main, Germany
8Senckenberg Biodiversity and Climate Research Center (S-BIK-F), 60325 Frankfurt am Main, Germany
9LOEWE Center for Translational Biodiversity Genomics (TBG), 60325 Frankfurt am Main, Germany
10Department of Systems Biology, Institute for Biomedical Genetics (IBMG), University of Stuttgart, 70569 Stuttgart, Germany
11Stuttgart Research Center for Systems Biology (SRCSB), University of Stuttgart, 70569 Stuttgart, Germany
12These authors contributed equally
13Present address: University of California, Berkeley, CA 94720, USA
14Lead contact
*Correspondence: k.luck@imb-mainz.de (K.L.), michael.sattler@helmholtz-munich.de (M.S.), j.koenig@imb-mainz.de (J.K.)
https://doi.org/10.1016/j.molcel.2023.07.002
SUMMARY
Splicing of pre-mRNAs critically contributes to gene regulation and proteome expansion in eukaryotes, but
our understanding of the recognition and pairing of splice sites during spliceosome assembly lacks detail.
Here, we identify the multidomain RNA-binding protein FUBP1 as a key splicing factor that binds to a hitherto
unknown cis-regulatory motif. By collecting NMR, structural, and in vivo interaction data, we demonstrate
that FUBP1 stabilizes U2AF2 and SF1, key components at the 3′ splice site, through multivalent binding
interfaces located within its disordered regions. Transcriptional profiling and kinetic modeling reveal that
FUBP1 is required for efficient splicing of long introns, which is impaired in cancer patients harboring
FUBP1 mutations. Notably, FUBP1 interacts with numerous U1 snRNP-associated proteins, suggesting a
unique role for FUBP1 in splice site bridging for long introns. We propose a compelling model for 3′ splice
site recognition of long introns, which represent 80% of all human introns.
INTRODUCTION
Splicing is a crucial step in eukaryotic mRNA processing, and its dysregulation is a hallmark of many
cancers.1–3 Splicing is catalyzed by the spliceosome, a megadalton machinery comprising five small nuclear
ribonucleoprotein (snRNP) complexes named U1, U2, U4, U5, and U6.4–7 During early spliceosome assembly
(E complex formation), the 5′ and 3′ splice sites are recognized: U1 binds at the 5′ splice site, whereas
U2 auxiliary factor 1 (U2AF1), U2AF2, and splicing factor 1 (SF1) assemble at the 3′ splice site,6–11
where they specifically recognize AG dinucleotide,12,13 polypyrimidine (Py) tract,14–16 and branch point (BP)
site, respectively (Figure 1A).9,17 In the resulting A complex, U2 snRNP is recruited to the BP and
stabilized by SF3A and SF3B, and SF1 is released.18,19 Subsequent snRNP recruitment and further
rearrangements (formation of B and C complexes) mediate intron excision and exon ligation to form the
mature mRNA.
Strikingly, mechanistic details of splice site recognition by multidomain splicing factors during early
spliceosome assembly are lacking.20,21 U2AF2 binding is central to the early definition of splice sites
and is subject to layers of regulation including direct
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Figure 1. FUBP1 binds upstream of the branch point at 3′ splice sites during early spliceosome assembly in vivo
(A) Schematic of spatial RBP assembly at the 3′ splice site in the "commitment" E complex and the pre-spliceosomal A complex. BP, branch point.
(B) iCLIP in HeLa cells. Distribution of binding sites across transcript regions for FUBP1 (n = 854,404), U2AF2 (n = 914,221), SF1 (n = 99,305), SF3B1
(n = 1,694,991), and PTBP1 (n = 127,450). 3′ and 5′ splice sites (ss) refer to 100 nt upstream/downstream of exons, respectively. CDS, coding sequence;
UTR, untranslated region.
(C) Metaprofiles of cross-link events of FUBP1, U2AF2, SF1, SF3B1, and PTBP1 relative to the BP.
(D) Genome browser view of an internal exon in the CPS1 mRNA displaying the iCLIP data for FUBP1, U2AF2, SF1, and SF3B1 from HeLa cells.
(E) Saturation analysis showing the percentage of bound 3′ splice sites for each RBP in each quantile.
competition, cooperative recruitment, change of RNA secondary structure, dynamic conformational states,
and autoinhibition.15,22–30 Despite the pivotal role of U2AF2, the precise contribution of cofactors and
multivalent interactions are yet to be elucidated. Recently, we reported how U2AF2 achieves specificity
despite the degeneracy of its pyrimidine-rich RNA-binding motif.28 In this study, we found that the
RNA-binding protein (RBP) far upstream binding protein 1 (FUBP1) promotes U2AF2 binding to RNA.
FUBP1 was initially characterized as a transcriptional regulator of the proto-oncogene c-myc through
binding to AT-rich DNA elements and interaction with PUF60, also known as the FUBP-interacting repressor
(FIR).31–34 However, more recently, FUBP1 has also been reported to bind RNA and to influence translation
or splicing of specific transcripts.35–38 Similar to its DNA-binding specificity, FUBP1 exhibits a general
preference for AU- and GU-rich RNA31 that is expected to derive from its four K homology (KH)
domains.39 Notably, cancer-associated
Figure 2. FUBP1 binds a hitherto unknown cis-regulatory motif upstream of the BP
(A) Genome browser view of an internal exon in the VPS13D mRNA displaying the iCLIP data for FUBP1, U2AF2, SF1, and SF3B1.
(B) Domain architecture of FUBP1. KH, K homology domain; P-rich, proline-rich stretch.
(C) Agarose gel (left) and quantification with fitted curve (right) from an EMSA experiment with recombinant FUBP1N-box+KH (50–6,400 nM) and a fluorescently
labeled 132-nt RNA fragment of VPS13D (100 nM). Measurements were performed in duplicates and data are represented as mean ± standard deviation (SD).
(D) Binding affinity for the interaction of FUBP1KH with VPS13D RNA determined by ITC. ITC measurements were performed in triplicates and data are repre-
sented as mean ± SD.
(legend continued on next page)
loss-of-function mutations within FUBP1 have been connected to global splicing changes in low-grade
glioma,1,40–42 suggesting an RNA-regulatory role in these processes. Here, we reveal a global role for
FUBP1 in pre-mRNA splicing. Our results suggest that FUBP1 functions as a general splicing factor at the
3′ splice site, with a crucial role in promoting efficient splicing of long introns, which make up over
80% of human pre-mRNA transcripts.
RESULTS
FUBP1 is a core component of 3′ splice site recognition
To dissect the role of FUBP1 in splicing, we examined the footprint of FUBP1 and other splicing factors
on pre-mRNA in HeLa cells using in vivo individual-nucleotide resolution UV cross-linking and
immunoprecipitation (in vivo iCLIP; Figures 1B and S1A; Table S1).43,44 As expected, large proportions
of the binding sites of SF1, U2AF2, and SF3B1 are located at 3′ splice sites (10%, 17%, and 22%,
respectively). Interestingly, FUBP1 shows a similar preference for 3′ splice sites (19%). By contrast,
for the more restricted splicing regulator PTBP1, which is known to act on a subset of exons, only 1% of
binding sites are located at 3′ splice sites. We confirmed that U2AF2 binds at the Py tract located
between the BP and 3′ splice site,45,46 whereas SF1 binding peaks at the BP, with a reduced signal at the
BP adenine itself,9,17 presumably owing to the lower cross-linking efficiency of adenine (Figures 1C and
1D).47 Consistent with a previous report,48 SF3B1 binds in a clamp-wise manner up- and downstream of the
BP. Strikingly, FUBP1 also shows a pronounced footprint at the BP (Figures 1C and 1D). Its binding peaks
at a location 34 nucleotides (nt) upstream of the BP and tails for up to 100 nt. In comparison, PTBP1
does not display such a ubiquitous positioning at 3′ splice sites (Figure 1C).49,50 Next, we addressed
what fraction of 3′ splice sites is bound using a saturation-based analysis that controls for splice site
usage and transcript abundance.51 We found that FUBP1 binds the same percentage of 3′ splice sites as
U2AF2 and SF3B1, which are both universally present at 3′ splice sites (91.3%, 95.4%, and 99.6%,
respectively; Figure 1E). By contrast, SF1 and PTBP1 are associated with 27.3% and 3.1% of 3′ splice
sites, respectively (Figures 1E and S1B). Overall, these data suggest that FUBP1 functions as a general
splicing factor in early spliceosome assembly.
FUBP1 binds a cis-regulatory RNA motif upstream of the branch point
Given the prevalence of FUBP1 upstream of the BP, we investigated its RNA-binding preferences. First, we
performed electrophoretic mobility shift assays (EMSAs) with a 132-nt RNA fragment upstream of the
prototypical 3′ splice site of exon 43 of the VPS13D mRNA (VPS13D) and a shortened fragment (36 nt) with
the region showing the most FUBP1 binding in iCLIP (VPS13Dshort; Figure 2A). We observed strong binding
of FUBP1 (FUBP1N-box+KH, aa 1–457) to both RNAs in the low nanomolar range (Figures 2B, 2C, and S1C).
Isothermal titration calorimetry (ITC) with VPS13D yielded a similar result (Figure 2D; Table S2),
confirming the high-affinity binding at this region.
FUBP1 harbors four KH domains, which are expected to bind single-stranded RNA and DNA32,52 and can act
either independently or synergistically53–55 to recognize extended regions of pre-mRNA. We used nuclear
magnetic resonance (NMR) spectroscopy to investigate the modular arrangement of the four FUBP1 KH
domains. Superimposition showed that the NMR spectrum of FUBP1KH (aa 86–457) containing KH1–4 was
virtually identical to those of the individual KH domains, indicating that the KH domains are
structurally independent (Figure S1D). Furthermore, NMR secondary structure analysis revealed that FUBP1
contains KH domains with a typical type I fold that are connected by flexible linkers (Figure S1E).56 We
conclude that the KH domains of FUBP1 are not preformed into an RNA-binding platform but rather can be
considered like beads on a string.
To characterize the individual RNA-binding preferences of the four KH domains, we performed a
scaffold-independent analysis (SIA), which is based on changes in NMR chemical shifts upon titration with
short oligonucleotide motifs (Figure S2A).57 Initial binding experiments were performed using randomized
pools of 5-mer DNA, followed by verification of the identified motifs using RNA oligonucleotides
(Figure S2B). SIA identified well-defined consensus motifs for KH1 (UUUG) and KH2 (UUGU) and more loosely
defined motifs for KH3 (YBKK, where Y = C or U; B = C, G, or U; K = G or U) and KH4 (YUKK). Hence, all
four KH domains exhibit a preference for GU-rich sequences (Figure 2E). The affinities of the individual
KH domains to the final motifs, as determined by NMR spectroscopy, are in the high micromolar range
(Figures S2C–S2F). Combinations of two KH domains and motifs show strong binding avidity: the
ITC-measured affinities for tandem domains were in the high nanomolar to low micromolar range
(Figures S2G–S2I; Table S2). This suggests that specificity and high affinity are achieved by avidity and
multivalent interactions between the four KH domains and RNA with multiple binding motifs (Figure 2F).
Indeed, EMSA and ITC experiments confirmed that multiple FUBP1 binding motifs in the VPS13D mRNA fragment
increase FUBP1 binding to nanomolar affinity (Figures 2A, 2C, and 2D).
(E) Scaffold-independent analysis (SIA)-derived binding motifs for individual FUBP1 KH domains. Preferred bases are highlighted in white. Y, pyrimidine (T or C);
B, not A (C, G, or T); K, keto (G or T).
(F) KD values of individual and tandem KH domains with their optimal DNA target (KH1, TTTTG; KH2, TTTGT; KH3, TCTGT; KH4, TTTTG; KH1-2,
TTTGTAAAATTTTG; KH2-3, TCTGTAAAATTTGT; KH3-4, TTTTGAAAATCTGT) determined by NMR or ITC, respectively (Figures S2C–S2I; Table S2). ITC
measurements were performed in triplicates. For NMR, the KD values of eight selected residues were calculated. Data are represented as mean ± SD.
(G) Motif enrichment in the in vivo FUBP1 iCLIP data. Disjunct 4-mer frequencies were calculated for the top vs bottom 20% of binding sites based on expression-
normalized iCLIP signals.
(H) Positional enrichment of FUBP1 binding motifs and control motifs relative to the BP. UUU+A/G/C, i.e., 4-mers containing UUU interspersed at any position
with A/G/C. NNNN, 100,000 sets of random combinations of four 4-mers. 4-mer frequencies were calculated position-wise upstream of the BP and compared
with the average 4-mer frequencies in an intronic control region. Top: Metaprofile of normalized FUBP1 and SF3B1 iCLIP cross-link events at the same 3′ splice
sites is shown for comparison.
(I) Abundance of FUBP1 binding motifs at 3′ splice sites of human introns. Background distribution for all possible 4-mers (mean ± 1 SD) is shown in gray.
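The degenerate motif notation used in this legend (Y, B, K) can be made concrete with a short sketch. The Python example below is illustrative only and is not the analysis code of the study: it expands each IUPAC code into its allowed bases (written in the RNA alphabet, with U in place of T) and reports where each reported KH consensus motif matches a sequence. The function names and the toy sequence are assumptions introduced for this illustration.

```python
import re

# IUPAC codes from the legend, written for RNA (U instead of T).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "Y": "CU",   # pyrimidine
    "B": "CGU",  # not A
    "K": "GU",   # keto
}

# Consensus motifs reported in the text for the four KH domains.
KH_MOTIFS = {"KH1": "UUUG", "KH2": "UUGU", "KH3": "YBKK", "KH4": "YUKK"}


def motif_to_regex(motif):
    """Translate an IUPAC motif into one regex character class per position."""
    return "".join("[" + IUPAC_RNA[base] + "]" for base in motif)


def find_motif(motif, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [match.start() for match in pattern.finditer(sequence)]


if __name__ == "__main__":
    toy_rna = "AUUUGUCUGUA"  # hypothetical sequence, not taken from the paper
    for domain, motif in KH_MOTIFS.items():
        print(domain, motif, find_motif(motif, toy_rna))
```

The lookahead pattern is used so that overlapping occurrences are counted, which matters for short U-rich motifs that can start at adjacent positions.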
Interrupting the U-rich motifs of VPS13Dshort with cytidines severely reduces the binding affinity,
underlining the specificity of the FUBP1-RNA interaction (Figure S1C).
To validate the interaction between FUBP1 and RNA motifs in cells, we compared 4-mer motifs in the sites
of strongest FUBP1 binding over background in the in vivo iCLIP data. In line with the SIA, we found a
strong preference for uridine-rich motifs at FUBP1 binding sites (Figures 2G and S2J). For in vivo
binding, these motifs can be interspersed at any position by adenine or, to a lesser extent, by guanine.
Consistent with the omnipresence of FUBP1 at 3′ splice sites, we observed a striking enrichment of FUBP1
binding motifs ("UUU+A" and "UUU+G," i.e., three uridines interspersed at any position with adenine or
guanine) upstream of the BP, where they coincide with FUBP1 binding (Figure 2H). Conversely, both
"UUU+C," accounting for general uridine richness, and random motif sets are enriched closer to the 3′
splice site but not in the main region of FUBP1 binding. Importantly, enriched FUBP1 motifs upstream of
the BP are a common feature across all annotated introns (Figures 2H, 2I, S2K, and S2L), indicating that
we identified a previously unknown cis-regulatory RNA motif in splicing regulation.
FUBP1 directly interacts with U2AF2 and SF1
Given the prevalence of FUBP1 at functional 3′ splice sites, we examined whether FUBP1 interacts with key
early 3′ splice site components in cells using bioluminescence resonance energy transfer (BRET)
(Figure 3A).58 Interaction signals in the BRET assay are indicative of direct contacts or close
proximities. As a proof-of-concept, we confirmed the known U2AF2-SF1 interaction.8,10,18,58 Importantly,
we also observed interactions of FUBP1 with U2AF2 and SF1 (Figures 3B, 3C, and S2M), suggesting that
FUBP1 is in close or direct contact with these core splicing factors inside cells.
To investigate whether FUBP1 directly interacts with SF1 (Figure 3C), we focused on the C-terminal region
of FUBP1, which harbors the A and B boxes (A/B boxes). These motifs are specific to the FUBP family of
proteins and have been shown to mediate binding to a proline-rich region of snRNP-U1-70K in fruit
flies.60,59 Similar proline-rich regions are also present in the C-terminal region of SF1. Consistently,
the SF1-FUBP1 interaction detected by BRET is reduced upon deletion or mutation of the A/B boxes
(Figures 3B–3D and S2M). In addition, 1H-15N correlation NMR spectra of the FUBP1A/B box region show
specific chemical shift perturbations (CSPs) upon titration with the proline-rich region of SF1,
indicating direct binding (Figure 3E).
To map the interacting regions in FUBP1 and U2AF2, we performed NMR titration experiments using
15N-labeled U2AF2RRM12 14,15,30 and unlabeled full-length FUBP1. Large CSPs and line broadening in the
1H-15N correlation spectra exclusively map to the U2AF2 RRM2 domain, especially to the two α helices on
the backside of the β sheets that mediate RNA binding (Figures 3B, 3F, S2N, and S3A). Moreover, a
construct comprising the N-terminal region of FUBP1 (FUBP1N74, aa 1–74) recapitulates the CSPs observed
with full-length FUBP1, whereas a construct lacking the N-terminal region (FUBP1ΔN, aa 75–644) does not
yield any evident CSPs (Figures 3F and S3A). Complementary NMR titrations with 15N-labeled FUBP1
constructs identify the U2AF2 RRM2 domain and a short peptide motif in the N-terminal region of FUBP1
(aa 27–52), referred to as N-box, as the minimal binding regions (Figures S3B–S3F). The U2AF2 RRM2-FUBP1
N-box interaction exhibits micromolar affinity by NMR titrations (Figures 3G, S3D, S3G, and S3H;
Table S2).
To provide a high-resolution view, we determined the NMR-derived solution structure of the U2AF2
RRM2-FUBP1 N-box complex (Figures 3H and S3I–S3K; Table 1). This structure shows a well-defined U2AF2
RRM2 domain and a more mobile helical FUBP1 N-box and reveals that the FUBP1N-box forms an α helix, which
is recognized by helices α1 and α2 and the β4 strand of U2AF2RRM2. Hydrophobic interactions dominate at
this interface, where four alanines in FUBP1N-box (A30, A34, A38, and A42) are aligned along the extended
hydrophobic interface, with A38 positioned centrally. Additional contacts involving bulkier side chains,
that is, R37 and I41 in FUBP1 and L278 and M323 in U2AF2, further stabilize the binding interface. The
recognition of the FUBP1 N-box resembles the interaction between FUBP1 N-box and PUF60,34 consistent with
structural similarities between PUF60 and U2AF2 RRM2 (Figure S4A).34
Figure 3. FUBP1 directly interacts with SF1 and U2AF2 via its C-terminal A/B boxes and N-terminal N-box
(A) Schematic of BRET assay. Energy transfer between the substrate oxidized by NanoLuc luciferase (donor, Don) and mCitrine (acceptor, Acc) occurs if proteins
X and Y interact.
(B) Domain architecture of U2AF2 (UniProt: P26368) and SF1 (UniProt: Q15637). ULM, U2AF ligand motif; RRM, RNA-recognition motif; UHM, U2AF homology
motif family; Qua2, quaking homology 2 domain; ZF, zinc finger.
(C) BRET values for tested interaction pairs and controls. Two biological replicates are shown. Error bars represent SD of technical triplicates. Trp-to-Arg
mutations in the A/B boxes were rationalized based on disrupting the hydrophobic contacts as previously reported.59
(D) BRET saturation curves for combinations of FUBP1 variants and wild-type SF1. Trp-to-Arg mutations in the A/B boxes or their deletion significantly lowered
the maximal BRET signal, although changes in the BRET50 (acceptor/donor ratio at which half-maximal BRET signal is reached) were not significant. Amounts of
acceptor and donor proteins were estimated by fluorescence and total luminescence, respectively, in intact cells. Two biological replicates are shown. Error bars
represent SD of technical triplicates.
(E) NMR titration of FUBP1A/B with SF1P-rich. Significant chemical shift changes are highlighted by boxes.
(F) Binding interface mapping based on NMR titration of U2AF2RRM12 with full-length FUBP1, FUBP1N74, and FUBP1ΔN (Figure S3A).
(G) Binding affinity for the interaction of FUBP1N-box and U2AF2RRM2 from NMR titrations. Chemical shift differences of four exemplary residues of FUBP1N-box
(Figures S3D and S3G) are fitted to binding isotherm to estimate the KD. Data are represented as mean ± SD of calculated KD values of eight selected residues.
(H) NMR-derived structure of the complex of U2AF2RRM2 (green) and FUBP1N-box (brown) (Figure S3K; Table 1, PDB: 8P25).
(I) Comparison of NMR titrations of FUBP1N-box WT and mutant FUBP1N-box+A38D with U2AF2RRM2.
(J) BRET saturation curves for wild-type FUBP1 and mutant FUBP1A38D against U2AF2. Two biological replicates are shown. Error bars represent SD of technical
triplicates.
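The two quantities read from the saturation curves in (D) and (J), the maximal BRET signal and the BRET50, can be illustrated with a minimal sketch. The example below assumes a simple hyperbolic saturation model, BRET = BRETmax · x / (BRET50 + x), where x is the acceptor/donor ratio. This model, the grid-search fit, the function names and the synthetic data are all assumptions made for illustration; the fitting procedure actually used in the study is not described in this section.

```python
def bret_saturation(acc_don_ratio, bret_max, bret50):
    """Hyperbolic saturation model (assumed): half of bret_max is reached at bret50."""
    return bret_max * acc_don_ratio / (bret50 + acc_don_ratio)


def fit_saturation(ratios, signals, bret50_grid):
    """Fit the model by grid search over BRET50.

    For a fixed BRET50 the model is linear in BRETmax, so the least-squares
    BRETmax has a closed form. The candidate with the smallest sum of squared
    errors is returned as (bret_max, bret50).
    """
    best = None
    for bret50 in bret50_grid:
        shape = [x / (bret50 + x) for x in ratios]
        denom = sum(s * s for s in shape)
        if denom == 0:
            continue
        bret_max = sum(s * y for s, y in zip(shape, signals)) / denom
        sse = sum((bret_max * s - y) ** 2 for s, y in zip(shape, signals))
        if best is None or sse < best[0]:
            best = (sse, bret_max, bret50)
    return best[1], best[2]


if __name__ == "__main__":
    # Synthetic, noise-free data generated from the model itself (not measured values).
    ratios = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
    signals = [bret_saturation(x, 0.2, 2.0) for x in ratios]
    grid = [i * 0.5 for i in range(1, 21)]  # candidate BRET50 values 0.5 ... 10.0
    print(fit_saturation(ratios, signals, grid))
```

In this picture, a mutation that lowers the plateau without shifting the half-saturation point would appear as a reduced BRETmax at an unchanged BRET50, which is the pattern described for the A/B box variants in (D).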
Table 1. Statistics for structure calculation of the U2AF2RRM2/FUBP1N-box chimera, related to Figures 3H and S3K, PDB: 8P25 (a)

Experimental restraints
  Distance restraints
    Total NOE                                             2,147
    Short range, |i–j| ≤ 1                                1,047
    Medium range, 1 < |i–j| < 5                           392
    Long range, |i–j| ≥ 5                                 708
  Dihedral angle restraints (from TALOS)
    Φ                                                     82
    Ψ                                                     86
Structure statistics
  RMSD from experimental restraints (mean and SD)
    Distance restraints (Å), no violation > 0.5 Å         0.013 ± 0.007
    Dihedral angle restraints (°, no violation > 0.5°)    0.19 ± 0.04
  Deviations from idealized geometry
    Bond lengths (Å)                                      0.004 ± 0.0001
    Bond angles (°)                                       0.60 ± 0.01
    Impropers (°)                                         1.31 ± 0.04
  Average pairwise coordinate RMSD (Å)
    Backbone                                              0.92 ± 0.30
    Heavy atoms                                           1.41 ± 0.22

(a) Pairwise coordinate root-mean-square deviation (RMSD) was calculated for the 10 lowest-energy
structures (regions 250–336 in U2AF2RRM2 and 31–43 in FUBP1N-box) after water refinement. Ramachandran
plot: 93.1%, 6.1%, 0.3%, and 0.4% of residues (regions 250–336 in U2AF2RRM2 and 31–43 in FUBP1N-box) are
found in the most favored, additionally allowed, generously allowed, and disallowed regions.

Interestingly, both FUBP1 N-box-RRM interfaces show only limited interdigitation of the hydrophobic side
chains, consistent with the modest binding affinity in the micromolar range.
In a recent survey of The Cancer Genome Atlas (TCGA), FUBP1 was noted for its particularly high rate of
non-synonymous mutations in low-grade gliomas.1 To learn about the mechanistic impact of such mutations,
we systematically searched cancer mutation databases and identified 26 disease-related single-nucleotide
variants (SNVs) within the FUBP1 N-box (Figure S4B). Five candidate mutations (A38D, A43E, K44R, I45F,
and G47C) were selected by considering the magnitude of chemical shift changes occurring in the NMR
titration of FUBP1N-box with U2AF2RRM2 (Figures S3B–S3D and S4B). In addition, we included L35V, which
has been shown to weaken the FUBP1-PUF60 interaction.61 NMR analysis revealed that A38D strongly impairs
U2AF2 binding (Figures 3I and S4C–S4G). This is consistent with our structure in which A38 forms the core
of the hydrophobic binding interface between FUBP1 N-box and U2AF2. A bulkier negatively charged side
chain in this position is expected to introduce steric and electrostatic repulsion at the binding
interface. Residue A38 in FUBP1 was also required for binding to PUF60 in a mutational study,61 whereas
L35V, which also affected the FUBP1-PUF60 interaction in that study, did not impair the interaction of
FUBP1 with U2AF2 RRM2 (Figures S4C and S4D). A significant weakening of the U2AF2-FUBP1 interaction by
A38D in the full-length context was also confirmed in cells using BRET (Figures 3C, 3J, and S2M). Here,
some residual binding between FUBP1A38D and U2AF2 was observed, probably because both proteins remain in
proximity through binding to the same pre-mRNAs. As expected, A38D does not affect FUBP1-SF1 binding,
which occurs via the A/B boxes (Figures 3C, S4H, and S4I). In summary, our experiments demonstrate that
FUBP1 interacts directly with U2AF2 and SF1 via its N-terminal N-box and C-terminal A/B boxes,
respectively. The former interaction is severely impaired by a cancer-associated mutation in FUBP1.
FUBP1 promotes U2AF2 binding to 3′ splice sites
To investigate the impact of FUBP1 on E complex formation, we monitored U2AF2 binding to RNA using
in vitro iCLIP.28 To this end, we designed a pool of short RNA transcripts (182 nt) representing ~2,000
natural 3′ splice sites from human transcripts, which we mixed with recombinant U2AF2RRM12 (see STAR
Methods). Remarkably, addition of recombinant full-length FUBP1 (FUBP1FL) results in stronger binding of
U2AF2RRM12 to virtually all 3′ splice sites in the transcript pool (Figures 4A, 4B, and S5A–S5C;
Table S1). The in vivo pattern of U2AF2 binding can thereby be reproduced in vitro in the presence of
full-length FUBP1 (Figure 4C). The widespread effects are in contrast to those of our previous findings
using in vitro-translated FUBP1, which affected only a few U2AF2 binding sites.28 Hence, our updated
experiments indicate that FUBP1 acts globally to stabilize U2AF2 binding. We find that this effect is
dependent on FUBP1 concentration and is directly linked to the number of FUBP1 binding motifs upstream of
the BP (Figure 4D). To confirm these findings in longer transcripts, we repeated the experiment with a
pool of eight in vitro transcripts (2.0–5.7 kb; Figures S5D and S5E; Table S1). Indeed, addition of
recombinant full-length FUBP1 increases the strength of U2AF2RRM12 binding at 3′ splice sites
(Figures 4E and S5F) and thereby reproduces the in vivo binding pattern of U2AF2 (Figure 4F). Notably,
this effect is considerably reduced with FUBP1ΔN (impaired U2AF2 interaction), and it is completely
abolished with FUBP1N74 (lacking KH domains). This highlights the importance of the N-box in FUBP1 for
directly interacting with U2AF2 as well as of FUBP1's RNA binding for the stabilization of U2AF2
(Figures 4F and S5F). Together, this indicates that the interaction of FUBP1 with both pre-mRNA and U2AF2
globally promotes U2AF2 binding at the 3′ splice site during early spliceosomal assembly.
FUBP1 is critical for the splicing of long introns
To investigate the impact of FUBP1 on splicing, we generated a FUBP1 knockout (KO) RPE1 cell line using
CRISPR-Cas9 genome engineering (Figures 5A and S5G) and performed RNA-seq. MYC gene expression was
unaltered, suggesting that it is not controlled by FUBP1 in RPE1 cells (Figure S5H). Next, we examined
transcriptome-wide splicing and found 1,041 significant splicing changes, including 399 cassette exons
(Figure 5B; Tables S1 and S3). Consistent with a role in splice site recognition, FUBP1 KO preferentially
leads to exon skipping (276 [69%] with delta percent spliced in [ΔPSI] < −0.1).
Figure 4. FUBP1 stabilizes U2AF2 binding at 3′ splice sites in vitro
(A) Overview of FUBP1 protein variants used in in vitro iCLIP experiments.
(B) Scatterplot of in vitro iCLIP signal in U2AF2 binding sites of U2AF2RRM12 alone and upon addition of full-length FUBP1 on a pool of 1,998 in vitro transcripts.
(C) Genome browser view of LARP4 mRNA displaying in vivo iCLIP for FUBP1 and U2AF2 and in vitro iCLIP on the respective in vitro transcript for U2AF2 alone and after addition of full-length FUBP1.
(D) Number of FUBP1 binding motifs upstream of the BP ([−100 nt; −26 nt]) in relation to the log2-transformed fold change of U2AF2RRM12 binding upon addition of full-length FUBP1 for 1,504 3′ splice sites in the in vitro transcripts.
(E) Metaprofile of U2AF2 binding at 3′ splice sites from in vitro iCLIP with long in vitro transcripts28 and U2AF2RRM12 alone and after addition of FUBP1FL, FUBP1N74, or FUBP1ΔN. iCLIP signals were normalized by spike-in and averaged per nucleotide over all introns (n = 21).
(F) Genome browser view of C4BPB mRNA displaying in vivo iCLIP for FUBP1 and U2AF2 and in vitro iCLIP for U2AF2RRM12 alone and after addition of FUBP1FL, FUBP1N74, or FUBP1ΔN.
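The metaprofile in (E) averages spike-in-normalized signal per nucleotide over all introns. A minimal sketch of that averaging step follows. The function name, the use of a single scalar spike-in factor per sample, and the toy counts are assumptions for illustration; how the spike-in normalization was actually computed is described in the STAR Methods and is not reproduced here.

```python
from typing import List, Sequence


def metaprofile(signals: Sequence[Sequence[float]], spike_in_factor: float) -> List[float]:
    """Average spike-in-normalized signal per nucleotide over all introns.

    signals: one list of cross-link counts per intron, all aligned to the
             same reference position and covering the same window.
    spike_in_factor: hypothetical per-sample scaling factor (> 0).
    """
    if not signals:
        raise ValueError("need at least one intron")
    width = len(signals[0])
    if any(len(s) != width for s in signals):
        raise ValueError("all introns must cover the same window")
    if spike_in_factor <= 0:
        raise ValueError("spike-in factor must be positive")
    n = len(signals)
    return [sum(s[i] for s in signals) / spike_in_factor / n for i in range(width)]


# Hypothetical counts for three introns over a five-nucleotide window.
introns = [
    [0, 2, 4, 2, 0],
    [0, 0, 6, 0, 0],
    [0, 4, 2, 1, 0],
]
profile = metaprofile(introns, spike_in_factor=2.0)
print(profile)  # prints: [0.0, 1.0, 2.0, 0.5, 0.0]
```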
Figure 5. FUBP1 binds stronger to long introns and regulates exons flanked by long introns
(A) Western blot of FUBP1 in wild-type (WT), FUBP1-Nboxmut mutant, and FUBP1 KO RPE1 cells (Figure S5G). Vinculin acts as loading control.
(B) Minimum adjacent intron length for cassette exons more or less included upon FUBP1 KO in RPE1 cells (n = 123/276) and FUBP1 knockdown in K562 cells
(n = 30/143) compared to unchanged control exons (RPE1, n = 10,301; K562, n = 1,910). ***p < 0.001, ****p < 0.0001, n.s., not significant.
A closer inspection revealed that the fate of an exon is related to the length of the flanking introns: decreased inclusion in FUBP1 KO cells is typically observed for exons that are flanked by longer introns, compared with exons with increased or unchanged inclusion (Figure 5B, top). Most affected exons are alternative exons, but we observed the same effect for regulated constitutive exons (Figure S5I). Importantly, the effect on long introns can be recapitulated in ENCODE62,63 data on FUBP1 knockdown cells (Figure 5B, bottom). To test whether this depends on the interaction with U2AF2, we generated a FUBP1-Nboxmut mutant with a targeted deletion of A38 and neighboring amino acids in the endogenous FUBP1 gene in RPE1 cells (Figures 5A and S5G). Although overall fewer cassette exons are regulated in this mutant (n = 81), exons are predominantly skipped (n = 45), and these are flanked by longer introns (Figure S5J). Together, these data reveal that FUBP1 is important for the splicing of long introns and suggest a functional role for the N-box in this process.
To investigate whether FUBP1 mutations in tumor cells affect splicing, we analyzed data from glioma patients.1 Intriguingly, we found that skipped exons in patients with FUBP1 loss-of-function mutations have longer adjacent introns than exons dysregulated in patients harboring other splicing factor mutations (Figures 5C and S6A). The effect is also evident upon FUBP1 knockdown in the glioblastoma cell line U87MG from the same study (Figure 5C). Together, these data strongly suggest that FUBP1 plays a role in the efficient splicing of long introns, thereby affecting the inclusion of adjacent exons.
To validate the role of FUBP1 for long introns, we constructed a minigene for the alternative exon 18 in the MPDZ transcript, which is skipped upon FUBP1 KO in RPE1 cells. The minigene comprises the alternative exon with the flanking constitutive exons and intervening long introns (>2.4 kb). In vivo iCLIP data show that FUBP1 binds at both 3′ splice sites, which was confirmed in vitro by EMSA with FUBP1N-box+KH (aa 1–457; Figures S6B and S6C). We observed a marked decrease of alternative exon inclusion from the MPDZ minigene in FUBP1 KO (16% inclusion) and an intermediate effect (25%) in FUBP1-Nboxmut cells, compared with wild-type (WT) cells (31% inclusion; Figures 5D, S6B, and S6D). Upon mutation of the FUBP1 binding sites, the exon showed reduced inclusion (7%) and did not change in the FUBP1 KO. If the introns were shortened but the FUBP1 binding sites retained, the effect of FUBP1 KO or mutation was reduced, albeit still present, consistent with the notion that the intron is still perceived as long due to the presence of FUBP1 binding sites. By contrast, if the FUBP1 binding sites were also removed, exon inclusion no longer responded to FUBP1 KO or FUBP1-Nboxmut, highlighting that FUBP1 binding is specifically required for the long-intron variant.
Intriguingly, the changes at long introns are linked to FUBP1 binding. We found a substantial increase in FUBP1 binding at the 3′ splice sites of longer introns, both in absolute terms and relative to other splicing factors (Figures 5E and 5F). Differential FUBP1 binding was not observed for other exon-intron-related features, such as splice site, Py tract, and BP strength (Figures S6E–S6H). Furthermore, longer introns exhibit a marked enrichment of FUBP1 motifs upstream of the BP (Figures 5G and 5H). By contrast, random motif occurrences or splice site strength are independent of intron length (Figures S6I and S6J). Moreover, long introns were previously observed to preferentially locate to the nuclear periphery and exhibit a differential GC content architecture.64,65 Indeed, we found that the occurrence of FUBP1 binding motifs correlates with the GC content architecture (Figures S6K–S6M). Furthermore, FUBP1 binds stronger to introns located in the nuclear periphery (Figure S6N) and to splice sites of exons with differential GC content architecture (Figures S7A–S7C). Further analysis indicated that both intron length and differential GC content architecture affect FUBP1 binding (Figure S7D).
Although splicing is an ancient molecular mechanism, gene architecture and especially intron length are subject to substantial evolutionary change (Figure S7E). We hypothesized that FUBP1 is present throughout Eukaryota and that lineage-specific losses or modifications of FUBP1 are accompanied by changes in average intron length. Indeed, we find overall that FUBP1 is well conserved. Although losses do occur, they are mostly observed in taxa with short introns such as protozoa and fungi (Figures 5I and 5J). Species with FUBP1 consistently harbor more FUBP1 motifs at their 3′ splice sites (Figure 5K). By contrast, U-rich motifs interspersed with C, which do not accumulate in the region of
(C) Junction length for less-included exons in RNA-seq from glioma patients with FUBP1 loss-of-function (LoF) mutations, from a FUBP1 siRNA knockdown in
U87MG cells, and from SF3B1/U2AF1/SRSF2 hotspot mutations and RBM10 LoF mutation in different cancer patient samples. ***p < 0.001.
(D) Changes of exon inclusion (n = 3) in FUBP1 WT, FUBP1-Nboxmut, and FUBP1 KO RPE1 cell lines upon intron shortening and/or removal of FUBP1 binding sites
in the MPDZ minigene (Figure S6B). Data are represented as mean ± SD. Significance was determined by a two-sided Student’s t-test with Benjamini-Hochberg
correction. Red dots represent FUBP1 binding sites. *p < 0.05, **p < 0.01, ***p < 0.001, n.s., not significant.
(E) Metaprofile showing FUBP1 cross-link events relative to branch point for various intron lengths. iCLIP signals were normalized for expression and averaged
per nucleotide over all introns.
(F) Quantification of binding signal based on area-under-the-curve (AUC) in main binding regions (see STAR Methods for details). Binding enrichment is defined as
log2 fold change of AUC over AUC of introns with length in (100 nt, 400 nt).
(G) Positional enrichment of FUBP1 binding motifs and control motifs relative to branch point and for various intron lengths. UUU+A/G/C, sets of four 4-mers
containing UUU interspersed at any position with A/G/C. NNNN, 100,000 sets of random combinations of four 4-mers. 4-mer frequencies were calculated
position-wise upstream of the BP and compared with average 4-mer frequencies in intronic control region.
(H) Number of FUBP1 binding motifs upstream of the BP ([−100 nt; −26 nt]) for various intron lengths ([500, 1,000), n = 24,564 introns; [1,000, 2,000], n = 32,251
introns; [2,000, 4,000], n = 31,734 introns; [4,000, 17,000], n = 38,692).
(I) Phylogenetic profile of FUBP1. Tree indicates taxonomic range scanned for presence of FUBP1 orthologs. Fractions of species harboring ortholog to human
FUBP1 (left) and carrying the A/B boxes (right) are shown.
(J) FUBP1 presence compared to median intron length per species. ***p < 0.001.
(K) Percentage of introns with at least one FUBP1 motif or control motifs present in 25-nt window located 25 nt upstream of the 3′ splice site.
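Panels (G), (H), and (K) rest on counting 4-mer motifs in a fixed window relative to the branch point or the 3′ splice site. The sketch below shows one way such a count could be set up. It builds the four 4-mers that contain UUU interspersed with one other base, as described in (G), and counts motif start positions in the window [−100 nt; −26 nt] upstream of the BP. The choice of A and G as interspersed bases is a placeholder rather than the published FUBP1 motif set, and the rule that a motif counts when its first nucleotide lies inside the window is an assumption. The toy sequence and all function names are hypothetical.

```python
from typing import List


def interspersed_4mers(base: str) -> List[str]:
    """Four 4-mers made of UUU with `base` inserted at each position."""
    return ["UUU"[:i] + base + "UUU"[i:] for i in range(4)]


def count_motifs(intron: str, bp_index: int, motifs: List[str],
                 start: int = -100, end: int = -26) -> int:
    """Count motif start positions in [bp_index + start, bp_index + end]."""
    lo = max(0, bp_index + start)
    hi = bp_index + end  # last allowed motif start (inclusive)
    count = 0
    for pos in range(lo, hi + 1):
        if intron[pos:pos + 4] in motifs:
            count += 1
    return count


# Placeholder motif set: UUU interspersed with A or G.
motifs = interspersed_4mers("A") + interspersed_4mers("G")

# Toy intron with the branch point A at index 120.
intron = ("C" * 50 + "UAUU" + "C" * 10 + "UUUA" + "C" * 32
          + "UUUA" + "C" * 16 + "A" + "C" * 20)
n_in_window = count_motifs(intron, bp_index=120, motifs=motifs)
print(n_in_window)  # prints: 2 (the motif at index 100 lies outside the window)
```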
FUBP1 binding (Figure 5G), are least enriched in species with FUBP1. Comparing FUBP1’s domain architecture across eukaryotic evolution, we find that C-terminal A/B boxes are an animal-specific innovation. Their appearance in evolution is associated with an overall increase in intron length in animals compared with other eukaryotes (Figure 5I). Together, this suggests that FUBP1 binding to its RNA motifs and its protein-protein interfaces play important roles in the splicing of long introns.

FUBP1 interacts with both splice sites suggesting a function in cross-intron bridging
To decipher the molecular mechanism of FUBP1 action, we developed a kinetic model of cassette exon splicing using ordinary differential equations (Figures 6A and S7F; Table S4). In line with our previous work,66 we considered a scenario for “exon definition” in which the U1 and U2 snRNPs recognize the 5′ and 3′ regions flanking an exon as functional units. The subsequent splice site pairing by U1/U2 snRNP interaction across the intron, that is, intron definition, triggers splicing catalysis, which results in either cassette exon inclusion, skipping, or intron retention in the model. We first simulated the loss of FUBP1 in a model in which FUBP1 solely acts on initial U1/U2 snRNP binding to exons (exon definition). However, our simulations argue against a pure exon definition effect, as the model cannot recapitulate the splicing changes that occur upon FUBP1 KO (Figure 6B; model 1). According to our experimental data, exons flanked by two long introns are typically skipped upon FUBP1 KO, whereas exons flanked by at least one short intron tend to show slightly increased inclusion. Surprisingly, the experimental data are more consistent with an alternative model in which FUBP1 enhances the pairing of splice sites across long (but not short) introns during intron definition. The model predicts reduced exon inclusion upon FUBP1 KO specifically for exons flanked by two long introns, whereas exons flanked by one short intron moderately increase, irrespective of whether it is located upstream or downstream (Figure 6B; model 2). These results also hold true in a modified model, in which exons are not defined as functional units, and intron splicing solely requires U1 and U2 binding to flanking splice sites (Figure S7G, “intron definition model”). Taken together, the experimental observations are consistent with the kinetic model, which assumes that FUBP1 differentially affects long introns by promoting splice site pairing and the formation of catalytically active spliceosomes across long introns.
To test this prediction, we investigated the cross-linking of FUBP1 to snRNAs, indicative of its presence at different stages of splicing. First, FUBP1 showed substantial cross-linking to U2 snRNA, consistent with FUBP1 binding upstream of the BP where the U2 snRNP replaces SF1, indicating that FUBP1 is present during A-complex formation (Figure 6C). More importantly, FUBP1 also cross-links to U1 snRNA, which binds to the 5′ splice site, suggesting that FUBP1 is present during the bridging of the 3′ and 5′ splice sites, either during initial exon definition or also at later stages of intron definition. The latter is further supported by the cross-linking of FUBP1 to U6 snRNA, which replaces U1 snRNA at the 5′ splice site prior to lariat formation (Figure 6C). Hence, FUBP1 might be involved in intron bridging throughout the splicing cycle. We next searched our iCLIP datasets for evidence that FUBP1 is still bound in the spliceosomal C complex when the lariat has formed after the first splicing reaction. It has been shown that reads from the lariat truncate at the position where the 5′ splice site is covalently linked to the BP and is detected as a single-nucleotide-wide peak at the 5′ splice site (Figure 6D).68,69 Indeed, we observed a strong peak in read truncations for FUBP1 at the 5′ splice site, whereas there was almost no signal for the other splicing factors tested (Figure 6E). This suggests that FUBP1 is present from the early stages of spliceosome assembly until at least the first catalytic step of the splicing reaction.
To further investigate whether FUBP1 is actively involved in splice-site bridging, we searched available binary protein-protein interaction data from yeast two-hybrid screens.67 These data confirmed that FUBP1 binds to U2AF2 (Figure 6F). We also found evidence for FUBP1 interacting with several U1-associated proteins (SNRPA, SNRPC, TIAL1, and PRPF40B) as well
Figure 6. FUBP1 interacts with U1 snRNP components
(A) Kinetic model of FUBP1’s effects on alternative splicing quantitatively describes steady-state abundance of splice products for a three-exon gene in control
and FUBP1 KO conditions. Two model variants were analyzed, in which FUBP1 affects the initial exon definition step near long introns (model 1), and the
subsequent splicing reaction, promoting the excision of long introns (model 2). See STAR Methods for details.
(B) Simulated splicing changes upon FUBP1 KO reflect transcriptome-wide RNA-seq data assuming that FUBP1 affects splicing catalysis (model 2). To reflect the
heterogeneity of exons in the human transcriptome, kinetic parameters of the model were chosen at random, giving rise to an ensemble of 10,000 in silico exons.
FUBP1 KO was simulated for each in silico exon, assuming that FUBP1 either enhances exon definition (model 1) or the rate of splicing (model 2) for long (but not
short) introns (see STAR Methods for details). In the data, significantly regulated cassette exons were classified based on flanking intron lengths (<400 nt = short,
≥400 nt = long).
(C) Fraction of total reads mapping to snRNAs using custom reference consisting of snRNAs (n = 10), tRNAs (n = 22), and rRNAs (n = 6).
(D) Schematic description of three-way junction of intron lariats. cDNAs can truncate not at the original protein-RNA interaction site but rather at the three-way
junction. These cDNAs either start from the intron end and truncate at the BP or, alternatively, start downstream of the 5′ splice site and truncate at the first
nucleotide of the intron.
(E) Metaprofiles showing cross-link events of FUBP1, U2AF2, SF3B1, SF1, and PTBP1 relative to the 5′ splice site. iCLIP signals were normalized for expression
and averaged per nucleotide.
(F) Comprehensive interaction network of FUBP1 based on NMR, BRET, and published yeast two-hybrid data.67
(G) BRET measurements between FUBP1 and subunits of the U1 snRNP complex as well as U1 snRNP-associated proteins along with positive and negative
control pairs. Biological replicates are shown. Error bars represent SD of technical triplicates.
(H) NMR titration of FUBP1A/B with SNRPBP-rich up to a molar ratio of 1:2. Significant chemical shift changes are highlighted by boxes.
(I) Percent-spliced-in (PSI) of MPDZ minigene upon transfection of WT and FUBP1 KO RPE1 cells with different FUBP1 constructs. Data are represented as
mean ± SD. Significance was determined by a two-sided Student’s t-test with Benjamini-Hochberg correction. *p < 0.05, **p < 0.01, ***p < 0.001, n.s., not
significant.
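The legends above report significance from a two-sided Student’s t-test with Benjamini-Hochberg correction. As a minimal sketch of the correction step only, the function below converts a list of raw p-values into Benjamini-Hochberg adjusted p-values. The function name and the example p-values are hypothetical, and the sketch does not reproduce the statistical workflow described in the STAR Methods.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
print([round(q, 4) for q in adjusted])  # prints: [0.04, 0.0533, 0.0533, 0.2]
```

In this toy case the second and third comparisons share the same adjusted value because the procedure enforces monotonicity across ranks.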
Figure 7. FUBP1 acts at multiple steps of early spliceosomal assembly
(A) The multiple roles of FUBP1 during spliceosomal complex assembly at the 3′ splice site.
(B) FUBP1 directly interacts with U2AF2, SF1, and additional U1/U2 snRNP components via distinct disordered interaction interfaces.
as with SNRPB, which is a member of the Sm protein ring in all snRNPs (Figure 6F). These and further interactions of FUBP1 with U1-associated proteins (TCERG1 and KHDRBS1) were confirmed using the BRET assay and/or NMR (Figures 6F, 6G, and S7H). Interestingly, several of the U1 snRNP-associated proteins harbor proline-rich regions, which potentially interact with the A/B boxes in FUBP1, similar to the FUBP1-SF1 interaction discussed above. Indeed, we observed significant changes in the NMR spectrum of FUBP1A/B upon the addition of a proline-rich peptide from SNRPB (Figure 6H), which were less pronounced with SNRPA and PRPF40B derivatives (Figures S7I and S7J). This correlates well with the proline-rich region in SNRPB being much larger than in SNRPA or PRPF40B and thus avidity effects perhaps enhance the binding.
Finally, to confirm the importance of the FUBP1 A/B boxes and their role in splice-site bridging, we performed a complementation assay by expressing full-length GFP-FUBP1 and different mutants in both WT and FUBP1 KO RPE1 cells. Effects on splicing were monitored using the co-transfected MPDZ minigene. As expected, GFP-FUBP1 complements the FUBP1 KO cells and rescues MPDZ exon inclusion close to WT levels (Figures 6I, S7K, and S7L). Importantly, expression of GFP-FUBP1W586,615R (mutations in the A/B boxes) or FUBP1ΔC (complete deletion of the C terminus) impairs complementation in FUBP1 KO cells. The same was also observed if the interaction with U2AF2 is perturbed by expressing either FUBP1A38D (N-box mutation) or FUBP1ΔN (complete deletion of the N terminus). Overall, these data demonstrate that both the A/B boxes and the N-box in FUBP1, which mediate the interactions with factors at the 5′ and 3′ splice sites, respectively, are functionally relevant for splicing.

DISCUSSION

FUBP1 is a general component of 3′ splice site definition
The recognition and pairing of splice sites, especially for the many long introns in the human transcriptome, are not well understood. In this study, we identified FUBP1 as a key component in 3′ splice site definition. We found that FUBP1 recognizes clustered U-rich elements interspersed by A or G that are present at virtually all 3′ splice sites and are most abundant for longer introns. Until now, four conserved intron-defining sequence motifs were known: the 5′ splice site motif, the BP sequence, the Py tract, and the 3′ splice site motif.6 We propose the FUBP1 binding motif as a sequence signature that is relevant for spliceosomal assembly at long introns, which represent >80% of all human introns. Consistent with such a general role in splicing, FUBP1 has been detected in purified spliceosomes using mass spectrometry.70–72
We show that the four KH domains of FUBP1 recognize clustered arrays of binding motifs upstream of the BP. Multivalent interactions enhance binding affinity by avidity and enable the recognition of cis-elements in RNAs of variable length by combining individual KH-RNA motif interactions where multiple clustered RNA motifs may be separated by variable nucleotide linkers.54 We find that the four KH domains are connected by flexible linkers, which facilitates scanning of extended RNA regions. The recognition of clustered RNA motifs by multidomain RBPs has been observed in IMP proteins and also involves four KH domains.55 This suggests that KH domains working in concert might be a common mechanism for specifically recognizing clustered RNA motifs in extended RNA regions.

FUBP1 engages in multivalent interactions with 3′ and 5′ splice site components
We characterized two interfaces in FUBP1 that mediate protein-protein interactions: the N-box and the A/B boxes that are embedded in the intrinsically disordered N- and C-terminal regions of FUBP1, respectively. The N-box has been shown to interact with the RRM domain of PUF60 for regulation of transcription.33,73,74 Here, we found that the FUBP1 N-box also binds to the RRM2 domain of U2AF2 and thereby mediates a functional interaction during pre-mRNA splicing. The N-box binds RRM2 opposite its RNA-binding surface, and thus, RNA
binding and FUBP1 binding do not compete. Notably, we have previously shown that the U2AF2 tandem domains adopt closed conformations and that RNA binding selects open arrangements.15,29,75 Thus, binding of FUBP1 to the helical face of U2AF2 RRM2 might enhance RNA binding not only by stabilizing U2AF2 on the RNA but also by shifting the tandem RRM arrangements of U2AF2.
The A/B boxes of FUBP1 interact with intrinsically disordered proline-rich sequences within several U1 and U2 snRNP-associated proteins. This matches observations on the A/B boxes of the FUBP1 ortholog PSI in Drosophila melanogaster, which have been shown to bind to a proline-rich region in snRNP-U1-70K.59 However, this region is not conserved in the human ortholog SNRNP70, and our BRET studies detected no such interaction between FUBP1 and SNRNP70. In general, linear motifs in proline-rich regions are recognized by structured regions such as WW or SH3 domains.76 These interactions are generally weak but often enhanced by multivalent interactions.77–81 Interestingly, the A/B boxes are unique to the FUBP family and appear to be unstructured regions in the ortholog PSI.59 It will be interesting to learn how prevalent such an atypical mode of proline-rich sequence binding is and how it impacts cellular function.

FUBP1 contributes to spliceosome formation and guides the splicing of long introns
One important question is why FUBP1 is particularly relevant for long introns. Clearly, the splicing of long introns is difficult to achieve. For instance, it has been reported that exons flanking long introns are less included,82,83 and that the splice sites of longer introns are stronger.84,85 Consequently, longer introns require more complex regulation, such as the switch from initial exon definition to cross-intron spliceosomal complexes.84,86 During exon definition, splice sites are recognized and paired across the exon, which is thereby defined as a functional unit. During the subsequent switch to intron definition, the complex shifts to a cross-intron pairing of splice sites (Figure 7). Our data suggest that FUBP1 acts at both steps. We propose that during exon definition, FUBP1 stabilizes U2AF2 and SF1 at the 3′ splice site. FUBP1 can thus strengthen the initial recognition of 3′ splice sites via its multivalent interactions with U2AF2, SF1, and pre-mRNA. The stabilization by FUBP1 and its interactions with the U1 snRNP across the exon might thus contribute to splice site recognition during exon definition.86,87
The interactions between FUBP1 and U1 snRNP components might also be relevant after the switch from exon definition to cross-intron pairing. Consistent with this model, we found that FUBP1 is still present at splice sites until the lariat is formed. In fact, FUBP1 forms cross-links to the U6 snRNA, which replaces U1 snRNA at the 5′ splice site. This indicates a role for FUBP1 in intron bridging during spliceosomal B-complex formation, particularly for long introns, as our experimental data and kinetic modeling suggest.
Several mechanisms and contributions to splice site bridging have been suggested, for example, the interactions between U1 and U2 snRNP proteins and RNA components88–90 and the U2AF-associated RNA helicase UAP56.91 It is conceivable that multiple contact sites act in concert to generate sufficient avidity to bring the splice sites together. Our data suggest that FUBP1, through multivalent interactions with pre-mRNA, proteins, and snRNAs located at the 5′ and 3′ splice sites, adds to these contacts throughout the splicing cycle. This is most pertinent for long introns harboring multiple FUBP1 cis-regulatory motifs.
In conclusion, we identify FUBP1 as a general splicing factor that ubiquitously binds at 3′ splice sites by means of a hitherto unknown cis-regulatory RNA sequence motif. The binding of FUBP1 and its interactions with multiple U1 and U2 snRNP components are pertinent to the efficient splicing of long introns.

Limitations of the study
Uridines are particularly prone to UV cross-linking, which can introduce bias to motif identification by iCLIP. However, we observed similar motifs using methods that do not involve UV cross-linking (NMR spectroscopy, ITC, and EMSA); therefore, we are confident that our conclusions in this regard are valid. Upon depletion of FUBP1 in our KO or knockdown cell lines, other factors (such as the close paralog KHSRP) might, to some extent, take on the role of FUBP1. Together with cellular quality control mechanisms that degrade mis-spliced transcripts, this might reduce the effects of FUBP1 perturbation that we observed in our RNA-seq analysis. We might clarify such effects in the future by combining acute depletion of FUBP1 by means of degron tags with analysis of nascent RNA. U2AF2 RRM2 and FUBP1 N-box interact with weak affinity in the micromolar range. Although it is likely that the simultaneous binding of U2AF2 and FUBP1 to the RNA further stabilizes this interaction, we cannot exclude the involvement of other factors. In general, introns may be characterized by a multitude of features, among which length is just one. For example, intron length is known to correlate with elevated differential GC content and overall lower intron and exon GC content.65 In addition, genes with longer introns have been shown to preferentially localize to the nuclear periphery,64 and their transcripts therefore might interact with different splicing factors than for genes at the nuclear center. The question of whether these attributes rather complement each other or are causally related remains to be answered.

STAR METHODS

Detailed methods are provided in the online version of this paper and include the following:

• KEY RESOURCES TABLE
• RESOURCE AVAILABILITY
    ◦ Lead contact
    ◦ Materials availability
    ◦ Data and code availability
• EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS
    ◦ RPE1 cell lines and culture conditions
    ◦ HeLa cell line and culture conditions
    ◦ HEK cell line and culture conditions
    ◦ Recombinant protein expression
• METHOD DETAILS
    ◦ Establishing FUBP1 KO/Nboxmut cell lines
    ◦ Immunoblotting
    ◦ RPE1 RNA-seq
    ◦ HeLa RNA-seq
    ◦ Semi-quantitative RT-PCR
    ◦ In vivo iCLIP
    ◦ In vitro iCLIP
    ◦ Protein expression and purification
    ◦ NMR spectroscopy
    ◦ In vitro binding assays
    ◦ BRET
• QUANTIFICATION AND STATISTICAL ANALYSIS
    ◦ Preprocessing of RNA-seq data
    ◦ Preprocessing of in vivo iCLIP data
    ◦ Metaprofiles for in vivo iCLIP data
    ◦ iCLIP binding site definition (peak calling)
    ◦ Saturation analysis
    ◦ Motif enrichment for in vivo iCLIP
    ◦ Motif enrichment upstream of branch points
    ◦ Abundance of FUBP1 motif at 3′ splice sites
    ◦ Analysis of in vitro iCLIP data
    ◦ Intron length analyses of RNA-seq data
    ◦ ENCODE data analysis
    ◦ Splicing changes upon FUBP1 LoF mutations
    ◦ Mutations in FUBP1 in cancer patients
    ◦ Scoring of splice site features
    ◦ Evolutionary analyses
    ◦ Analysis of RBP crosslinking to snRNAs
    ◦ Subnuclear distribution of FUBP1-bound genes
    ◦ Mathematical modeling

SUPPLEMENTAL INFORMATION

Supplemental information can be found online at https://doi.org/10.1016/j.molcel.2023.07.002.

ACKNOWLEDGMENTS

We thank all the members of the Luck, Sattler, and König labs for their help and discussion. We thank Malgorzata Rogalska and Juan Valcárcel for discussions and comments on the manuscript, Philipp Trepte and the Wanker group for sharing protocols and reagents and for help in setting up BRET assays, Christian Schäfer for help with BRET assays, Eric Schumbera for help with BRET data processing, Fridolin Kielisch for help with statistical analyses, Mario Keller for bioinformatics advice, André Mourão for SNRPBP-rich plasmid, Sam Asami and Gerd Gemmecker for support with NMR experiments, Manuel Kaulich for reagents, and Chris Smith and Jernej Ule for PTBP1-RB40 antibody and resequencing. We thank Adrian Neal for editing and commenting on the manuscript. We thank the Core Facilities at IMB, in particular Protein Production, Microscopy, Bioinformatics, Genomics, and Flow Cytometry.
We acknowledge IMB Genomics Core Facility and its NextSeq 500 sequencer (funded by the Deutsche Forschungsgemeinschaft [DFG, German Research Foundation] INST 247/870-1 FUGG) and access to NMR spectrometers at Bavarian NMR Center. This work was supported by DFG grants to K.L. (LU 2568/1-1; SFB1551 Project no. 464588647), J.K. (SPP1935 Project no. 273941853, KO4566/2-1, SFB1551 Project No. 464588647, TRR 319 Project no. 439669440, and GRK2526/1 Project no. 407023052), K.Z. (SPP1935 Project no. 273941853), S.L. (LE 3473/2–3), and M.S. (SPP1935 Project no. 273941853, SA823/10-1, and SFB1035 Project no. 201302640). C.H. acknowledges the Fonds der Chemischen Industrie for Kekulé fellowship, and S.M.-L. acknowledges EU Horizon 2020 Research and Innovation program under the Marie Skłodowska-Curie grant agreement No. 792692. J.S. acknowledges a PhD stipend from IMB’s collaborative research initiative.

AUTHOR CONTRIBUTIONS

S.E., C.B., A. Busch, and A.D.L. performed the bioinformatic analyses. C.H., H.-S.K., and S.M.-L. performed the structural, biophysical, and biochemical experiments and analyses. M.M. Mulorz, A. Buchbender, F.X.R.S., L.L.A., H.H., K.T., and M.M. Möckel performed the functional genomics, in vitro iCLIP, and minigene reporter experiments. D.H., J.S., and M.W. performed the BRET experiments. P.K. and S.L. performed the mathematical modeling. I.E. performed the evolutionary analysis. S.E., C.H., M.M. Mulorz, K.Z., K.L., M.S., and J.K. designed the study and wrote the manuscript. All authors read and commented on the manuscript.

DECLARATION OF INTERESTS

The authors declare no competing interests.

Received: January 4, 2023
Revised: May 19, 2023
Accepted: July 3, 2023
Published: July 27, 2023

REFERENCES

1. Seiler, M., Peng, S., Agrawal, A.A., Palacino, J., Teng, T., Zhu, P., Smith, P.G., Cancer Genome Atlas Research Network, Buonamici, S., and Yu, L. (2018). Somatic mutational landscape of splicing factor genes and their functional consequences across 33 cancer types. Cell Rep. 23, 282–296.e4. https://doi.org/10.1016/j.celrep.2018.01.088.
2. Bonnal, S.C., López-Oreja, I., and Valcárcel, J. (2020). Roles and mechanisms of alternative splicing in cancer – implications for care. Nat. Rev. Clin. Oncol. 17, 457–474. https://doi.org/10.1038/s41571-020-0350-x.
3. Gebauer, F., Schwarzl, T., Valcárcel, J., and Hentze, M.W. (2021). RNA-binding proteins in human genetic disease. Nat. Rev. Genet. 22, 185–198. https://doi.org/10.1038/s41576-020-00302-y.
4. Shi, Y. (2017). Mechanistic insights into precursor messenger RNA splicing by the spliceosome. Nat. Rev. Mol. Cell Biol. 18, 655–670. https://doi.org/10.1038/nrm.2017.86.
5. Wilkinson, M.E., Charenton, C., and Nagai, K. (2020). RNA splicing by the spliceosome. Annu. Rev. Biochem. 89, 359–388. https://doi.org/10.1146/annurev-biochem-091719-064225.
6. Wahl, M.C., Will, C.L., and Lührmann, R. (2009). The spliceosome: design principles of a dynamic RNP machine. Cell 136, 701–718. https://doi.org/10.1016/j.cell.2009.02.009.
7. Papasaikas, P., and Valcárcel, J. (2016). The spliceosome: the ultimate RNA chaperone and sculptor. Trends Biochem. Sci. 41, 33–45. https://doi.org/10.1016/j.tibs.2015.11.003.
8. Berglund, J.A., Abovich, N., and Rosbash, M. (1998). A cooperative interaction between U2AF65 and mBBP/SF1 facilitates branchpoint region recognition. Genes Dev. 12, 858–867. https://doi.org/10.1101/gad.12.6.858.
9. Liu, Z., Luyten, I., Bottomley, M.J., Messias, A.C., Houngninou-Molango, S., Sprangers, R., Zanier, K., Krämer, A., and Sattler, M. (2001). Structural basis for recognition of the intron branch site RNA by splicing factor 1. Science 294, 1098–1102. https://doi.org/10.1126/science.1064719.
10. Selenko, P., Gregorovic, G., Sprangers, R., Stier, G., Rhani, Z., Krämer, A., and Sattler, M. (2003). Structural basis for the molecular recognition between human splicing factors U2AF65 and SF1/mBBP. Mol. Cell 11, 965–976. https://doi.org/10.1016/s1097-2765(03)00115-1.
11. Kielkopf, C.L., Rodionova, N.A., Green, M.R., and Burley, S.K. (2001). A novel peptide recognition mode revealed by the X-ray structure of a core U2AF35/U2AF65 heterodimer. Cell 106, 595–605. https://doi.org/10.1016/s0092-8674(01)00480-9.
12. Wu, S., Romfo, C.M., Nilsen, T.W., and Green, M.R. (1999). Functional recognition of the 3′ splice site AG by the splicing factor U2AF35. Nature 402, 832–835. https://doi.org/10.1038/45590.
Molecular Cell 83, 2653–2672, August 3, 2023 2667
13. Merendino, L., Guth, S., Bilbao, D., Martínez, C., and Valcárcel, J. (1999). Inhibition of msl-2 splicing by Sex-lethal reveals interaction between U2AF35 and the 3′ splice site AG. Nature 402, 838–841. https://doi.org/10.1038/45602.
14. Agrawal, A.A., Salsi, E., Chatrikhi, R., Henderson, S., Jenkins, J.L., Green, M.R., Ermolenko, D.N., and Kielkopf, C.L. (2016). An extended U2AF(65)–RNA-binding domain recognizes the 3′ splice site signal. Nat. Commun. 7, 10950. https://doi.org/10.1038/ncomms10950.
15. Mackereth, C.D., Madl, T., Bonnal, S., Simon, B., Zanier, K., Gasch, A., Rybin, V., Valcárcel, J., and Sattler, M. (2011). Multi-domain conformational selection underlies pre-mRNA splicing regulation by U2AF. Nature 475, 408–411. https://doi.org/10.1038/nature10171.
16. Zamore, P.D., and Green, M.R. (1989). Identification, purification, and biochemical characterization of U2 small nuclear ribonucleoprotein auxiliary factor. Proc. Natl. Acad. Sci. USA 86, 9243–9247. https://doi.org/10.1073/pnas.86.23.9243.
17. Berglund, J.A., Chua, K., Abovich, N., Reed, R., and Rosbash, M. (1997). The splicing factor BBP interacts specifically with the pre-mRNA branchpoint sequence UACUAAC. Cell 89, 781–787. https://doi.org/10.1016/s0092-8674(00)80261-5.
18. Crisci, A., Raleff, F., Bagdiul, I., Raabe, M., Urlaub, H., Rain, J.-C., and Krämer, A. (2015). Mammalian splicing factor SF1 interacts with SURP domains of U2 snRNP-associated proteins. Nucleic Acids Res. 43, 10456–10473. https://doi.org/10.1093/nar/gkv952.
19. Wahl, M.C., and Lührmann, R. (2015). SnapShot: spliceosome dynamics I. Cell 161, 1474–1474.e1. https://doi.org/10.1016/j.cell.2015.05.050.
20. Tholen, J., and Galej, W.P. (2022). Structural studies of the spliceosome: bridging the gaps. Curr. Opin. Struct. Biol. 77, 102461. https://doi.org/10.1016/j.sbi.2022.102461.
21. Ule, J., and Blencowe, B.J. (2019). Alternative splicing regulatory networks: functions, mechanisms, and evolution. Mol. Cell 76, 329–345. https://doi.org/10.1016/j.molcel.2019.09.017.
22. Zuo, P., and Maniatis, T. (1996). The splicing factor U2AF35 mediates critical protein-protein interactions in constitutive and enhancer-dependent splicing. Genes Dev. 10, 1356–1368. https://doi.org/10.1101/gad.10.11.1356.
23. Saulière, J., Sureau, A., Expert-Bezançon, A., and Marie, J. (2006). The polypyrimidine tract binding protein (PTB) represses splicing of exon 6B from the beta-tropomyosin pre-mRNA by directly interfering with the binding of the U2AF65 subunit. Mol. Cell. Biol. 26, 8755–8769. https://doi.org/10.1128/MCB.00893-06.
24. Soares, L.M.M., Zanier, K., Mackereth, C., Sattler, M., and Valcárcel, J. (2006). Intron removal requires proofreading of U2AF/3′ splice site recognition by DEK. Science 312, 1961–1965. https://doi.org/10.1126/science.1128659.
25. Warf, M.B., Diegel, J.V., von Hippel, P.H., and Berglund, J.A. (2009). The protein factors MBNL1 and U2AF65 bind alternative RNA structures to regulate splicing. Proc. Natl. Acad. Sci. USA 106, 9203–9208. https://doi.org/10.1073/pnas.0900342106.
26. Tavanez, J.P., Madl, T., Kooshapur, H., Sattler, M., and Valcárcel, J. (2012). hnRNP A1 proofreads 3′ splice site recognition by U2AF. Mol. Cell 45, 314–329. https://doi.org/10.1016/j.molcel.2011.11.033.
27. Zarnack, K., König, J., Tajnik, M., Martincorena, I., Eustermann, S., Stévant, I., Reyes, A., Anders, S., Luscombe, N.M., and Ule, J. (2013). Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of Alu elements. Cell 152, 453–466. https://doi.org/10.1016/j.cell.2012.12.023.
28. Sutandy, F.X.R., Ebersberger, S., Huang, L., Busch, A., Bach, M., Kang, H.-S., Fallmann, J., Maticzka, D., Backofen, R., Stadler, P.F., et al. (2018). In vitro iCLIP-based modeling uncovers how the splicing factor U2AF2 relies on regulation by cofactors. Genome Res. 28, 699–713. https://doi.org/10.1101/gr.229757.117.
29. Voith von Voithenberg, L., Sánchez-Rico, C., Kang, H.-S., Madl, T., Zanier, K., Barth, A., Warner, L.R., Sattler, M., and Lamb, D.C. (2016). Recognition of the 3′ splice site RNA by the U2AF heterodimer involves a dynamic population shift. Proc. Natl. Acad. Sci. USA 113, E7169–E7175. https://doi.org/10.1073/pnas.1605873113.
30. Kang, H.-S., Sánchez-Rico, C., Ebersberger, S., Sutandy, F.X.R., Busch, A., Welte, T., Stehle, R., Hipp, C., Schulz, L., Buchbender, A., et al. (2020). An autoinhibitory intramolecular interaction proof-reads RNA recognition by the essential splicing factor U2AF2. Proc. Natl. Acad. Sci. USA 117, 7140–7149. https://doi.org/10.1073/pnas.1913483117.
31. Debaize, L., and Troadec, M.-B. (2019). The master regulator FUBP1: its emerging role in normal cell function and malignant development. Cell. Mol. Life Sci. 76, 259–281. https://doi.org/10.1007/s00018-018-2933-6.
32. Duncan, R., Bazar, L., Michelotti, G., Tomonaga, T., Krutzsch, H., Avigan, M., and Levens, D. (1994). A sequence-specific, single-strand binding protein activates the far upstream element of c-myc and defines a new DNA-binding motif. Genes Dev. 8, 465–480. https://doi.org/10.1101/gad.8.4.465.
33. Liu, J., Kouzine, F., Nie, Z., Chung, H.-J., Elisha-Feil, Z., Weber, A., Zhao, K., and Levens, D. (2006). The FUSE/FBP/FIR/TFIIH system is a molecular machine programming a pulse of c-myc expression. EMBO J. 25, 2119–2130. https://doi.org/10.1038/sj.emboj.7601101.
34. Cukier, C.D., Hollingworth, D., Martin, S.R., Kelly, G., Díaz-Moreno, I., and Ramos, A. (2010). Molecular basis of FIR-mediated c-myc transcriptional control. Nat. Struct. Mol. Biol. 17, 1058–1064. https://doi.org/10.1038/nsmb.1883.
35. Li, H., Wang, Z., Zhou, X., Cheng, Y., Xie, Z., Manley, J.L., and Feng, Y. (2013). Far upstream element-binding protein 1 and RNA secondary structure both mediate second-step splicing repression. Proc. Natl. Acad. Sci. USA 110, E2687–E2695. https://doi.org/10.1073/pnas.1310607110.
36. Hwang, I., Cao, D., Na, Y., Kim, D.-Y., Zhang, T., Yao, J., Oh, H., Hu, J., Zheng, H., Yao, Y., and Paik, J. (2018). Far upstream element-binding protein 1 regulates LSD1 alternative splicing to promote terminal differentiation of neural progenitors. Stem Cell Reports 10, 1208–1221. https://doi.org/10.1016/j.stemcr.2018.02.013.
37. Jacob, A.G., Singh, R.K., Mohammad, F., Bebee, T.W., and Chandler, D.S. (2014). The splicing factor FUBP1 is required for the efficient splicing of oncogene MDM2 pre-mRNA. J. Biol. Chem. 289, 17350–17364. https://doi.org/10.1074/jbc.M114.554717.
38. Miro, J., Laaref, A.M., Rofidal, V., Lagrafeuille, R., Hem, S., Thorel, D., Méchin, D., Mamchaoui, K., Mouly, V., Claustres, M., and Tuffery-Giraud, S. (2015). FUBP1: a new protagonist in splicing regulation of the DMD gene. Nucleic Acids Res. 43, 2378–2389. https://doi.org/10.1093/nar/gkv086.
39. Ni, X., Knapp, S., and Chaikuad, A. (2020). Comparative structural analyses and nucleotide-binding characterization of the four KH domains of FUBP1. Sci. Rep. 10, 13459. https://doi.org/10.1038/s41598-020-69832-z.
40. Wang, H., Zhang, R., Li, E., Yan, R., Ma, B., and Ma, Q. (2022). Pan-cancer transcriptome and immune infiltration analyses reveal the oncogenic role of far upstream element-binding protein 1 (FUBP1). Front. Mol. Biosci. 9, 794715. https://doi.org/10.3389/fmolb.2022.794715.
41. Elman, J.S., Ni, T.K., Mengwasser, K.E., Jin, D., Wronski, A., Elledge, S.J., and Kuperwasser, C. (2019). Identification of FUBP1 as a long tail cancer driver and widespread regulator of tumor suppressor and oncogene alternative splicing. Cell Rep. 28, 3435–3449.e5. https://doi.org/10.1016/j.celrep.2019.08.060.
42. Wang, J., Schultz, P.G., and Johnson, K.A. (2018). Mechanistic studies of a small-molecule modulator of SMN2 splicing. Proc. Natl. Acad. Sci. USA 115, E4604–E4612. https://doi.org/10.1073/pnas.1800260115.
43. König, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., Turner, D.J., Luscombe, N.M., and Ule, J. (2010). iCLIP reveals the function of
hnRNP particles in splicing at individual nucleotide resolution. Nat. Struct. Mol. Biol. 17, 909–915. https://doi.org/10.1038/nsmb.1838.
44. Buchbender, A., Mutter, H., Sutandy, F.X.R., Körtel, N., Hänel, H., Busch, A., Ebersberger, S., and König, J. (2020). Improved library preparation with the new iCLIP2 protocol. Methods 178, 33–48. https://doi.org/10.1016/j.ymeth.2019.10.003.
45. Valcárcel, J., Gaur, R.K., Singh, R., and Green, M.R. (1996). Interaction of U2AF65 RS region with pre-mRNA branch point and promotion of base pairing with U2 snRNA [corrected]. Science 273, 1706–1709. https://doi.org/10.1126/science.273.5282.1706.
46. Singh, R., Valcárcel, J., and Green, M.R. (1995). Distinct binding specificities and functions of higher eukaryotic polypyrimidine tract-binding proteins. Science 268, 1173–1176. https://doi.org/10.1126/science.7761834.
47. Sugimoto, Y., König, J., Hussain, S., Zupan, B., Curk, T., Frye, M., and Ule, J. (2012). Analysis of CLIP and iCLIP methods for nucleotide-resolution studies of protein-RNA interactions. Genome Biol. 13, R67. https://doi.org/10.1186/gb-2012-13-8-r67.
48. Gozani, O., Potashkin, J., and Reed, R. (1998). A potential role for U2AF-SAP 155 interactions in recruiting U2 snRNP to the branch site. Mol. Cell. Biol. 18, 4752–4760. https://doi.org/10.1128/MCB.18.8.4752.
49. Xue, Y., Zhou, Y., Wu, T., Zhu, T., Ji, X., Kwon, Y.-S., Zhang, C., Yeo, G., Black, D.L., Sun, H., et al. (2009). Genome-wide analysis of PTB-RNA interactions reveals a strategy used by the general splicing repressor to modulate exon inclusion or skipping. Mol. Cell 36, 996–1006. https://doi.org/10.1016/j.molcel.2009.12.003.
50. Llorian, M., Schwartz, S., Clark, T.A., Hollander, D., Tan, L.-Y., Spellman, R., Gordon, A., Schweitzer, A.C., de la Grange, P., Ast, G., and Smith, C.W.J. (2010). Position-dependent alternative splicing activity revealed by global profiling of alternative splicing events regulated by PTB. Nat. Struct. Mol. Biol. 17, 1114–1123. https://doi.org/10.1038/nsmb.1881.
51. Shao, C., Yang, B., Wu, T., Huang, J., Tang, P., Zhou, Y., Zhou, J., Qiu, J., Jiang, L., Li, H., et al. (2014). Mechanisms for U2AF to define 3′ splice sites and regulate alternative splicing in the human genome. Nat. Struct. Mol. Biol. 21, 997–1005. https://doi.org/10.1038/nsmb.2906.
52. Valverde, R., Edwards, L., and Regan, L. (2008). Structure and function of KH domains. FEBS J. 275, 2712–2726. https://doi.org/10.1111/j.1742-4658.2008.06411.x.
53. Fukumura, K., Yoshimoto, R., Sperotto, L., Kang, H.-S., Hirose, T., Inoue, K., Sattler, M., and Mayeda, A. (2021). SPF45/RBM17-dependent, but not U2AF-dependent, splicing in a distinct subset of human short introns. Nat. Commun. 12, 4910. https://doi.org/10.1038/s41467-021-24879-y.
54. Mackereth, C.D., and Sattler, M. (2012). Dynamics in multi-domain protein recognition of RNA. Curr. Opin. Struct. Biol. 22, 287–296. https://doi.org/10.1016/j.sbi.2012.03.013.
55. Schneider, T., Hung, L.-H., Aziz, M., Wilmen, A., Thaum, S., Wagner, J., Janowski, R., Müller, S., Schreiner, S., Friedhoff, P., et al. (2019). Combinatorial recognition of clustered RNA elements by the multidomain RNA-binding protein IMP3. Nat. Commun. 10, 2266. https://doi.org/10.1038/s41467-019-09769-8.
56. Siomi, H., Matunis, M.J., Michael, W.M., and Dreyfuss, G. (1993). The pre-mRNA binding K protein contains a novel evolutionarily conserved motif. Nucleic Acids Res. 21, 1193–1198. https://doi.org/10.1093/nar/21.5.1193.
57. Beuth, B., García-Mayoral, M.F., Taylor, I.A., and Ramos, A. (2007). Scaffold-independent analysis of RNA-protein interactions: the Nova-1 KH3-RNA complex. J. Am. Chem. Soc. 129, 10205–10210. https://doi.org/10.1021/ja072365q.
58. Trepte, P., Kruse, S., Kostova, S., Hoffmann, S., Buntru, A., Tempelmeier, A., Secker, C., Diez, L., Schulz, A., Klockmeier, K., et al. (2018). LuTHy: a double-readout bioluminescence-based two-hybrid technology for quantitative mapping of protein-protein interactions in mammalian cells. Mol. Syst. Biol. 14, e8071. https://doi.org/10.15252/msb.20178071.
59. Ignjatovic, T., Yang, J.-C., Butler, J., Neuhaus, D., and Nagai, K. (2005). Structural basis of the interaction between P-element somatic inhibitor and U1-70k essential for the alternative splicing of P-element transposase. J. Mol. Biol. 351, 52–65. https://doi.org/10.1016/j.jmb.2005.04.077.
60. Labourier, E., Adams, M.D., and Rio, D.C. (2001). Modulation of P-element pre-mRNA splicing by a direct interaction between PSI and U1 snRNP 70K protein. Mol. Cell 8, 363–373. https://doi.org/10.1016/s1097-2765(01)00311-2.
61. Chung, H.-J., Liu, J., Dundr, M., Nie, Z., Sanford, S., and Levens, D. (2006). FBPs are calibrated molecular tools to adjust gene expression. Mol. Cell. Biol. 26, 6584–6597. https://doi.org/10.1128/MCB.00754-06.
62. ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74. https://doi.org/10.1038/nature11247.
63. Luo, Y., Hitz, B.C., Gabdank, I., Hilton, J.A., Kagda, M.S., Lam, B., Myers, Z., Sud, P., Jou, J., Lin, K., et al. (2020). New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 48, D882–D889. https://doi.org/10.1093/nar/gkz1062.
64. Tammer, L., Hameiri, O., Keydar, I., Roy, V.R., Ashkenazy-Titelman, A., Custódio, N., Sason, I., Shayevitch, R., Rodríguez-Vaello, V., Rino, J., et al. (2022). Gene architecture directs splicing outcome in separate nuclear spatial regions. Mol. Cell 82, 1021–1034.e8. https://doi.org/10.1016/j.molcel.2022.02.001.
65. Amit, M., Donyo, M., Hollander, D., Goren, A., Kim, E., Gelfman, S., Lev-Maor, G., Burstein, D., Schwartz, S., Postolsky, B., et al. (2012). Differential GC content between exons and introns establishes distinct strategies of splice-site recognition. Cell Rep. 1, 543–556. https://doi.org/10.1016/j.celrep.2012.03.013.
66. Enculescu, M., Braun, S., Thonta Setty, S., Busch, A., Zarnack, K., König, J., and Legewie, S. (2020). Exon definition facilitates reliable control of alternative splicing in the RON proto-oncogene. Biophys. J. 118, 2027–2041. https://doi.org/10.1016/j.bpj.2020.02.022.
67. Luck, K., Kim, D.-K., Lambourne, L., Spirohn, K., Begg, B.E., Bian, W., Brignall, R., Cafarelli, T., Campos-Laborie, F.J., Charloteaux, B., et al. (2020). A reference map of the human binary protein interactome. Nature 580, 402–408. https://doi.org/10.1038/s41586-020-2188-x.
68. Briese, M., Haberman, N., Sibley, C.R., Faraway, R., Elser, A.S., Chakrabarti, A.M., Wang, Z., König, J., Perera, D., Wickramasinghe, V.O., et al. (2019). A systems view of spliceosomal assembly and branchpoints with iCLIP. Nat. Struct. Mol. Biol. 26, 930–940. https://doi.org/10.1038/s41594-019-0300-4.
69. Cordiner, R.A., Dou, Y., Thomsen, R., Bugai, A., Granneman, S., and Heick Jensen, T. (2023). Temporal-iCLIP captures co-transcriptional RNA-protein interactions. Nat. Commun. 14, 696. https://doi.org/10.1038/s41467-023-36345-y.
70. Rappsilber, J., Ryder, U., Lamond, A.I., and Mann, M. (2002). Large-scale proteomic analysis of the human spliceosome. Genome Res. 12, 1231–1245. https://doi.org/10.1101/gr.473902.
71. Makarov, E.M., Owen, N., Bottrill, A., and Makarova, O.V. (2012). Functional mammalian spliceosomal complex E contains SMN complex proteins in addition to U1 and U2 snRNPs. Nucleic Acids Res. 40, 2639–2652. https://doi.org/10.1093/nar/gkr1056.
72. Sharma, S., Kohlstaedt, L.A., Damianov, A., Rio, D.C., and Black, D.L. (2008). Polypyrimidine tract binding protein controls the transition from exon definition to an intron defined spliceosome. Nat. Struct. Mol. Biol. 15, 183–191. https://doi.org/10.1038/nsmb.1375.
73. Hsiao, H.-H., Nath, A., Lin, C.-Y., Folta-Stogniew, E.J., Rhoades, E., and Braddock, D.T. (2010). Quantitative characterization of the interactions among c-myc transcriptional regulators FUSE, FBP, and FIR. Biochemistry 49, 4620–4634. https://doi.org/10.1021/bi9021445.
74. Liu, J., He, L., Collins, I., Ge, H., Libutti, D., Li, J., Egly, J.M., and Levens, D. (2000). The FBP interacting repressor targets TFIIH to inhibit activated transcription. Mol. Cell 5, 331–341. https://doi.org/10.1016/s1097-2765(00)80428-1.
75. Huang, J.-R., Warner, L.R., Sanchez, C., Gabel, F., Madl, T., Mackereth, C.D., Sattler, M., and Blackledge, M. (2014). Transient electrostatic interactions dominate the conformational equilibrium sampled by multidomain splicing factor U2AF65: a combined NMR and SAXS study. J. Am. Chem. Soc. 136, 7068–7076. https://doi.org/10.1021/ja502030n.
76. Macias, M.J., Wiesner, S., and Sudol, M. (2002). WW and SH3 domains, two different scaffolds to recognize proline-rich ligands. FEBS Lett. 513, 30–37. https://doi.org/10.1016/s0014-5793(01)03290-2.
77. Ball, L.J., Kühne, R., Schneider-Mergener, J., and Oschkinat, H. (2005). Recognition of proline-rich motifs by protein-protein-interaction domains. Angew. Chem. Int. Ed. Engl. 44, 2852–2869. https://doi.org/10.1002/anie.200400618.
78. Zarrinpar, A., Bhattacharyya, R.P., and Lim, W.A. (2003). The structure and function of proline recognition domains. Sci. STKE 2003, RE8. https://doi.org/10.1126/stke.2003.179.re8.
79. Kofler, M.M., and Freund, C. (2006). The GYF domain. FEBS J. 273, 245–256. https://doi.org/10.1111/j.1742-4658.2005.05078.x.
80. Sudol, M. (1996). Structure and function of the WW domain. Prog. Biophys. Mol. Biol. 65, 113–132. https://doi.org/10.1016/s0079-6107(96)00008-9.
81. Mayer, B.J. (2001). SH3 domains: complexity in moderation. J. Cell Sci. 114, 1253–1263. https://doi.org/10.1242/jcs.114.7.1253.
82. Bell, M.V., Cowper, A.E., Lefranc, M.P., Bell, J.I., and Screaton, G.R. (1998). Influence of intron length on alternative splicing of CD44. Mol. Cell. Biol. 18, 5930–5941. https://doi.org/10.1128/MCB.18.10.5930.
83. Fox-Walsh, K.L., Dou, Y., Lam, B.J., Hung, S.-P., Baldi, P.F., and Hertel, K.J. (2005). The architecture of pre-mRNAs affects mechanisms of splice-site pairing. Proc. Natl. Acad. Sci. USA 102, 16176–16181. https://doi.org/10.1073/pnas.0508489102.
84. Dewey, C.N., Rogozin, I.B., and Koonin, E.V. (2006). Compensatory relationship between splice sites and exonic splicing signals depending on the length of vertebrate introns. BMC Genomics 7, 311. https://doi.org/10.1186/1471-2164-7-311.
85. Gelfman, S., Burstein, D., Penn, O., Savchenko, A., Amit, M., Schwartz, S., Pupko, T., and Ast, G. (2012). Changes in exon-intron structure during vertebrate evolution affect the splicing pattern of exons. Genome Res. 22, 35–50. https://doi.org/10.1101/gr.119834.110.
86. De Conti, L., Baralle, M., and Buratti, E. (2013). Exon and intron definition in pre-mRNA splicing. Wiley Interdiscip. Rev. RNA 4, 49–60. https://doi.org/10.1002/wrna.1140.
87. Schneider, M., Will, C.L., Anokhina, M., Tazi, J., Urlaub, H., and Lührmann, R. (2010). Exon definition complexes contain the tri-snRNP and can be directly converted into B-like precatalytic splicing complexes. Mol. Cell 38, 223–235. https://doi.org/10.1016/j.molcel.2010.02.027.
88. Sharma, S., Wongpalee, S.P., Vashisht, A., Wohlschlegel, J.A., and Black, D.L. (2014). Stem-loop 4 of U1 snRNA is essential for splicing and interacts with the U2 snRNP-specific SF3A1 protein during spliceosome assembly. Genes Dev. 28, 2518–2531. https://doi.org/10.1101/gad.248625.114.
89. Martelly, W., Fellows, B., Senior, K., Marlowe, T., and Sharma, S. (2019). Identification of a noncanonical RNA binding domain in the U2 snRNP protein SF3A1. RNA 25, 1509–1521. https://doi.org/10.1261/rna.072256.119.
90. Plaschka, C., Lin, P.-C., Charenton, C., and Nagai, K. (2018). Prespliceosome structure provides insights into spliceosome assembly and regulation. Nature 559, 419–422. https://doi.org/10.1038/s41586-018-0323-8.
91. Martelly, W., Fellows, B., Kang, P., Vashisht, A., Wohlschlegel, J.A., and Sharma, S. (2021). Synergistic roles for human U1 snRNA stem-loops in pre-mRNA splicing. RNA Biol. 18, 2576–2593. https://doi.org/10.1080/15476286.2021.1932360.
92. Linares, A.J., Lin, C.-H., Damianov, A., Adams, K.L., Novitch, B.G., and Black, D.L. (2015). The splicing regulator PTBP1 controls the activity of the transcription factor Pbx1 during neuronal differentiation. ELife 4, e09268. https://doi.org/10.7554/eLife.09268.
93. Delaglio, F., Grzesiek, S., Vuister, G.W., Zhu, G., Pfeifer, J., and Bax, A. (1995). NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR 6, 277–293. https://doi.org/10.1007/BF00197809.
94. Lee, W., Tonelli, M., and Markley, J.L. (2015). NMRFAM-SPARKY: enhanced software for biomolecular NMR spectroscopy. Bioinformatics 31, 1325–1327. https://doi.org/10.1093/bioinformatics/btu830.
95. Güntert, P. (2009). Automated structure determination from NMR spectra. Eur. Biophys. J. 38, 129–143. https://doi.org/10.1007/s00249-008-0367-z.
96. Shen, Y., Delaglio, F., Cornilescu, G., and Bax, A. (2009). TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shifts. J. Biomol. NMR 44, 213–223. https://doi.org/10.1007/s10858-009-9333-z.
97. Rieping, W., Habeck, M., Bardiaux, B., Bernard, A., Malliavin, T.E., and Nilges, M. (2007). ARIA2: automated NOE assignment and data integration in NMR structure calculation. Bioinformatics 23, 381–382. https://doi.org/10.1093/bioinformatics/btl589.
98. Laskowski, R.A., Rullmannn, J.A., MacArthur, M.W., Kaptein, R., and Thornton, J.M. (1996). Aqua and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J. Biomol. NMR 8, 477–486. https://doi.org/10.1007/BF00228148.
99. Bhattacharya, A., Tejero, R., and Montelione, G.T. (2007). Evaluating protein structures determined by structural genomics consortia. Proteins 66, 778–795. https://doi.org/10.1002/prot.21165.
100. Koradi, R., Billeter, M., and Wüthrich, K. (1996). MOLMOL: A program for display and analysis of macromolecular structures. J. Mol. Graph. 14, 51–55. https://doi.org/10.1016/0263-7855(96)00009-4.
101. Schrödinger, L., and DeLano, W. (2020). PyMOL. http://www.pymol.org/pymol.
102. Schindelin, J., Arganda-Carreras, I., Frise, E., Kaynig, V., Longair, M., Pietzsch, T., Preibisch, S., Rueden, C., Saalfeld, S., Schmid, B., et al. (2012). Fiji: an open-source platform for biological-image analysis. Nat. Methods 9, 676–682. https://doi.org/10.1038/nmeth.2019.
103. Coleman, T., Branch, M.A., and Grace, A. (1999). Optimization Toolbox. For Use with MATLAB. User’s guide. The MathWorks Inc, Ver. 2.
104. R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. http://www.R-project.org/.
105. Vaquero-Garcia, J., Barrera, A., Gazzara, M.R., González-Vallinas, J., Lahens, N.F., Hogenesch, J.B., Lynch, K.W., and Barash, Y. (2016). A new view of transcriptome complexity and regulation through the lens of local splicing variations. ELife 5, e11752. https://doi.org/10.7554/eLife.11752.
106. Dosch, J., Bergmann, H., Tran, V., and Ebersberger, I. (2023). FAS: assessing the similarity between proteins using multi-layered feature architectures. Bioinformatics 39, btad226. https://doi.org/10.1093/bioinformatics/btad226.
107. Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T.R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21. https://doi.org/10.1093/bioinformatics/bts635.
108. Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12. https://doi.org/10.14806/ej.17.1.200.
109. Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., and Li,
H. (2021). Twelve years of SAMtools and BCFtools. Gigascience 10, giab008. https://doi.org/10.1093/gigascience/giab008.
110. Liao, Y., Smyth, G.K., and Shi, W. (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930. https://doi.org/10.1093/bioinformatics/btt656.
111. Roehr, J.T., Dieterich, C., and Reinert, K. (2017). Flexbar 3.0 - SIMD and multicore parallelization. Bioinformatics 33, 2941–2942. https://doi.org/10.1093/bioinformatics/btx330.
112. Krakau, S., Richard, H., and Marsico, A. (2017). PureCLIP: capturing target-specific protein-RNA interaction footprints from single-nucleotide CLIP-seq data. Genome Biol. 18, 240. https://doi.org/10.1186/s13059-017-1364-2.
113. Lorenz, R., Bernhart, S.H., Höner Zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P.F., and Hofacker, I.L. (2011). ViennaRNA package 2.0. Algorithms Mol. Biol. 6, 26. https://doi.org/10.1186/1748-7188-6-26.
114. Huppertz, I., Attig, J., D’Ambrogio, A., Easton, L.E., Sibley, C.R., Sugimoto, Y., Tajnik, M., König, J., and Ule, J. (2014). iCLIP: protein–RNA interactions at nucleotide resolution. Methods 65, 274–287. https://doi.org/10.1016/j.ymeth.2013.10.011.
115. Spellman, R., Llorian, M., and Smith, C.W.J. (2007). Crossregulation and functional redundancy between the splicing regulator PTB and its paralogs nPTB and ROD1. Mol. Cell 27, 420–434. https://doi.org/10.1016/j.molcel.2007.06.016.
116. Coelho, M.B., Attig, J., Bellora, N., König, J., Hallegger, M., Kayikci, M., Eyras, E., Ule, J., and Smith, C.W.J. (2015). Nuclear matrix protein Matrin3 regulates alternative splicing and forms overlapping regulatory networks with PTB. EMBO J. 34, 653–668. https://doi.org/10.15252/embj.201489852.
117. Grzesiek, S., and Bax, A. (1992). Correlating backbone amide and side chain resonances in larger proteins by multiple relayed triple resonance NMR. J. Am. Chem. Soc. 114, 6291–6293. https://doi.org/10.1021/ja00042a003.
118. Sattler, M., Schleucher, J., and Griesinger, C. (1999). Heteronuclear multidimensional NMR experiments for the structure determination of proteins in solution employing pulsed field gradients. Prog. Nucl. Magn. Reson. Spectrosc. 34, 93–158. https://doi.org/10.1016/s0079-6565(98)00025-9.
119. Wishart, D.S., and Sykes, B.D. (1994). The 13C chemical-shift index: a simple method for the identification of protein secondary structure using 13C chemical-shift data. J. Biomol. NMR 4, 171–180. https://doi.org/10.1007/BF00175245.
120. Saitô, H. (1986). Conformation-dependent 13C chemical shifts: a new means of conformational characterization as obtained by high-resolution solid-state 13C NMR. Magn. Reson. Chem. 24, 835–852. https://doi.org/10.1002/mrc.1260241002.
121. Kjaergaard, M., and Poulsen, F.M. (2011). Sequence correction of random coil chemical shifts: correlation between neighbor correction factors and changes in the Ramachandran distribution. J. Biomol. NMR 50, 157–165. https://doi.org/10.1007/s10858-011-9508-2.
122. Farrow, N.A., Muhandiram, R., Singer, A.U., Pascal, S.M., Kay, C.M., Gish, G., Shoelson, S.E., Pawson, T., Forman-Kay, J.D., and Kay, L.E. (1994). Backbone dynamics of a free and phosphopeptide-complexed Src homology 2 domain studied by 15N NMR relaxation. Biochemistry 33, 5984–6003. https://doi.org/10.1021/bi00185a040.
123. Mulder, F.A., Schipper, D., Bott, R., and Boelens, R. (1999). Altered flexibility in the substrate-binding site of related native and engineered high-alkaline Bacillus subtilisins. J. Mol. Biol. 292, 111–123. https://doi.org/10.1006/jmbi.1999.3034.
124. Williamson, M.P. (2013). Using chemical shift perturbation to characterise ligand binding. Prog. Nucl. Magn. Reson. Spectrosc. 73, 1–16. https://doi.org/10.1016/j.pnmrs.2013.02.001.
125. Zwahlen, C., Gardner, K.H., Sarma, S.P., Horita, D.A., Byrd, R.A., and Kay, L.E. (1998). An NMR experiment for measuring methyl-methyl NOEs in 13C-labeled proteins with high resolution. J. Am. Chem. Soc. 120, 7617–7625. https://doi.org/10.1021/ja981205z.
126. Marsh, J.A., Singh, V.K., Jia, Z., and Forman-Kay, J.D. (2006). Sensitivity of secondary structure propensities to sequence differences between alpha- and gamma-synuclein: implications for fibrillation. Protein Sci. 15, 2795–2804. https://doi.org/10.1110/ps.062465306.
127. Linge, J.P., Williams, M.A., Spronk, C.A.E.M., Bonvin, A.M.J.J., and Nilges, M. (2003). Refinement of protein structures in explicit solvent. Proteins 50, 496–506. https://doi.org/10.1002/prot.10299.
128. Brünger, A.T., Adams, P.D., Clore, G.M., DeLano, W.L., Gros, P., Grosse-Kunstleve, R.W., Jiang, J.S., Kuszewski, J., Nilges, M., Pannu, N.S., et al. (1998). Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr. D Biol. Crystallogr. 54, 905–921. https://doi.org/10.1107/s0907444998003254.
129. Messias, A.C., and Sattler, M. (2004). Structural basis of single-stranded RNA recognition. Acc. Chem. Res. 37, 279–287. https://doi.org/10.1021/ar030034m.
130. Wiemann, S., Pennacchio, C., Hu, Y., Hunter, P., Harbers, M., Amiet, A., Bethel, G., Busse, M., Carninci, P., Dunham, I., et al. (2016). The ORFeome Collaboration: a genome-scale human ORF-clone resource. Nature Methods 13, 191–192.
131. Frankish, A., Diekhans, M., Ferreira, A.-M., Johnson, R., Jungreis, I., Loveland, J., Mudge, J.M., Sisu, C., Wright, J., Armstrong, J., et al. (2019). GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773. https://doi.org/10.1093/nar/gky955.
132. Busch, A., Brüggemann, M., Ebersberger, S., and Zarnack, K. (2020). iCLIP data analysis: a complete pipeline from sequencing reads to RBP binding sites. Methods 178, 49–62. https://doi.org/10.1016/j.ymeth.2019.11.008.
133. Paggi, J.M., and Bejerano, G. (2018). A sequence-based, deep learning model accurately predicts RNA splicing branchpoints. RNA 24, 1647–1658. https://doi.org/10.1261/rna.066290.118.
134. Hinrichs, A.S., Karolchik, D., Baertsch, R., Barber, G.P., Bejerano, G., Clawson, H., Diekhans, M., Furey, T.S., Harte, R.A., Hsu, F., et al. (2006). The UCSC genome browser database: update 2006. Nucleic Acids Res. 34, D590–D598. https://doi.org/10.1093/nar/gkj144.
135. Green, C.J., Gazzara, M.R., and Barash, Y. (2018). MAJIQ-SPEL: web-tool to interrogate classical and complex splicing variations from RNA-Seq data. Bioinformatics 34, 300–302. https://doi.org/10.1093/bioinformatics/btx565.
136. Norton, S.S., Vaquero-Garcia, J., Lahens, N.F., Grant, G.R., and Barash, Y. (2018). Outlier detection for improved differential splicing quantification from RNA-Seq experiments with replicates. Bioinformatics 34, 1488–1497. https://doi.org/10.1093/bioinformatics/btx790.
137. Zhang, J., Bajari, R., Andric, D., Gerthoffert, F., Lepsa, A., Nahal-Bose, H., Stein, L.D., and Ferretti, V. (2019). The International Cancer Genome Consortium data portal. Nat. Biotechnol. 37, 367–369. https://doi.org/10.1038/s41587-019-0055-9.
138. Cerami, E., Gao, J., Dogrusoz, U., Gross, B.E., Sumer, S.O., Aksoy, B.A., Jacobsen, A., Byrne, C.J., Heuer, M.L., Larsson, E., et al. (2012). The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2, 401–404. https://doi.org/10.1158/2159-8290.CD-12-0095.
139. Gao, J., Aksoy, B.A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S.O., Sun, Y., Jacobsen, A., Sinha, R., Larsson, E., et al. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 6, pl1. https://doi.org/10.1126/scisignal.2004088.
140. Karczewski, K.J., Francioli, L.C., Tiao, G., Cummings, B.B., Alföldi, J., Wang, Q., Collins, R.L., Laricchia, K.M., Ganna, A., Birnbaum, D.P.,
Molecular Cell 83, 2653–2672, August 3, 2023 2671
63
ll
OPEN ACCESS Article
et al. (2020). Themutational constraint spectrum quantified from variation dence. Nucleic Acids Res. 46. D1062–D1067. https://doi.org/10.1093/
in 141,456 humans. Nature 581, 434–443. https://doi.org/10.1038/ nar/gkx1153.
s41586-020-2308-7. 144. Yeo, G., and Burge, C.B. (2004). Maximum entropy modeling of short
141. Tate, J.G., Bamford, S., Jubb, H.C., Sondka, Z., Beare, D.M., Bindal, N., sequence motifs with applications to RNA splicing signals. J. Comput.
Boutselakis, H., Cole, C.G., Creatore, C., Dawson, E., et al. (2019). Biol. 11, 377–394. https://doi.org/10.1089/1066527041410418.
COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids 145. Birikmen, M., Bohnsack, K.E., Tran, V., Somayaji, S., Bohnsack, M.T.,
Res. 47. D941–D947. https://doi.org/10.1093/nar/gky1015. and Ebersberger, I. (2021). Tracing eukaryotic ribosome biogenesis
142. Grossman, R.L., Heath, A.P., Ferretti, V., Varmus, H.E., Lowy, D.R., factors into the archaeal domain sheds light on the evolution of functional
Kibbe, W.A., and Staudt, L.M. (2016). Toward a shared vision for cancer complexity. Front. Microbiol. 12, 739000. https://doi.org/10.3389/fmicb.
genomic data. N. Engl. J. Med. 375, 1109–1112. https://doi.org/10.1056/ 2021.739000.
NEJMp1607591. 146. Smith, T., Heger, A., and Sudbery, I. (2017). UMI-tools: modeling
143. Landrum, M.J., Lee, J.M., Benson, M., Brown, G.R., Chao, C., sequencing errors in Unique Molecular Identifiers to improve quantifica-
Chitipiralla, S., Gu, B., Hart, J., Hoffman, D., Jang, W., et al. (2018). tion accuracy. Genome Res. 27, 491–499. https://doi.org/10.1101/gr.
ClinVar: improving access to variant interpretations and supporting evi- 209601.116.
STAR★METHODS
KEY RESOURCES TABLE
REAGENT or RESOURCE SOURCE IDENTIFIER
Antibodies
Rabbit anti-FUBP1 GeneTex Cat# GTX104579; RRID: AB_11165485
Mouse anti-U2AF2 Sigma-Aldrich Cat# U4758; RRID: AB_262122
Mouse anti-SF3B1 MBL Cat# D221-3; RRID: AB_592712
Mouse anti-SF1 Abnova Cat# H00007536-M01A; RRID: AB_10774630
Rabbit anti-PTBP1 Christopher Smith Linares et al.92
Mouse anti-vinculin Sigma-Aldrich Cat# V9264; RRID: AB_10603627
Goat anti-rabbit IgG, HRP-linked Cell Signaling Cat# 7074S; RRID: AB_2099233
Horse anti-mouse IgG, HRP-linked Cell Signaling Cat# 7076S; RRID: AB_330924
Bacterial and virus strains
DH5alpha Invitrogen Cat# 18265017
MACH1 Invitrogen Cat# C862003
E. coli BL21-CodonPlus (DE3)-RIL Agilent Cat# 230245
E. coli BL21 (DE3) Sigma-Aldrich Cat# CMC0014
Chemicals, peptides, and recombinant proteins
FUGENE HD reagent Promega Cat# E2311
Lipofectamine CRISPRMAX reagent Thermo Fisher Cat# CMAX00001
Lipofectamine RNAimax Thermo Fisher Cat# 13778150
Lipofectamine 2000 Invitrogen Cat# 11668019
cOmplete Protease-Inhibitor Mix Sigma-Aldrich Cat# 4693159001
TURBO DNase Thermo Fisher Cat# AM2238
SuperSignal West PICO Chemiluminescent Substrate Thermo Fisher Cat# 15626144
4-thiouridine Sigma-Aldrich Cat# T4509-25MG
T4 RNA ligase New England Biolabs Cat# M0202S
T4 RNA ligase 1 New England Biolabs Cat# M0437M
pCp-Cy5 Jena Bioscience Cat# NU-1706-CY5
T7 RNA polymerase Geerlof A., Protein Expression and Purification Facility, HMGU Munich N/A
Pfu DNA Polymerase Promega Cat# M7741
OneTaq DNA Polymerase New England Biolabs Cat# M0480S
Phusion High-Fidelity DNA Polymerase New England Biolabs Cat# M0530S
Critical commercial assays
TranscriptAid Enzyme Mix Thermo Fisher Cat# K0441
GeneArt Genomic Cleavage Detection Assay Thermo Fisher Cat# A24372
Zero Blunt TOPO PCR Cloning Kit Thermo Fisher Cat# 451245
RNeasy PLUS Mini Kit Qiagen Cat# 74034
TruSeq library preparation Kit ‘‘Ribo-Zero Gold’’ Illumina Cat# 20040526
RevertAid First Strand cDNA Synthesis Kit Thermo Fisher Cat# 10161310
Q5 Site-Directed Mutagenesis Kit New England Biolabs Cat# E0552S
High Sensitivity D1000 ScreenTape Agilent Cat# 5067-5584
High Sensitivity RNA ScreenTape Agilent Cat# 5067-5579
NuPAGE 1 mm, 4–12% Bis-Tris Mini Protein Gel Thermo Fisher Cat# 12090156
HiScribe T7 High Yield RNA Synthesis Kit New England Biolabs Cat# E2040S
Molecular Cell 83, 2653–2672.e1–e15, August 3, 2023 e1
ProNex Dual Size-Selective Purification System Promega Cat# NG2002
BP clonase II mix kit Invitrogen Cat# 10348582
LR clonase technology Invitrogen Cat# 11791020
Deposited data
in vitro and in vivo iCLIP and RNA-Seq data This study GEO: GSE220186
Kinetic modeling of cassette exon splicing This study https://doi.org/10.5281/zenodo.8076768
Protein structure data This study PDB: 8P25
NMR data This study BMRB: 34816
Original Western blot, gel images and capillary electrophoresis images This study, Mendeley Data https://doi.org/10.17632/nj8ybm8vb2.1
RNA-Seq data: control and shRNA knockdown for FUBP1 in K562 cells Luo et al.63, ENCODE Project Consortium62 ENCODE: ENCSR260BQC (control) and ENCSR608IXR (FUBP1 KD)
Differentially spliced junctions in splicing factor mutations Seiler et al.1 Table S3 in Seiler et al.
Experimental models: Cell lines
human: HeLa ATCC Cat# CCL-2, RRID:CVCL_0030
human: RPE1 FUBP1 WT: hTERT-RPE1 NatNeo Cas9 Mono Puro sens Manuel Kaulich N/A
human: RPE1 FUBP1 KO: hTERT-RPE1 NatNeo Cas9 Mono Puro sens FUBP1 -/- This study N/A
human: RPE1 FUBP1 Nbox-mut: hTERT-RPE1 NatNeo Cas9 Mono Puro sens FUBP1 indel 31-40 This study N/A
human: HEK293 DSMZ ACC305
Oligonucleotides
See Table S5 (too many oligos to list here) N/A
Recombinant DNA
See Table S6 (too many plasmids to list here) N/A
Software and algorithms
Topspin 3.5 Bruker https://www.bruker.com/en/products-and-solutions/mr/nmr-software/topspin.html
NMRpipe Delaglio et al.93 https://www.ibbr.umd.edu/nmrpipe/index.html
NMRFAM-Sparky Lee et al.94 https://nmrfam.wisc.edu/nmrfam-sparky-distribution/
CYANA 3.98.13 Güntert95 https://cyana.org/wiki/Main_Page
TALOS+ Shen et al.96 https://spin.niddk.nih.gov/bax/software/TALOS/
ARIA2.3 Rieping et al.97 http://aria.pasteur.fr/
ProcheckNMR Laskowski et al.98 https://www.ebi.ac.uk/thornton-srv/software/PROCHECK/
PSVS Bhattacharya et al.99 https://montelionelab.chem.rpi.edu/PSVS/PSVS/
MolMol Koradi et al.100 https://sourceforge.net/p/molmol/wiki/Home/
PYMOL Schrödinger and DeLano101 https://pymol.org/2/
ImageJ 2.1.0 Schindelin et al.102 https://imagej.net/
MicroCalPEAQ ITC Analysis software Malvern Panalytical https://www.malvernpanalytical.com/
Agilent TapeStation Software 5.1 Agilent https://www.agilent.com
Image Lab 6.0.1 build 34 Bio-Rad https://www.bio-rad.com/
MATLAB Coleman et al.103 https://www.mathworks.com/
R 4.1.1. Core Team104 https://www.r-project.org/
MAJIQ v2.3 Vaquero-Garcia et al.105 https://majiq.biociphers.org/
FAS Dosch et al.106 https://github.com/BIONF/FAS
fDOG N/A https://github.com/BIONF/fDOG
STAR Dobin et al.107 https://github.com/alexdobin/STAR
Cutadapt 2.4 Martin108 https://cutadapt.readthedocs.io/en/stable/
Samtools v1.9 Danecek et al.109 http://www.htslib.org/
Subread tool suite v1.6.2 Liao et al.110 https://subread.sourceforge.net/
FastQC v0.11.8 N/A https://www.bioinformatics.babraham.ac.uk/projects/fastqc
FASTX-Toolkit v0.0.14 N/A http://hannonlab.cshl.edu/fastx_toolkit/
seqtk v1.3 N/A https://github.com/lh3/seqtk/
Flexbar v3.4.0 Roehr et al.111 https://github.com/seqan/flexbar
PureCLIP v1.3.1 Krakau et al.112 https://github.com/skrakau/PureCLIP
ViennaRNA Package 2.4.17 Lorenz et al.113 https://www.tbi.univie.ac.at/RNA/
RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Julian
König (j.koenig@imb-mainz.de).
Materials availability
All unique/stable reagents generated in this study are available from the lead contact.
Data and code availability
• RNA-seq, in vivo and in vitro iCLIP data have been deposited at GEO and are publicly available as of the date of publication.
Accession numbers are listed in the key resources table. Protein structures have been deposited to the Protein Data Bank and
are available under the accession number 8P25. NMR data used for structure calculation are deposited in the BMRB under the
accession code 34816. Original Western blot, gel images and capillary electrophoresis images have been deposited at
Mendeley Data and are publicly available as of the date of publication. The DOI is listed in the key resources table.
• This paper analyses existing, publicly available data. These accession numbers for the datasets are listed in the key
resources table.
• All original code has been deposited at GitHub and is publicly available as of the date of publication at
https://doi.org/10.5281/zenodo.8076768.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS
RPE1 cell lines and culture conditions
The hTERT-RPE1 NatNeo Cas9 Mono Puro sens cell line was a generous gift of the Kaulich lab at the Frankfurt CRISPR/Cas
Screening Center (FCSC) and is modified from original hTERT RPE1 cells (ATCC, CRL-4000). Cells were grown and maintained
in Dulbecco’s modified Eagle’s medium (DMEM): Nutrient Mixture F-12 (DMEM/F-12; Thermo Fisher 11530566), supplemented
with 10% fetal bovine serum (PAN-Biotech), 2 mM glutamine (Thermo Fisher), 1% penicillin–streptomycin (Thermo Fisher), and
20 µg/ml hygromycin B (Thermo Fisher). Cells were incubated at 37°C with 5% CO2. Subcultivation was performed with 3 ml of
0.1% trypsin every 2–3 days for 20 passages. After that, new cells were thawed from stocks containing 1 × 10⁶ cells in 1 ml of growth
medium, supplemented with 10% DMSO and 50% fetal bovine serum (FBS). For semi-quantitative RT-PCR, 1 × 10⁵ RPE1 cells were
seeded into one well of a six-well plate (Falcon), one day prior to transfection. DNA (2 µg) was diluted in 100 µl of OptiMEM and
transfected with 6.4 µl of Fugene HD reagent (Promega). Cells were incubated at 37°C with 5% CO2 for 24 h before harvesting. For
RNA-seq, 1.5 × 10⁶ cells were seeded in a 10-cm cell culture dish (Corning) 48 h prior to isolation.
HeLa cell line and culture conditions
HeLa cells (ATCC CCL-2) were grown and maintained in DMEM (Thermo Fisher), supplemented with 10% FBS, 2 mM glutamine
(Thermo Fisher) and 1% penicillin–streptomycin (Thermo Fisher). Cells were incubated at 37°C with 5% CO2. Subcultivation was
performed with 3 ml of 0.1% trypsin every 2–3 days for 20 passages. After that, new cells were thawed from stocks containing 1 × 10⁶
cells in 1 ml of growth medium, supplemented with 10% DMSO (Sigma) and 50% FBS.
HEK cell line and culture conditions
HEK293 cells (DSMZ) were grown and maintained in DMEM (Thermo Fisher), supplemented with 10% fetal bovine serum
(PAN-Biotech), 2 mM glutamine (Thermo Fisher) and 1% penicillin–streptomycin (Thermo Fisher). Cells were incubated at 37°C with 5%
CO2. Subcultivation was performed with 1 ml of 0.05% trypsin every 2–3 days for up to 15 passages. Then, new cells were thawed
from stocks containing 2 × 10⁶ cells in 1 ml of growth medium, supplemented with 10% DMSO (Sigma) and 90% FBS.
Recombinant protein expression
Proteins were expressed in E. coli BL21 (DE3) cells grown in LB medium or M9 minimal medium supplemented with 1 g/l 15NH4Cl and
2 g/l 13C-glucose (uniformly labeled) at 37°C. Protein expression was induced with 1.0 mM isopropyl β-D-1-thiogalactopyranoside
(IPTG).
METHOD DETAILS
Establishing FUBP1 KO/Nboxmut cell lines
FUBP1 was mutated and knocked out using the CRISPR/Cas9 system in hTERT-RPE1 NatNeo mono puro sens cells. This cell line is
puromycin sensitive and expresses Streptococcus pyogenes Cas9 under neomycin resistance. For the creation of the FUBP1 KO
and FUBP1-Nboxmut RPE1 cell lines, cells were cultured as described above with the addition of neomycin (G418, InvivoGen) to
preserve Cas9 expression. Guide RNA (gRNA) was amplified from oligos #54 and #55 (Table S5) with Phusion Polymerase (New England
Biolabs) and in vitro transcribed with TranscriptAid Enzyme Mix (Thermo Fisher) according to the manufacturer’s protocol. Cells were
then transfected with the resulting gRNA using Lipofectamine CRISPRMAX (Thermo Fisher) according to the manufacturer’s protocol
and incubated for 48 h. To assess the general editing efficiency, a GeneArt Genomic Cleavage Detection Assay (Thermo Fisher) was
performed. Edited cells were then sorted by fluorescence-activated cell sorting (FACS), and each cell was cultured in a separate well
of a 96-well plate (Corning). From each clonal cell line, genomic DNA (gDNA) was isolated and amplified by PCR. The successful
disruption of the targeted site was validated by restriction enzyme digestion and Sanger sequencing (StarSEQ GmbH, Mainz, Germany) of
the colonies. To obtain the novel sequence of the targeted site on both alleles, gDNA was also cloned into TOPO vectors using
the Zero Blunt TOPO PCR Cloning Kit (Thermo Fisher), and the obtained plasmids were Sanger-sequenced. All Sanger sequencing
reactions were performed with oligo #56 (Table S5). The edited sequences led to mutated protein products, as shown in Figure S5G.
Immunoblotting
For each hTERT RPE1-derived cell line, 1 × 10⁶ cells were seeded on a 10-cm cell culture dish (Corning) and harvested after
incubation for 48 h at 37°C, 5% CO2. Cells were lysed in modified RIPA buffer containing 50 mM Tris-HCl, 150 mM NaCl, 1 mM EDTA, 1%
NP-40 (Sigma), 0.1% sodium deoxycholate (Sigma) and supplemented with cOmplete Protease Inhibitor Mix (Sigma), and TURBO
DNase (Thermo Fisher) for 15 min on ice. Cell debris was pelleted by centrifugation at 16,000 × g for 15 min at 4°C. The cleared
protein lysate was transferred into a new reaction tube (Eppendorf) and the concentration was measured with a BCA Protein Assay
Kit (Thermo Fisher). 20 µg of protein lysate was mixed with 4× NuPAGE LDS Sample Buffer and heated to 70°C for 10 min. Samples
were loaded onto a NuPAGE 1 mm, 4–12% Bis-Tris Mini Protein Gel (Thermo Fisher) and electrophoresis was performed at 180 V,
400 mA for 50 min on a NuPAGE Novex Gel System (Invitrogen). Protein transfer to a nitrocellulose membrane (VWR International)
was performed at 30 V, 400 mA over 60 min using the same gel system. The membrane was blocked in 5% milk diluted in PBS-T. The
primary antibody (key resources table) was incubated overnight at 4°C, and the secondary antibody was incubated for 60 min at room
temperature. All antibodies were diluted in 5% milk–PBS-T. Between blocking and primary and secondary antibody steps, the
membrane was washed three times with PBS-T. Detection was performed with SuperSignal West PICO Chemiluminescent Substrate
(Thermo Fisher) and a GelDoc imaging system (Bio-Rad).
RPE1 RNA-seq
For RPE1 RNA sequencing (ID: imb_koenig_2020_12) and semi-quantitative RT-PCR analysis, RPE1 cells were grown as described
above. Cells were washed once with DPBS and harvested with a cell scraper in 1 ml of DPBS. Suspensions were centrifuged at
1,000 × g for 1 min at 4°C. RNA was isolated from cell pellets using an RNeasy PLUS Mini Kit (Qiagen) according to the
manufacturer’s protocol. For sequencing, RNA concentration was measured by Qubit RNA BR Assay and integrity of the RNA was confirmed
by Bioanalyzer RNA Nano Assay (Agilent). Ribosomal RNA was removed and the remaining RNA was reverse transcribed into cDNA
using the TruSeq library preparation kit with Ribo-Zero Gold (Illumina). The libraries were sequenced on an Illumina NextSeq 500
sequencer as 159-nt single-end reads.
HeLa RNA-seq
200,000 cells were seeded per well in a six-well dish 24 h prior to siRNA treatment. RNA-seq to assess intron splicing in HeLa cells (ID:
imb_koenig_2018_18) was performed in four replicates. HeLa cells underwent a control knockdown (KD) with no-target siRNA.
Oligos #40–#43 (Table S5) were delivered into cells using 3 µl of Lipofectamine RNAimax (Thermo Fisher) in 100 µl of OptiMEM to
achieve a final siRNA concentration of 20 nM. Cells were harvested after incubation for 48 h. RNA was isolated from cell pellets using
the RNeasy PLUS Mini Kit (Qiagen) according to the manufacturer’s protocol. RNA concentration was measured by Qubit RNA BR
Assay, and integrity of the RNA was confirmed by Bioanalyzer RNA Nano Assay (Agilent). Ribosomal RNA was removed and the
remaining RNA was reverse transcribed into cDNA using the TruSeq library preparation kit with Ribo-Zero Gold (Illumina). RNA-seq
samples were sequenced on an Illumina NextSeq 500 sequencer as 84-nt single-end reads.
Semi-quantitative RT-PCR
The MPDZ minigene was created from HeLa gDNA extracts by amplification of chr9:13,183,353-13,189,041 with Phusion
High-Fidelity Polymerase (New England Biolabs). The PCR fragment was cloned into a pCR2.1 vector by Gibson assembly (IMB
Protein Production Core Facility). MPDZ introns were shortened using a Q5 Site-Directed Mutagenesis Kit (New England Biolabs),
resulting in MPDZΔintron, which lacks chr9:13,186,637-13,188,633 and chr9:13,183,736-13,186,120; MPDZΔBS, lacking
chr9:13,186,494-13,186,618 and chr9:13,183,632-13,186,718; and MPDZΔintron+ΔBS, lacking chr9:13,186,494-13,188,633 and
chr9:13,183,632-13,186,120 (Figure S6B). The open reading frames for GFP and the FUBP1 variants (FUBP1FL, FUBP1ΔN,
FUBP1A38D, FUBP1ΔC, FUBP1W586,615R) used in the complementation assay were integrated in a pcDNA5 vector containing a
CMV promoter and an N-terminal GFP tag, which was then used to transform DH5alpha cells (Invitrogen). All expression vectors
and minigenes are described in Table S6. Plasmid purification was performed with the Qiaprep Spin Miniprep Kit (Qiagen) or the
Qiaprep Plasmid Plus Midi Kit (Qiagen). Sequences were verified by Sanger sequencing. All hTERT RPE1 cell lines were seeded,
transfected, and harvested as described in the section "RPE1 cell lines and culture conditions". For complementation, an equimolar
amount of expression vector and minigene was used. RNA was isolated with the RNeasy Plus Mini Kit (Qiagen) and reverse transcribed
using the RevertAid First Strand cDNA Synthesis Kit (Thermo Fisher). The minigene cDNA was then amplified using OneTaq DNA
Polymerase according to the manufacturer’s protocol and oligos #57 and #58 as primers (Table S5). Splicing products were assessed on
a High Sensitivity D1000 ScreenTape (Agilent) (Figure S6D). The percent spliced-in (PSI) value for the alternative exon was determined
using the following formula: PSI = Inclusion / (Inclusion + Skipping). PSI values in the complementation experiment were normalized to
the mean of the wild-type (WT) within each condition. Statistical significance was assessed by Student’s t-test and multiple testing
correction was performed using the false discovery rate (FDR).
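The PSI calculation and WT normalization described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names and the band intensities are hypothetical, and normalization is assumed to mean division by the WT mean (the text does not state whether a ratio or a difference was used).

```python
from statistics import mean


def psi(inclusion: float, skipping: float) -> float:
    """Percent spliced-in as Inclusion / (Inclusion + Skipping)."""
    total = inclusion + skipping
    if total <= 0:
        raise ValueError("inclusion + skipping must be positive")
    return inclusion / total


def normalize_to_wt(psi_by_line: dict, wt_key: str = "WT") -> dict:
    """Divide each PSI value by the mean PSI of the WT replicates (assumed ratio normalization)."""
    wt_mean = mean(psi_by_line[wt_key])
    return {line: [v / wt_mean for v in values] for line, values in psi_by_line.items()}


# Hypothetical (inclusion, skipping) band intensities for one condition, two replicates each
bands = {"WT": [(60.0, 40.0), (62.0, 38.0)], "KO": [(30.0, 70.0), (33.0, 67.0)]}
psi_values = {line: [psi(i, s) for i, s in reps] for line, reps in bands.items()}
normalized = normalize_to_wt(psi_values)
```

With this convention the WT replicates average to 1 within each condition, so values for the other cell lines read directly as fold changes relative to WT.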
In vivo iCLIP
In vivo iCLIP was used to study protein–RNA interactions with individual nucleotide resolution.43 For the U2AF2 in vivo iCLIP study,
data from two iCLIP experiments were combined. The first U2AF2 and PTBP1 in vivo iCLIP experiments were performed as
previously described.114 The second U2AF2 in vivo iCLIP experiment as well as in vivo iCLIP experiments on FUBP1, SF1, and SF3B1 were
performed using the iCLIP2 protocol as previously described.44 In brief, HeLa cells were irradiated (150 mJ/cm²) in a CL1000 UV
crosslinker (UVP) to covalently bond the RNA-binding proteins to the bound nucleic acids. For in vivo iCLIP of FUBP1, crosslinking
was achieved by 4-thiouridine (4sU)-mediated crosslinking (see section below). During subsequent cell lysis, the lysate was
DNase-treated with TURBO DNase (Thermo Fisher) and RNA was partially digested to create 50–200-nt fragments. Immunoprecipitation of
the investigated proteins was performed with antibodies listed in the key resources table. The anti-PTBP1 antibody was a kind gift
from Christopher Smith.115 Radioactive labeling at the 3′ end of the precipitated RNA enables visualization of the RNP complex by
SDS-PAGE and transfer to a nitrocellulose membrane. After recovery of protein–RNA complexes from the membrane, proteinase K
digestion resulted in protein-free RNA. cDNA was synthesized by reverse transcription, which stops at the crosslinked site, leading to
truncated reads in the sequencing. The cDNA was cleaned twice using MyONE Silane beads (Thermo Fisher). PCR amplification and
ProNex size selection were performed to amplify and purify the library, respectively. In vivo iCLIP libraries (except PTBP1 libraries)
were sequenced on an Illumina NextSeq 500 sequencer as 92-nt single-end reads including a 6-nt (or 4-nt in the case of the first
U2AF2 iCLIP) sample barcode as well as 5+4-nt (or 3+2-nt) unique molecular identifiers (UMIs). PTBP1 iCLIP libraries were
sequenced on an Illumina GA-II machine116 and then re-sequenced on an Illumina HiSeq 2000 machine as 50-nt single-end reads
including a 4-nt sample barcode and 3+2-nt UMIs.
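To make the read layout concrete, the sketch below splits a read into UMI, sample barcode and insert. It assumes that the two UMI parts flank the sample barcode at the start of the read (UMI part 1, barcode, UMI part 2, insert), which is how the "5+4-nt" notation is read here; the function name is hypothetical and the layout should be checked against the actual library design.

```python
def split_iclip_read(read: str, umi1: int = 5, barcode: int = 6, umi2: int = 4):
    """Split a read into (umi, sample_barcode, insert).

    Assumed layout at the 5' end of the read: UMI part 1, sample barcode, UMI part 2.
    Defaults follow the 5+4-nt UMI / 6-nt barcode design; use (3, 4, 2) for the older design.
    """
    prefix = umi1 + barcode + umi2
    if len(read) < prefix:
        raise ValueError("read is shorter than the barcode/UMI prefix")
    umi = read[:umi1] + read[umi1 + barcode:prefix]  # concatenate both UMI parts
    sample_barcode = read[umi1:umi1 + barcode]
    return umi, sample_barcode, read[prefix:]


# Toy read: 5-nt UMI part, 6-nt barcode, 4-nt UMI part, then the cDNA insert
umi, sample_barcode, insert = split_iclip_read("AAAAA" + "CCGGTT" + "TTTT" + "ACGTACGT")
```

Under this layout, a 92-nt read leaves 77 nt of insert with the 5+6+4 design and 83 nt with the 3+4+2 design.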
4-thiouridine crosslinking of FUBP1 in vivo iCLIP
For the FUBP1 in vivo iCLIP, HeLa cells were 4sU-labeled by adding 0.1 M 4sU in DMSO to a final concentration of 100 µM in a 10-cm
cell culture dish. Cells were incubated for 16 h at 37°C, 5% CO2, with the exclusion of light. After incubation, the cells were moved
onto ice, shielded from light and irradiated at 365 nm, 800 mJ. Then, iCLIP was performed as described above.
In vitro iCLIP
In vitro iCLIP measures the intrinsic RNA-binding affinity of an RNA-binding protein (RBP).28 To that end, recombinant proteins and
in vitro transcripts resembling long natural transcripts28 or a large-scale RNA pool transcribed from an oligonucleotide library were
mixed and subjected to UV crosslinking and immunoprecipitation of the RBP of interest.
Production of recombinant proteins
N-terminally 6xHis-tagged U2AF2RRM12 was purified as previously described.28 In brief, a recombinant construct (Table S6) was
expressed in E. coli BL21-CodonPlus (DE3)-RIL cells (Agilent) for 3–4 h at 37°C using LB medium and 1 mM IPTG. U2AF2RRM12 was
purified using Ni Sepharose 6 Fast Flow beads (GE Healthcare) according to the manufacturer’s protocol, and concentrated with Spin-X
UF 500 5K MWCO columns (Corning) to a concentration of 1.156 mg/ml before being flash-frozen in liquid nitrogen and stored at
−80°C. All three N-terminally 6xHis-tagged FUBP1 protein variants (FUBP1FL, FUBP1ΔN, FUBP1N74; Table S6) were expressed
overnight at 16°C using LB medium and 1 mM IPTG. Cells were lysed in lysis buffer (50 mM Tris-Cl, pH 8.0, 500 mM NaCl, 1 mM DTT, 5%
glycerol, EDTA-free cOmplete protease inhibitor cocktail), using a CF1 Cell Disrupter (Constant Systems). Lysates were cleared by
centrifugation (40,000 × g, 30 min, 4°C). Recombinant proteins were affinity-purified from cleared lysates using an NGC Quest Plus
FPLC system (Bio-Rad) and a HisTrap FF 5 ml column (Cytiva) according to the manufacturers’ protocols. Full-length FUBP1FL and
FUBP1ΔN proteins were diluted 1:10 in heparin binding buffer (30 mM Na-HEPES, 20 mM NaCl, 5% glycerol, 1 mM DTT, pH 7.4),
loaded onto a Heparin HP 5 ml column (Cytiva) and eluted over 15 column volumes using a linear gradient of 20–1000 mM NaCl
in the heparin binding buffer. All FUBP1 variants were concentrated using Amicon 15 ml spin concentrators (Merck Millipore) and
subjected to gel filtration (Superdex 200 16/60 pg in 30 mM Na-HEPES, 100 mM NaCl, 1 mM DTT, 5% glycerol, pH 7.4). Peak
fractions containing the recombinant proteins after gel filtration were pooled, and protein concentration was determined by using
absorbance spectroscopy and the respective extinction coefficient at 280 nm, before aliquots were flash-frozen in liquid nitrogen and
stored at −80°C. For the detailed workflow, log files can be requested from Dr. Julian König.
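Determining protein concentration from the absorbance at 280 nm follows the Beer-Lambert law, c = A / (ε · l). The sketch below is illustrative only: the function names are hypothetical, and the absorbance, extinction coefficient and molecular weight are made-up example values, not those of the proteins purified here.

```python
def molar_concentration(a280: float, extinction_coefficient: float, path_length_cm: float = 1.0) -> float:
    """Beer-Lambert law: concentration in mol/l from A280 and epsilon in M^-1 cm^-1."""
    if extinction_coefficient <= 0 or path_length_cm <= 0:
        raise ValueError("extinction coefficient and path length must be positive")
    return a280 / (extinction_coefficient * path_length_cm)


def mass_concentration_mg_per_ml(molar: float, molecular_weight_da: float) -> float:
    """Convert mol/l to mg/ml (numerically equal to g/l)."""
    return molar * molecular_weight_da


# Example values (assumptions): A280 = 0.5, epsilon = 50,000 M^-1 cm^-1, MW = 70 kDa
c_molar = molar_concentration(0.5, 50000.0)
c_mg_ml = mass_concentration_mg_per_ml(c_molar, 70000.0)
```

For these example inputs the result is 10 µM, or 0.7 mg/ml.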
Preparation of long in vitro transcripts
Long in vitro transcripts were prepared as described in Sutandy et al.28 Minigene and spike-in RNAs were created by PCR
amplification of DNA templates using Phusion High-Fidelity DNA Polymerase (New England Biolabs) according to the manufacturer’s
protocol. In vitro transcription of gel-purified PCR products was performed using the HiScribe T7 High Yield RNA Synthesis Kit (New England
Biolabs) according to the manufacturer’s instructions. RNA was isolated with the RNeasy Plus Mini Kit (Qiagen), followed by DNA
digestion with TURBO DNase and another RNA extraction. RNA quality was verified by capillary electrophoresis using High
Sensitivity RNA ScreenTape (Agilent). RNA concentration was measured with a Qubit RNA HS Assay Kit (Thermo Fisher). Aliquots of
equimolar mixes of all minigenes as well as spike-in aliquots were stored at −80°C.
In vitro iCLIP with long in vitro transcripts
In vitro iCLIP with long in vitro transcripts (ID: imb_koenig_2018_01_sub16) was performed for U2AF2RRM12 alone or supplemented
with different FUBP1 variants. The experiment was performed with a pool of eight in vitro transcripts (C4BPB, MPDZ, MYC, MYL6,
NF1, TENT2, PCBP2, and PTBP2, see GEO record GSE220183) as previously described.28 The in vitro transcripts were preheated for
5 min at 70°C to minimize RNA secondary structure. Then, in vitro transcripts at a final concentration of 2 nM were added to 50 nM
U2AF2RRM12 either alone (three replicates) or supplemented with either 50 nM FUBP1FL (two replicates), 50 nM FUBP1ΔN (two
replicates), or 50 nM FUBP1N74 (two replicates). The mixtures were incubated at 37°C for 5 min before UV irradiation at 50 mJ/cm². The
in vitro iCLIP reaction was spiked with 10 µl of crosslinked mixture containing 250 nM U2AF2RRM12 and 6 nM NUP133 in vitro
transcript for normalization.28 Partial RNase digestion and DNase treatment, followed by the standard iCLIP protocol, were performed as
described in the section "In vivo iCLIP". After reverse transcription, the cDNA was purified and libraries were generated according to
the iCLIP2 protocol.44
Preparation of oligo-derived transcripts
A total of 1,998 DNA oligonucleotides were chosen to represent 182-nt regions around 3′ splice sites, including the last 132 nt
upstream of a 3′ splice site and the first 50 nt of the downstream exon, preceded by 18 nt of T7 promoter sequence for the reverse
transcription. The genomic coordinates of all regions represented in the oligonucleotide library are listed in GEO record
GSE220183. The DNA oligonucleotides were purchased from TWIST Bioscience (South San Francisco, CA). Before in vitro
transcription, L3 adapter ligation was performed. This was achieved by resuspending the DNA pellet in T4 RNA ligase (New England Biolabs)
mix containing a 1:10 oligo/adapter ratio for high ligation efficiency. This mixture was incubated overnight at 16°C with shaking at 1300 rpm
and then inactivated at 98°C for 5 min. L3-ligated DNA oligonucleotide (2.6 ng) was amplified using the Phusion High-Fidelity DNA Polymerase
(New England Biolabs) according to the manufacturer’s protocol. Amplicons were purified twice using the ProNex Dual
Size-Selective Purification System (Promega) with an optimized bead/library ratio of first 1.1× and then 0.5. Capillary electrophoresis
with a High Sensitivity D1000 ScreenTape (Agilent) was used for quality control. Then, in vitro transcription was performed for 4 h
at 37°C by following the HiScribe T7 (New England Biolabs) protocol for short transcripts. Subsequently, RNA was treated with
TURBO DNase I and isolated using Qiagen’s protocol for "Total RNA containing small RNA from cells" (RNeasy Plus Mini Handbook,
Appendix E) with the reagents mentioned above.
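The geometry of the oligonucleotide design (132 nt of intron plus 50 nt of exon, preceded by an 18-nt T7 promoter) can be sketched as below. This is an illustration under stated assumptions: coordinates are taken as 1-based and inclusive, the function names are hypothetical, and the promoter shown is the widely used 18-nt minimal T7 promoter, which may differ from the sequence the authors ordered.

```python
# Common 18-nt minimal T7 promoter (assumption; the exact sequence used is not given in the text)
T7_PROMOTER = "TAATACGACTCACTATAG"


def splice_site_window(first_exon_nt: int, strand: str, intron_nt: int = 132, exon_nt: int = 50):
    """Return (start, end), 1-based inclusive, of the region around a 3' splice site.

    first_exon_nt is the genomic position of the first nucleotide of the downstream exon.
    """
    if strand == "+":
        return first_exon_nt - intron_nt, first_exon_nt + exon_nt - 1
    if strand == "-":
        # On the minus strand the intron lies at higher genomic coordinates than the exon
        return first_exon_nt - exon_nt + 1, first_exon_nt + intron_nt
    raise ValueError("strand must be '+' or '-'")


def build_oligo(region_seq: str) -> str:
    """Prepend the promoter to a 182-nt region given in transcript orientation."""
    if len(region_seq) != 182:
        raise ValueError("expected a 182-nt region")
    return T7_PROMOTER + region_seq


plus_start, plus_end = splice_site_window(100000, "+")
minus_start, minus_end = splice_site_window(100000, "-")
oligo = build_oligo("A" * 182)
```

Each designed oligonucleotide is then 200 nt long before adapter ligation, and the window length is 182 nt on either strand.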
In vitro iCLIP on oligo-derived transcripts
For in vitro iCLIP with an oligonucleotide-derived RNA pool (ID: imb_koenig_2018_01_sub12), the oligonucleotide-derived transcript
pool at a concentration of 50 nM was preheated for 5 min at 70°C and incubated with 50 nM U2AF2RRM12 alone or with either 50 or
300 nM FUBP1FL (three replicates each) for 10 min before UV irradiation at 50 mJ/cm². iCLIP was performed as described in the
section "In vivo iCLIP", omitting the partial RNase digestion and L3 linker ligation steps as they do not apply here. The reaction was
spiked with a mix of ten 150-nt-long spike-in oligonucleotides for normalization (oligos #44–#53; Table S5).
Sequencing and data preprocessing
In vitro iCLIP libraries were sequenced on an Illumina NextSeq 500 sequencer as 150-nt single-end reads including a 6-nt sample
barcode as well as 5+4-nt UMIs. The reads were bioinformatically preprocessed as described for in vivo iCLIP samples. The numbers
of uniquely mapped reads for all in vitro iCLIP samples are given in Table S1.
Protein expression and purification
All plasmids encoding sequences of FUBP1, U2AF2, chimeric U2AF2linker-RRM2/FUBP1N-box (linked by a 14 GS linker), SF1, SNRPA,
SNRPB, and PRPF40B were cloned into the pETM11 vector or pET24 vector with a His tag, His-GB1 tag, or His-protein A tag,
followed by a TEV cleavage site. The point mutants of FUBP1 were generated by site-directed mutagenesis. All constructs are listed in
Table S6.
Recombinant proteins were expressed in E. coli BL21 (DE3) cells in LB medium or M9 minimal medium supplemented with 1 g/l
15NH4Cl and 2 g/l 13C-glucose (uniformly labeled). After growth of the bacterial cells to an OD600 value of 0.8, protein expression was
induced with 1.0 mM IPTG followed by overnight expression at 18°C. After resuspension in 50 mM Tris, pH 8.0, 500 mM NaCl, 10 mM
imidazole (supplemented with lysozyme, 1 mg/ml DNase, 2 mM MgSO4, and protease inhibitor), the cells were lysed using a French
press. Cleared lysates were added to Ni–NTA resin, washed with 2 M NaCl and eluted with 500 mM imidazole. The His tag was
cleaved with His-tagged TEV protease at 4°C overnight. The protein was further purified by removing the cleaved His tag, uncleaved
protein and TEV protease from the desired protein on a second Ni–NTA column. All proteins were further purified by ion-exchange
chromatography on RESOURCE S or RESOURCE Q columns (Cytiva) (20 mM Tris, pH 8.0 or 20 mM sodium phosphate, pH 6.5,
gradient from 0 to 1 M NaCl in 10 column volumes) followed by size-exclusion chromatography on a HiLoad 16/600 Superdex 75
column (GE Healthcare) (20 mM sodium phosphate, pH 6.5, 150 mM NaCl).
NMR spectroscopy
All NMR samples (13C15N- or 15N-labeled, as appropriate) were measured at concentrations of 0.1–1 mM in NMR buffer (20 mM
sodium phosphate, pH 6.5, 50 mM NaCl, 2 mM DTT) containing 10% (v/v) D2O at 25°C on 900-, 800-, 600-, or 500-MHz Bruker Avance
NMR spectrometers (cryogenic triple-resonance gradient probes). The NMR spectra were processed with Topspin 3.5 (Bruker) or
NMRPipe93 and analyzed using NMRFAM-Sparky.94
Chemical shift assignment
Protein backbone assignments were obtained from standard HNCA, HNCACB, CBCA(CO)NH, HNHA backbone experiments.
Specifically, for KH domains, the 1H–15N HSQC spectrum of KH1–4 was first assigned, then corresponding assignments were transferred
to the spectra of the individual and tandem KH domains. Further side-chain resonances were assigned using CC(CO)NH, HCC(CO)
NH, hCCH-TOCSY and HcCH-TOCSY experiments. The distance restraints for structure calculations were obtained from 3D 15N-
and 13C-edited NOESY–HSQC experiments.117,118 Secondary structure propensities were derived from the difference of Cα and
Cβ chemical shifts to the random coil shifts.119–121
Relaxation experiments
15N-relaxation experiments were recorded on an 800 MHz Bruker Avance NMR spectrometer at 25°C and 15N T1 and T2 relaxation
times were acquired from pseudo-3D HSQC experiments in an interleaved manner with eight relaxation delays for T1 (20, 60, 100,
200, 400, 600, 800, 1200 ms) and nine relaxation delays for T2 (16.96, 33.92, 67.84, 101.76, 135.68, 169.6, 254.4, 305.28,
339.2 ms).122 Residual relaxation rates were obtained by fitting the data to an exponential function using NMRFAM-Sparky.94
Titrations
For NMR titrations, 1H–15N HSQC spectra were measured after each addition of titrant and the changes were visualized by calcu-
lating the CSP.123 The KD values were calculated from NMR titrations by plotting the CSP of selected peaks (8) against the ligand
concentration and fitting the data as previously described. Standard deviations of the mean were calculated from KD values of
the 8 selected peaks.124
Structure calculation
To stabilize the U2AF2 and FUBP1 interaction, a chimeric construct of U2AF2RRM2 and FUBP1N-box was introduced for the subse-
quent structure determination (Table S6). Overall structural integrity of the chimeric construct and recapitulation of the interaction
was confirmed by comparing 1H–15N HSQC spectra of the chimeric construct to that of the intermolecular complex U2AF2-
RRM2–FUBP1N-box (Figures S3I and S3J). CYANA3 (3.98.15) was used for automated NOE assignments and initial structure calcu-
lations.95 To overcome partial signal broadenings for the resonances at the interface of the two domains, possibly due to the weaker
affinity, additional unambiguous intramolecular distance restraints from 13C-NOESY–HMQC and methyl-NOESY spectra were
manually assigned and included in the structure calculation.125 A minimal number of typical hydrogen bonds, which were confirmed
by 15N-edited NOESY and secondary structure propensity, was implemented to assist the initial folding during the structure calcu-
lation. Dihedral angle restraints were derived from SSP and 13C secondary chemical shifts using TALOS+, including resonances of
Cα, Cβ, C, H, and N.96,126 For water refinement, distance restraints from CYANA3 with an error of ± 0.5 Å were used. Water
refinement127 of the 20 lowest-energy structures (500 initial structures) was performed with ARIA2.397 and CNS.128 The quality of
the 10 final structures was evaluated by ProcheckNMR98 and PSVS.99 Ensemble structure root mean square (r.m.s.) deviations
were calculated using MolMol100 and the ribbon representations were prepared in PyMOL (The PyMOL Molecular Graphics System,
version 1.8.6.0, Schrödinger, LLC). Structural statistics are shown in Table 1.
Molecular Cell 83, 2653–2672.e1–e15, August 3, 2023 e7
Scaffold-independent analysis
For the initial screening, the 16 DNA pools of 5-mer DNA (Table S5, #63, IDT), instead of RNA due to their similarity in binding, were
generated by introducing a specific nucleotide at a designated position while randomizing the other four positions. Titrations of
100 µM FUBP1 KH domain samples with the different DNA pools (0.5, 1.0, 2.0, and 4.0 molar equivalents of titrant to analyte)
were performed at 25°C in NMR buffer (20 mM sodium phosphate, pH 6.5, 50 mM NaCl, 2 mM DTT) containing 10% (v/v) D2O by
recording SOFAST HMQC spectra on a 600 MHz Bruker Avance NMR spectrometer (cryogenic triple-resonance gradient probe).
For the comparison and identification of position-specific nucleotide preference, we focused on a subset of 12 representative peaks,
which show visibly clear changes in chemical shift (fast-exchange regime) and are therefore involved in binding, for further analysis.
CSPs of these peaks were calculated (see above) and the average CSPs of all peaks for each pool were normalized against the
largest CSP calculated in the four pools to obtain a score for nucleotide preference at a specific position. The final optimized motifs
were verified by comparing the chemical shift changes upon adding either DNA or RNA for all KH domains (Table S5, #67–72).129
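
The position-specific preference score described above can be sketched as follows. This is a minimal illustration only: it assumes that the CSPs of the selected peaks are available as plain lists for each of the four pools at one position, and it reads "largest CSP calculated in the four pools" as the largest pool average. The function name and the example values are hypothetical.

```python
from statistics import mean

def nucleotide_preference(csp_by_pool):
    """Average the CSPs of the selected peaks per pool, then
    scale each average by the largest average among the pools."""
    averages = {nt: mean(csps) for nt, csps in csp_by_pool.items()}
    largest = max(averages.values())
    return {nt: value / largest for nt, value in averages.items()}

# Hypothetical CSPs (ppm) of two peaks for the four pools of one position
example = {
    "A": [0.02, 0.04],
    "C": [0.01, 0.01],
    "G": [0.03, 0.05],
    "T": [0.08, 0.12],
}
scores = nucleotide_preference(example)
```

Under these assumptions, the pool with the strongest average perturbation receives a score of 1 and all other pools a score between 0 and 1.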
In vitro binding assays
In vitro transcription
All RNA samples were in vitro transcribed using T7 RNA polymerase, precipitated by ethanol and purified by denaturing PAGE (12%
polyacrylamide gel containing 8 M urea). The DNA templates for in vitro transcription are shown in Table S5 (Oligos #59–62). The gel
slices were electro-eluted at 250 V in 0.5× TBE. To promote proper folding, the RNA samples were heated to 95°C for 2 min and
subsequently snap-cooled on ice before use.
Fluorescent EMSA
In vitro-transcribed RNA was fluorescently labeled by ligation of pCp-Cy5 to the 3’ end of the RNA with T4 RNA ligase 2. Subse-
quently, the reaction was purified using a spin column kit (Norgen Biotek Corp.). For binding studies, 100 nM labeled RNA in
20 mM sodium phosphate, pH 6.5, 50 mM NaCl and glycerol (15% final concentration) was incubated with increasing concentrations
of FUBP1N-box+KH1-4 (amino acids 1–457) for 15 min. Mixtures were loaded onto a 0.7% agarose gel. Gel electrophoresis was
performed in 1× TBE buffer at 40 V for 4 h. Detection was performed using a Typhoon 9200 (GE Healthcare Life Sciences) at 649 nm.
Data analysis was performed in Image J 2.1.0.102 Experiments were repeated to estimate the standard deviation of the mean.
Isothermal titration calorimetry
ITC experiments were performed on a MicroCal PEAQ-ITC instrument (Malvern Panalytical) using non-isotopically labeled proteins as
analyte sample and titrant or non-isotopically labeled protein as analyte and DNA oligonucleotides as titrant in NMR buffer at 25°C.
U2AF2 constructs (concentration 15–30 µM) were titrated with FUBP1 N-terminal constructs (concentration 1.5–3.0 mM); FUBP1
double-KH domain constructs (concentration 20–30 µM) were titrated with DNA oligonucleotides (concentration 200–350 µM,
Table S5, #64–66); in vitro-transcribed ssRNA (VPS13D, 15 µM) was titrated with FUBP1KH (150 µM). Binding affinity analysis was
performed using MicroCal PEAQ-ITC Analysis Software (Malvern Panalytical). The standard deviations of the KD values were
estimated based on the differences in triplicate measurements.
BRET
BRET plasmid construction
The donor and acceptor vectors pcDNA3.1-cmyc-NL-GW (Addgene plasmid ID #113446), pcDNA3.1-GW-NL-cmyc (Addgene
plasmid ID #113447), pcDNA3.1-GW-mCit, pcDNA3.1-mCit-GW, as well as controls pcDNA3.1-NL-cmyc (Addgene plasmid ID
#113442), pcDNA3.1-PA-mCit (Addgene plasmid ID #113443), and pcDNA3.1-PA-mCit-NL-cmyc (Addgene plasmid ID #113444)
were kindly provided by the Wanker group (Max-Delbrück-Centrum für Molekulare Medizin, Germany). The GATEWAY entry vectors
pDON221 and pDON223 were provided by the Vidal group (Dana Farber Cancer Institute, Boston, MA). All vectors were amplified
and full-length sequenced using the primers given in Table S5. Full-length wild-type ORFs being cloned into GATEWAY entry vectors
were amplified from a human ORFeome collection.130 The ORFs were full-length sequenced using primers shown in Table S5. ORFs
of FUBP1, SNRNP70, and TCERG1 (Table S6) were PCR-amplified with primers #9–10, #27–28, and #33–34, respectively (Table S5)
and shuttled into pDON223 using a BP clonase II mix kit (Invitrogen). The Q5 site-directed mutagenesis kit (Invitrogen) was used to
produce the following mutants: pDON223-FUBP1_A38D, pDON223-FUPB1_W586R_W615R, and pDON223-FUBP1_1-530aa
(Table S6). For BRET experiments, all cDNAs were shuttled from the entry vectors into the BRET destination vectors using LR clonase
technology (Invitrogen) according to the manufacturer’s protocol. After the LR cloning step, the inserts were partially sequence-
confirmed. All primers used are given in Table S5 and all the constructs are listed in Table S6.
Transfection
The human embryonic kidney 293 cells were transfected using Lipofectamine 2000 (Invitrogen) transfection reagent in Opti-MEM
medium (Thermo Fisher) using the reverse transfection method according to the manufacturer’s instructions. For BRET transfections,
cells were seeded at a density of 4.0 × 10^4 cells per well on a white 96-well microtiter plate (Greiner) in phenol-red-free, high-glucose
DMEM media (Thermo Fisher) supplemented with 5% FBS (Thermo Fisher). Transfections were performed with a total amount of
200 ng of DNA per well. If the amount of expression plasmid was less than 200 ng in a well, pcDNA3.1 (+) was used as a carrier
DNA to achieve the total of 200 ng.
Experiments
Cells were transfected with plasmids encoding the acceptor (50 ng DNA) and donor (1 ng DNA). The plate was incubated for 2 days at
37°C, 5% CO2, and 85% relative humidity prior to measurement. All measurements were performed on an Infinite M200 Pro microplate
reader (Tecan). First, 100 µl of the medium was aspirated from each well. The mCitrine fluorescence was measured in intact
cells (excitation/emission 513/548 nm). Then, coelenterazine h (PJK Biotech GmbH) was added at a final concentration of 5 µM.
The cells were briefly shaken and incubated for 15 min inside the plate reader. After incubation, total luminescence was measured
first followed by short-wavelength and long-wavelength luminescence measurements using BLUE1 (370–480 nm) and GREEN1
(520–570 nm) filters at 1,000 ms integration time. Corrected BRET (cBRET) ratios were calculated as previously described.58 In brief,
for every transfected protein pair NL-A and mCit-B, the following two control pairs were measured: NL-Stop with mCit-B and NL-A
with mCit-Stop. The maximal BRET from both control pairs was subtracted from the actual test pair to correct for donor bleed-
through, nonspecific binding to the tags, and background signal.
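
The control subtraction described above can be sketched as follows. This is a minimal illustration: the text only states that the larger of the two control BRET values is subtracted from the test pair. Computing the raw BRET ratio as long-wavelength over short-wavelength luminescence is a common convention that is assumed here, not taken from the text, and all function names are hypothetical.

```python
def bret_ratio(long_wavelength, short_wavelength):
    """Raw BRET ratio (assumed convention: acceptor channel / donor channel)."""
    return long_wavelength / short_wavelength

def corrected_bret(test_pair, nl_stop_with_mcit_b, nl_a_with_mcit_stop):
    """Subtract the larger of the two control BRET ratios from the test pair."""
    return test_pair - max(nl_stop_with_mcit_b, nl_a_with_mcit_stop)

# Hypothetical luminescence readings (arbitrary units)
test = bret_ratio(5000.0, 10000.0)       # NL-A with mCit-B
control_1 = bret_ratio(3000.0, 10000.0)  # NL-Stop with mCit-B
control_2 = bret_ratio(3500.0, 10000.0)  # NL-A with mCit-Stop
cbret = corrected_bret(test, control_1, control_2)
```

Taking the maximum of the two controls is the conservative choice: a test pair must exceed both background estimates to yield a positive cBRET.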
Saturation assay
For donor saturation experiments 1 ng of donor DNA encoding NL-fused proteins was co-transfected with increasing amounts of
acceptor DNA encoding mCitrine-fused proteins (10, 25, 50, 100, 200, 400 ng). Fluorescence, total luminescence, and BRET
were measured as described before. BRET measurements were corrected for bleed-through using NL-Stop transfections. Fluores-
cence and total luminescence measurements were used to estimate the amount of expressed proteins and used to plot acceptor/
donor ratios on the x-axis.
QUANTIFICATION AND STATISTICAL ANALYSIS
Preprocessing of RNA-seq data
Prior to genomic mapping, remaining adapter sequences were trimmed in RNA-seq data from FUBP1 KO, FUBP1-Nboxmut, and
WT control RPE1 cells using Cutadapt v2.4.108 A minimal overlap of 1 nt between reads and adapter was required and only
reads with a length of at least 50 nt after trimming were retained for further analysis (parameters: -O 1 -m 50). Reads were mapped
using STAR v2.6.1b,107 allowing up to 4% of the mapped bases to be mismatched (--outFilterMismatchNoverLmax 0.04
--outFilterMismatchNmax 999) and a splice junction overhang (--sjdbOverhang) of 83 nt for HeLa WT samples and of 158 nt for
FUBP1 KO, FUBP1-Nboxmut, and WT control RPE1 cells. Genome assembly and annotation of GENCODE131 release 31 were
used during mapping. Subsequently, secondary hits were removed using Samtools v1.9.109 Exonic read counts per gene were ex-
tracted using featureCounts from the Subread tool suite v1.6.2110 with non-default parameters --donotsort -s2.
Preprocessing of in vivo iCLIP data
Basic quality controls were conducted in FastQC v0.11.8 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc) and reads
were filtered based on sequencing qualities (Phred score) in the sample barcode and UMI regions using the FASTX-Toolkit
v0.0.14 (http://hannonlab.cshl.edu/fastx_toolkit/) and seqtk v1.3 (https://github.com/lh3/seqtk/). All reads with a Phred score below
10 in the sample barcode or UMI regions were discarded. Reads were de-multiplexed based on the sample barcode, which is found
on positions 6–11 of the reads (for 6-nt sample barcodes) or on positions 4–7 (for a 4-nt sample barcode), using Flexbar v3.4.0.111
Subsequently, barcode regions and adapter sequences were trimmed from read ends using Flexbar, requiring a minimal overlap of
1 nt of read and adapter and adding UMIs to the read identifiers. Reads shorter than 15 nt were discarded. All empty space and slash
characters were removed from read identifiers in FASTQ files to prevent all information following them being lost during mapping. The
downstream analysis was done as described in Chapters 3.4, 4.1, and 4.2 of Ref. 132. Genome assembly and annotation of
GENCODE131 release 31 were used during mapping with STAR v2.6.1b.107 The number of crosslinking events and peaks is given
in Table S1. To assess the genomic distribution of iCLIP crosslink nucleotides, we used the following hierarchy: ncRNA > 3’
UTR > 5’ UTR > coding sequence (CDS) > 3’ splice site > 5’ splice site > intron > intergenic (Figure 1B). 3’ and 5’ splice site regions
refer to 100 nt upstream/downstream. All other "deep-intronic" regions are called intronic regions.
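
The hierarchy above amounts to assigning each crosslink nucleotide to the highest-ranking region it overlaps. A minimal sketch, assuming the set of overlapping annotation types has already been determined for each nucleotide (the function name and the label strings are hypothetical):

```python
# Region types ordered from highest to lowest priority, as listed in the text
HIERARCHY = [
    "ncRNA", "3' UTR", "5' UTR", "CDS",
    "3' splice site", "5' splice site", "intron", "intergenic",
]

def assign_region(overlapping_types):
    """Return the highest-priority region type among those overlapped."""
    for region in HIERARCHY:
        if region in overlapping_types:
            return region
    return "intergenic"  # nothing annotated at this position
```

For example, a nucleotide that overlaps both an intron and a coding sequence of another isoform would be counted as CDS.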
Metaprofiles for in vivo iCLIP data
Four RNA-seq replicates from HeLa cells (imb_koenig_2018_18) served as the source for the identification of spliced introns. Map-
ping to the genome was performed in STAR v2.6.1b107 (Table S1). Coordinates and number of unique supporting junction reads
("ureads") of spliced introns were extracted from the SJ file output by STAR containing high-confidence splice junctions. In the
following, introns from the SJ file are called "SJ introns". SJ introns had to meet a reproducibility criterion (at least 3 out of 4 repli-
cates). In addition, all overlapping SJ introns were removed. Finally, introns were overlaid with GENCODE release 31 annotation
and filtered for level < 3, transcript support level < 4, and gene_type and transcript_type equal to "protein coding". This resulted
in 88,375 SJ introns. Branch point (BP) prediction was taken from LaBranchoR.133 LaBranchoR is based on hg19, liftOver to hg38
was done with the liftOver tool by UCSC.134 The median distance of BP to 3’ splice sites was 25 nt. 88,008 out of 88,375 SJ introns
had an annotated BP. Introns were further filtered for a minimum length of 100 nt and a maximum length of 17,000 nt. Metaprofiles
were aligned at the BP. In vivo iCLIP replicates for each RBP were summed up and a signal threshold of 10 in the metaprofile region
(−200 nt to +50 nt with respect to the BP) was imposed. Crosslinking signals per intron were normalized by "ureads" and averaged
per nucleotide over all introns. For display, the normalized signal was smoothed with a Gaussian window function and window size
10. Binding enrichment for RNA maps stratified by intron length and splice site features was calculated by taking the log2 fold change
of the ratio of the area under the curve (AUC) of each feature bin to the AUC in the shortest intron class or class with the weakest splice
site feature. The following regions were used for AUC quantification, always with respect to BP: [−100, −25] for FUBP1, [+5, +25] for
U2AF2, [−10, +10] for SF1, and [−30, −10] for SF3B1. The minimum signal in each region served as a background proxy and was
taken as the lower horizontal boundary in which the AUC was calculated. For RNA maps stratified by GC content, the average GC
content in the exon was contrasted to the average GC content in the first 100 nt of the downstream intron. Signal values for RNA maps
aligned at 5’ splice sites were not smoothed but normalized by the average signal in the first 100 nt of the intron. RNA maps conditioned
on exon rank: annotation of exons and downstream introns was extracted from GENCODE release 31. BPs were annotated as
described above. SJ introns were matched to introns. Duplicated matches were resolved such that the intron with the shortest upstream
exon was taken. Five exon rank classes were extracted: 1st exon, exon ranks in [2,5), [5,12), [12, 144] and second to last exon.
In contrast to all other RNA maps, crosslinking signals per intron were normalized to the total crosslinking signal in the last 100 nt
upstream of the 3’ splice site. "ureads" correlates with exon rank and was thus not suitable as a normalization factor. RNA maps
conditioned on exon GC content: upstream exons were identified as for exon rank RNA maps. Total exon GC content over exon
length was extracted. Bins were as follows: [0.07,0.41], (0.41, 0.46], (0.46, 0.53], (0.53, 0.6], (0.6, 0.91]. RNA maps conditioned on intron
GC content: total intron GC content over intron length was extracted. Bins were as follows: [0.14, 0.36], (0.36, 0.4], (0.4, 0.46], (0.46,
0.55], (0.55, 0.9]. RNA maps for fixed intron length/differential GC content architecture followed by subsequent conditioning on dif-
ferential GC content/intron length. Here, RNA binding profiles were first stratified on one class of intron length/differential GC content
architecture, followed by stratification on all levels of the other factor. Binding for all RNA maps was quantified based on AUC as
described above. Analyses were performed in R v4.1.1.104
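
The AUC-based binding enrichment described above can be sketched as follows. This is a minimal illustration, assuming the smoothed metaprofile values of the protein-specific region are given as one value per nucleotide and approximating the AUC as the sum of the signal above the regional minimum. Function names and example values are hypothetical, and the original analysis was performed in R.

```python
from math import log2

def auc_above_minimum(signal):
    """Area between the profile and its minimum (background proxy),
    approximated as a sum over nucleotides at unit spacing."""
    floor = min(signal)
    return sum(value - floor for value in signal)

def binding_enrichment(feature_bin, reference_bin):
    """log2 ratio of the AUC in a feature bin to the AUC in the reference
    bin (shortest intron class or weakest splice site feature)."""
    return log2(auc_above_minimum(feature_bin) / auc_above_minimum(reference_bin))

# Hypothetical profiles in the quantification region of one RBP
longer_introns = [1.0, 3.0, 5.0, 3.0, 1.0]
shortest_introns = [1.0, 2.0, 3.0, 2.0, 1.0]
enrichment = binding_enrichment(longer_introns, shortest_introns)
```

Using the regional minimum as the lower boundary means that a constant offset between profiles does not contribute to the enrichment.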
iCLIP binding site definition (peak calling)
Binding site definition for in vivo iCLIP was done with PureCLIP v1.3.1 on merged replicates.112 PureCLIP was run with the options
-iv 'chr1;chr2;chr3;' -ld -nt 4. The crosslink sites identified by PureCLIP were post-processed as previously described.132 In
detail, individual crosslink sites within a distance of 5 nt were clustered into binding regions. The binding regions were resized to
obtain binding sites of a uniform width. To compare binding sites of different RBPs, we opted for 5-nt binding sites (i.e., 2 nt on either
side of the position with the maximum signal) for all of the RBPs investigated (FUBP1, U2AF2, SF3B1, SF1, PTBP1). Isolated crosslink
sites and binding regions of 2 nt were removed. Binding regions ≤ 5 nt were centered on the position with the maximum crosslink
signal and extended by 2 nt on either side. Binding regions > 5 nt were divided into regions of 5 nt, by iteratively screening for the
maximum signal and extending by 2 nt on either side, excluding an overlap between binding regions. Finally, at least three positions
with crosslink events were required to only keep binding sites with sufficient support. To ensure sufficient support of binding sites in
the individual replicates of the experiment, a reproducibility filter was applied. In order to consider the varying number and size of
replicates for each experiment, we filtered for those binding sites with a total number of crosslink events higher than the 10% percentile
of the distribution of crosslink counts in the single replicate. In addition, a minimum of two crosslink events was required if the 10%
percentile in the replicate was below this threshold. This was required in at least two out of three, three out of four and three out of five
replicates depending on the number of replicates available for the respective experiment. The numbers of called binding sites per
protein are given in Table S1.
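
The clustering and resizing steps described above can be sketched as follows. This is a simplified illustration and not the published pipeline: it assumes crosslink counts are given per position for the PureCLIP-called sites of one transcript, treats "within a distance of 5 nt" as a gap of at most 5 nt between neighboring sites, and omits the replicate reproducibility filter. All function names and parameters are hypothetical.

```python
def cluster_sites(positions, max_gap=5):
    """Group crosslink positions whose neighbors lie within max_gap nt."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

def define_binding_sites(counts, max_gap=5, flank=2, min_positions=3):
    """counts: dict mapping position -> crosslink events.
    Returns 5-nt binding sites as (start, end) tuples, inclusive."""
    sites = []
    for cluster in cluster_sites(counts, max_gap):
        if cluster[-1] - cluster[0] + 1 <= 2:
            continue  # drop isolated sites and 2-nt regions
        taken = []
        # Screen positions from strongest to weakest signal
        for center in sorted(cluster, key=lambda p: (-counts[p], p)):
            start, end = center - flank, center + flank
            if any(start <= e and s <= end for s, e in taken):
                continue  # would overlap an already defined site
            taken.append((start, end))
        for start, end in taken:
            support = sum(1 for p in range(start, end + 1) if counts.get(p, 0) > 0)
            if support >= min_positions:
                sites.append((start, end))
    return sorted(sites)

# Hypothetical crosslink counts on one transcript
example_counts = {100: 2, 101: 9, 102: 3, 104: 1, 120: 4, 200: 1, 201: 1}
binding_sites = define_binding_sites(example_counts)
```

In this example only the region around position 101 survives: the isolated site at 120 and the 2-nt region at 200–201 are removed, and the remaining 5-nt site has three supported positions.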
Saturation analysis
Spliced introns were identified from four RNA-seq replicates in HeLa cells (imb_koenig_2018_18) as described above. Introns were
retained if they were longer than 200 nt, and if the 5’ splice site windows (the last 50 nt of the exon plus the first 75 nt of the intron) and
3’ splice site windows (the last 200 nt of the intron plus the first 20 nt of the exon) were not overlapping. 3’ splice sites overlapping with
noncoding and long noncoding RNAs were excluded, resulting in 98,328 3’ splice sites. These splice sites were binned into percentiles,
based on "ureads" (splice site usage) averaged over replicates. RBP binding sites were assigned to curated 3’ splice sites (the
last 200 nt of the intron), requiring full overlap. For each bin, the percentage of 3’ splice sites with at least one binding site for the
specific RBP was calculated (Figure 1E).
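
The binning step of the saturation analysis can be sketched as follows. This is a minimal illustration, assuming one usage value ("ureads" averaged over replicates) and one bound/unbound flag per 3’ splice site; equal-sized rank bins are used as a stand-in for percentiles, and the function name is hypothetical.

```python
def bound_percentage_per_bin(usage, bound, n_bins=100):
    """Rank splice sites by usage, split them into n_bins equal-sized bins
    and return the percentage of bound sites in each bin."""
    order = sorted(range(len(usage)), key=lambda i: usage[i])
    percentages = []
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        members = order[lo:hi]
        if members:
            percentages.append(100.0 * sum(bound[i] for i in members) / len(members))
        else:
            percentages.append(None)  # bin is empty when there are few sites
    return percentages

# Hypothetical data: eight splice sites, four bins
usage = [1, 2, 3, 4, 5, 6, 7, 8]
bound = [False, False, False, True, False, True, True, True]
saturation = bound_percentage_per_bin(usage, bound, n_bins=4)
```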
Motif enrichment for in vivo iCLIP
Introns were defined based on GENCODE annotation (release 31). Annotation was filtered for level < 3, transcript support level < 4,
and gene_type and transcript_type equal to "protein coding", resulting in 202,623 introns. BP annotation was done as specified
above. 200,199 out of 202,623 had an annotated BP. Introns were further filtered for overlaps and for having a length of at least
250 nt upstream of the defined BP. The length requirement was set to ensure that the main position of FUBP1 binding was not
confounded with the 5’ splice site signal. FUBP1 binding sites (n = 854,404) were filtered for positioning within a 150-nt window
upstream of the BP, resulting in 167,408 binding sites. Binding sites were ranked by their normalized signal, that is, the signal in the
extended binding site (5 nt ± 5 nt) over total intron signal over intron length. Disjunct 4-mer frequencies were counted in the top/bot-
tom 20% binding sites based on normalized signal to account for overall crosslinking preferences. Additionally, non-bound intronic
regions in introns hosting the top 20% FUBP1 binding sites were also considered as an alternative background set. Here, disjunct
4-mer frequencies were calculated for all non-bound intronic regions, excluding a 20-nt region downstream of the 5’ splice site and a
150-nt region upstream of the BP. Enrichment was defined as the distance from each data point to the diagonal in a scatterplot
comparing the top 20% versus bottom 20% binding sites and, alternatively, non-bound intronic sequences. Analyses were per-
formed in R v4.1.1.104
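
The enrichment measure described above can be sketched as follows. This is a minimal illustration, assuming that "disjunct" means non-overlapping occurrences, that frequencies are counts divided by the total 4-mer count of the set, and that the distance to the diagonal is signed (positive when a 4-mer is more frequent in the top set). The original analysis was performed in R; all names below are hypothetical.

```python
from itertools import product
from math import sqrt

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def disjunct_count(sequence, kmer):
    """Number of non-overlapping occurrences of kmer (str.count semantics)."""
    return sequence.count(kmer)

def kmer_frequencies(sequences):
    """Relative frequency of each 4-mer across a set of sequences."""
    counts = {k: sum(disjunct_count(s, k) for s in sequences) for k in KMERS}
    total = sum(counts.values())
    return {k: (c / total if total else 0.0) for k, c in counts.items()}

def distance_to_diagonal(x, y):
    """Signed perpendicular distance of the point (x, y) to the line y = x."""
    return (y - x) / sqrt(2)

def enrichment(top_sequences, background_sequences):
    top = kmer_frequencies(top_sequences)
    background = kmer_frequencies(background_sequences)
    return {k: distance_to_diagonal(background[k], top[k]) for k in KMERS}
```

With this sign convention, 4-mers above the diagonal of the scatterplot (top 20% on the y-axis, background on the x-axis) obtain positive enrichment values.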
Motif enrichment upstream of branch points
Introns were extracted and BP annotated as above (200,199 introns left). Introns were further filtered for a minimum length of 500 nt
and disjunct 200-nt windows upstream of the BP, resulting in 151,836 introns. Disjunct 4-mer frequencies were calculated in a po-
sition-wise manner in a 200-nt window upstream of the BP. Average background motif frequencies were calculated in a 100-nt long
window 100 nt downstream of the 5’ splice site. Enrichment was defined as the distance from each data point to the diagonal in the
scatterplot of position-wise frequencies versus average background frequencies.
Abundance of FUBP1 motif at 3’ splice sites
Disjunct motif occurrences were counted in a 75-nt long window 25 nt upstream of the BP. The background distribution was derived
as the occurrences of nine randomly drawn motifs of length 4, repeated 100 times.
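
The random-motif background can be sketched as follows. This is a minimal illustration, assuming that the nine motifs are drawn without replacement from all 256 possible 4-mers and that occurrences are counted per motif without overlap; neither detail is specified in the text. The fixed seed and all names are hypothetical.

```python
import random
from itertools import product

ALL_4MERS = ["".join(p) for p in product("ACGT", repeat=4)]

def count_motifs(sequence, motifs):
    """Sum of non-overlapping occurrences of each motif in the sequence."""
    return sum(sequence.count(m) for m in motifs)

def background_counts(sequence, n_motifs=9, repeats=100, seed=0):
    """Occurrences of n_motifs randomly drawn 4-mers, repeated 'repeats' times."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [
        count_motifs(sequence, rng.sample(ALL_4MERS, n_motifs))
        for _ in range(repeats)
    ]
```

The observed count for the actual motif set can then be compared with the distribution returned by `background_counts` for the same 75-nt window.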
Analysis of in vitro iCLIP data
All samples were merged for binding site definition (peak calling) across replicates and conditions. Each in vitro transcript was divided
into 9-nt windows, always shifted by one nucleotide. Windows were sorted by total signal and, while excluding overlapping windows,
used to generate a candidate peak list. A negative binomial distribution was fit (maximum likelihood fit) to the signals on the candidate peak list. All
peaks with a total signal exceeding the 90% quantile of the theoretical distribution were retained for final processing (109 peaks, see
GEO record GSE220183). The background ranges were the in vitro transcript regions minus extended peaks (9 nt ± 5 nt). For quan-
tifying the binding differences between conditions, replicates were averaged. Peak signals were normalized against background
signals. RNA maps were based on 21 3’ splice sites present in the in vitro transcripts. To correct for differences in expression,
nucleotide-wise signals were normalized by total in vitro transcript signals. Subsequently, signals were summarized per nucleotide
by the 75% quantile. Replicates were averaged and subjected to Gaussian window smoothing with window size 10 before display. All
analyses were performed in R v4.1.1.104
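
The generation of the candidate peak list can be sketched as follows. This is a minimal illustration of a greedy selection of non-overlapping 9-nt windows by decreasing total signal, which is one plausible reading of the description; the subsequent negative binomial fit and quantile cutoff are not shown. The function name and example signal are hypothetical.

```python
def candidate_peaks(signal, width=9):
    """Slide a window of the given width by 1 nt, rank windows by total
    signal and keep each window that overlaps no previously kept one.
    Returns (start, total_signal) tuples in order of selection."""
    windows = [
        (sum(signal[i:i + width]), i)
        for i in range(len(signal) - width + 1)
    ]
    windows.sort(key=lambda w: (-w[0], w[1]))  # strongest first, leftmost on ties
    chosen = []
    for total, start in windows:
        if all(abs(start - s) >= width for s, _ in chosen):
            chosen.append((start, total))
    return chosen

# Hypothetical per-nucleotide crosslink signal on a 30-nt transcript
example_signal = [0] * 30
example_signal[5] = 10
example_signal[20] = 4
peaks = candidate_peaks(example_signal)
```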
Analysis of oligo in vitro iCLIP
All data was normalized according to the total signal of all available spike-ins. Values were then extracted either per nucleotide or by
binding site. Binding site positions were taken from overlays with in vivo U2AF2 binding sites in the intronic part of the oligonucleotide.
1,831 oligonucleotides harbored a U2AF2 binding site in the intronic part (see GEO record GSE220183). If multiple binding sites
were present, that with the highest average signal in the U2AF2 samples was taken as representative. For quantifying the addition
of FUBP1 on U2AF2 binding sites, only those binding sites with signal greater than the 25% quantile in one of the three replicates
were considered, resulting in 1,504 binding sites. The absolute number of disjunct occurrences of the FUBP1 motif set ("TTTT"
and all combinations of "TTT" and either one "A" or one "G") was counted in a 75-nt long region located 25 nt upstream of the
BP. All analyses were performed in R v4.1.1.104
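
The FUBP1 motif set can be enumerated as follows. This sketch assumes that "all combinations of TTT and either one A or one G" means every 4-mer containing three T and a single A or G at any position, which yields nine motifs together with "TTTT". The function names are hypothetical.

```python
from itertools import permutations

def fubp1_motif_set():
    """'TTTT' plus every 4-mer with three T and one A or one G."""
    motifs = {"TTTT"}
    for other in "AG":
        motifs.update("".join(p) for p in permutations("TTT" + other))
    return sorted(motifs)

def count_motif_set(sequence, motifs):
    """Sum of non-overlapping occurrences of each motif in the sequence."""
    return sum(sequence.count(m) for m in motifs)

MOTIFS = fubp1_motif_set()
```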
Intron length analyses of RNA-seq data
Splicing changes of FUBP1 KO and FUBP1-Nboxmut were analyzed with MAJIQ v2.2135,136 with default parameter settings. MAJIQ
outputs local splice variations (LSV), which were filtered as follows: for each LSV, the top two junctions in terms of absolute difference
in junction usage (delta percent selected index, |ΔPSI|) were taken as representative LSVs. At least one of these two junctions needed
to have an absolute ΔPSI > 0.1 and a detection probability > 0.9 (skipped for control events). Subsequently, events were filtered for
exon-skipping events. Each cassette exon was then annotated with the upstream and downstream intron: genomic coordinates of
the upstream/downstream intron were immediately defined in "source"/"target" events. The genomic coordinates of the respective
other intron were extracted from annotation (GENCODE release 31). Overlapping cassette exons were resolved such that the event
with the largest |ΔPSI| was retained (Table S3). A two-tailed Wilcoxon rank-sum test was used to assess statistical significance.
ENCODE data analysis
We retrieved raw RNA-seq data derived from an shRNA-knockdown experiment for FUBP1 in the cell line K562 from the ENCODE
data portal (https://www.encodeproject.org/), using accession numbers ENCSR608IXR (FUBP1 KD) and ENCSR260BQC (control).
Alignment was performed in STAR (version 2.7.8a)107 with standard ENCODE options. We applied MAJIQ v2.3135,136 to identify and
quantify cassette exons in the RNA-seq data. First, a splice graph was built on the BAM files and the GENCODE gene annotation (v38,
human genome version hg38). Then, the difference in junction usage between knockdown and control samples was calculated (as
ΔPSI). Next, alternative splicing events such as cassette exons (CEs) were categorized and quantified in the splicing graph using
MAJIQ Modulizer. Probabilities were calculated for each junction, testing for |ΔPSI| > 0.05 (probability changing [Ps]) and
|ΔPSI| < 0.02 (probability non-changing [Pn]). The MAJIQ Modulizer output was then processed in R, filtering for significantly regulated
CEs and a control group with unregulated CEs. A CE is defined as significantly regulated if |ΔPSI| ≥ 0.055 for all junctions, Ps ≥ 0.9 for
at least one junction pair (inclusion junction + skipping junction), the sign within both junction pairs is inverse, and within the junction
pairs the lower |ΔPSI| is at least 50% of the higher |ΔPSI|. A CE is considered to be unregulated if Pn ≥ 0.5 and |ΔPSI| ≤ 0.02 for all
junctions. Overall, this resulted in a total of 173 significantly regulated CEs and a control group with 1,910 unregulated CEs for further
analysis. To categorize CEs into more included and less included, a representative ΔPSI was chosen for each CE based on the
maximum |ΔPSI| of both inclusion junctions. Based on this, there were 30 more-included and 143 less-included exons.
Splicing changes upon FUBP1 LoF mutations
Significant differentially spliced exon-skipping events upon (i) loss-of-function (LoF) mutations of FUBP1 in low-grade gliomas
(37 events), (ii) in FUBP1 siRNA knockdown in U87MG cells (109 events) and (iii) LoF mutations of other splicing factors (433)
were extracted from Seiler et al.1 Junction lengths comprise the upstream intron, the skipped exon and the downstream intron. A
two-tailed Wilcoxon rank sum test was used to assess statistical significance.
Mutations in FUBP1 in cancer patients
We searched multiple databases to identify disease-related mutations within the FUBP1 gene. We focused on the minimal binding
interface to U2AF2 (FUBP1 amino acids 25–56) to find mutations that potentially abolish the interaction with the U2AF2 RRM2
domain. The following databases were used: ICGC Data Portal,137 cBioPortal,138,139 Exac,140 Cosmic,141 GDC Data Portal,142 gno-
mAD,140 and ClinVar.143 All cancer-related mutations in FUBP1 in the observed region and the underlying cancer type are listed in
Figure S4B.
Scoring of splice site features
3’ and 5’ splice site strength was scored with MaxEnt scan.144 Py tract strength was determined as follows: a 39-nt region upstream
of the AG dinucleotide at the 3’ splice site was screened with sliding windows of increasing length (width 5–30 nt) to identify the window
with the highest Py tract strength. The Py tract strength of each window was calculated as the χ² test statistic with 1 degree of
freedom, comparing the observed number of pyrimidines with the expected number based on the assumption of a uniform nucleotide
distribution. In addition, candidate Py tracts were required to end within 10 nt upstream of the AG dinucleotide. Using this approach,
the median length of identified Py tracts was 16 nt. BP strength was assessed according to the U2 binding energy, that is, the number
of hydrogen bonds between the candidate sequences and the BP binding sequences in the U2 snRNA. Hydrogen bonds form be-
tween A:T (2 bonds), G:C (2 bonds), and G:U (1 bond; in fact also 2 bonds, but punished for being a wobble base pair) with the BP
nucleotide bulging out and being omitted from the pairings. The Vienna RNA package v2.4.17113 (RNAduplex) was used to determine
the optimal hybridization structure between U2 snRNA sequences (GUGUAGUA) and the motif (position −5 to +3, excluding the BP
nucleotide). Predicted binding energy was the determined sum of hydrogen bonds forming between complementary motifs and U2
snRNA nucleotides.
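
The Py tract scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions: the χ² statistic compares pyrimidine and purine counts with an expected 50:50 split (uniform nucleotide distribution, 1 degree of freedom), only pyrimidine-enriched windows are considered, and the input is the sequence directly upstream of the AG dinucleotide with its last character adjacent to the AG. Function names are hypothetical.

```python
def chi_square_pyrimidine(window):
    """Chi-square statistic (1 df) for pyrimidine vs purine counts against
    the 50:50 expectation of a uniform nucleotide distribution."""
    n = len(window)
    observed_py = sum(base in "CTU" for base in window)
    expected = n / 2
    return ((observed_py - expected) ** 2 / expected
            + ((n - observed_py) - expected) ** 2 / expected)

def best_py_tract(region, min_width=5, max_width=30, max_gap_to_ag=10):
    """Return (score, start, end) of the best-scoring window, end exclusive,
    or None if no pyrimidine-enriched window qualifies."""
    best = None
    for width in range(min_width, min(max_width, len(region)) + 1):
        for start in range(len(region) - width + 1):
            end = start + width
            if len(region) - end > max_gap_to_ag:
                continue  # tract must end within 10 nt upstream of the AG
            window = region[start:end]
            if 2 * sum(base in "CTU" for base in window) <= width:
                continue  # assumption: skip windows not enriched in pyrimidines
            score = chi_square_pyrimidine(window)
            if best is None or score > best[0]:
                best = (score, start, end)
    return best

# Hypothetical sequence upstream of the AG dinucleotide
example_region = "GAGAGAGAGA" + "TTTTCTCTTT" + "A"
best_tract = best_py_tract(example_region)
```

For a window that consists of pyrimidines only, the statistic equals the window length, so longer uninterrupted tracts score higher than shorter ones.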
Evolutionary analyses
We annotated the domain architecture of FUBP1 using the function annoFAS provided in the FAS package106 (https://github.com/
BIONF/FAS). The domain architecture-aware phylogenetic profile of FUBP1 across 174 mammals, 274 non-mammalian vertebrates,
277 invertebrates, 410 fungal species, 94 protozoa, and 145 plants was generated with the targeted ortholog search tool
fDOG (https://github.com/BIONF/fDOG)145 using the human FUBP1 (UniProt: Q96AE4) as a seed. fDOG was run with the options
--minDist class, --maxDist phylum, --checkCoorthologsRef, and --countercheck. Homo sapiens (GenBank: GCF000001405) served
as the reference taxon. Intron length and GC content information was extracted based on the respective gff and fasta files down-
loaded from NCBI RefSeq Genome. Intron length estimates and motif searches were performed in R v4.0.5. A/B box presence in
the human proteome was determined as follows: in brief, we used the shell command grep to search for the regular expression
"[ST][AK][QA]W..YY[RK]" in 19,519 human proteins encoded in the NCBI RefSeq Genome assembly GCF_000001405.39. The result-
ing three hits were NCBI: XP_011540693.1 (FUBP1, 2 motif instances), NCBI: NP_003925.1 (FUBP3, 1 motif instance), and NCBI:
NP_001353228.1 (KHSRP, 3 motif instances). For counting FUBP1 motif occurrences across species, intron definitions were extracted
for all the species investigated and motifs were counted in a 25-nt window located 25 nt upstream of the 3’ splice site.
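
The A/B box search can be reproduced in outline with the same regular expression. The sketch below uses Python instead of the shell command grep used in the original analysis, and the short test sequences are hypothetical rather than real protein sequences.

```python
import re

# Regular expression for the A/B box as given in the text
AB_BOX = re.compile(r"[ST][AK][QA]W..YY[RK]")

def count_ab_boxes(protein_sequence):
    """Number of non-overlapping A/B box motif instances in a sequence."""
    return len(AB_BOX.findall(protein_sequence))

def proteins_with_ab_box(proteins):
    """proteins: dict mapping identifier -> sequence.
    Returns identifiers with at least one motif instance and their counts."""
    hits = {}
    for name, sequence in proteins.items():
        n = count_ab_boxes(sequence)
        if n > 0:
            hits[name] = n
    return hits
```

Unlike grep, which reports matching lines, this version also returns the number of motif instances per protein.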
Analysis of RBP crosslinking to snRNAs
In vivo iCLIP data from FUBP1, U2AF, SF1, SF3B1, and PTB was remapped to a custom database consisting of snRNAs, tRNAs, and
rRNAs using STAR v2.7.3a.107 Specifically, RNU1-1, RNU2-1, RNU4-1, RNU6-1, RNU5D-1, RNU7-1, RNU11, RNU12, RNU4ATAC,
and RNU6ATAC were included. tRNA coordinates were retrieved from GtRNAdb (data release 19). "hg38-tRNAs.fasta", containing
429 high-confidence tRNA annotations, was downloaded. Because tRNAs are quite similar when stratified on their carried amino
acid, one representative tRNA was selected per amino acid (tRNA with "1-1" in the name). In summary, this resulted in 22 tRNAs.
Finally, the following rRNAs were added: 12S_gi, 16S_gi, 18S_gi, 28S_gi, 5.8S_gi, and 5S_gi. Mapping steps were performed as fol-
lows: all sequences were furnished with one additional base upstream of the sequence with the rationale of being able to display
iCLIP coverage of reads starting directly at the 5’ end of the sequence. tRNAs and snRNAs were furnished with the actual base up-
stream of the sequence. rRNAs were furnished with an "N". Reads were mapped per replicate with STAR v2.7.3a using the settings
described above for in vivo iCLIP samples. Few reads were mapped to the minus strand and thus removed. Uniquely mapping reads
were subjected to duplicate removal based on identical UMIs (--method unique) using UMI-tools v1.0.0.146 Based on the remaining
reads, iCLIP coverage profiles were exported as well as count tables containing the number of reads overlapping the genomic ranges
of the defined RNAs.
e12 Molecular Cell 83, 2653–2672.e1–e15, August 3, 2023
Subnuclear distribution of FUBP1-bound genes
The subnuclear spatial distribution for introns in HeLa cells was taken from Tammer et al.64, in which Chrom3D, a 3D genome-
modeling tool that integrates 3D Hi-C data and ChIP-seq data, was used to assign distances from the nuclear center for topologically
associated domains. The distance from the nuclear center is described by five concentric radial scopes where 1-to-5 point to the
center–periphery axis. Our in vivo iCLIP data from SF3B1, FUBP1, and U2AF2 was then overlaid with the reported introns and the
percentage of bound introns was counted. Enrichment was calculated as the percentage of bound introns in each radial scope
compared to the first.
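The enrichment calculation can be illustrated with a short Python sketch. It assumes that "compared to the first" means a ratio of percentages relative to radial scope 1. The text does not state whether a ratio or a difference was used, so this is an assumption, and the intron counts below are invented for illustration.

```python
def enrichment_by_scope(bound, total):
    """Percentage of bound introns per radial scope, expressed relative to scope 1.

    bound, total: dicts mapping radial scope (1..5) to intron counts.
    Assumption: enrichment is the ratio of percentages (scope / scope 1).
    """
    percent = {scope: 100.0 * bound[scope] / total[scope] for scope in total}
    return {scope: percent[scope] / percent[1] for scope in sorted(percent)}

# Hypothetical counts, not taken from the study.
total_introns = {1: 200, 2: 400, 3: 300, 4: 200, 5: 100}
bound_introns = {1: 50, 2: 120, 3: 90, 4: 40, 5: 10}
print(enrichment_by_scope(bound_introns, total_introns))
```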
Mathematical modeling
Topology of the exon definition model
Splicing reactions are catalyzed by the spliceosome, which recognizes splice site sequences and forms a catalytically active higher-
order complex across introns. To model this process, we considered that human spliceosomes frequently operate by a so-called
"exon definition" mechanism, in which the pioneering spliceosome subunits U1 and U2 cooperatively bind to splice sites flanking
an exon before the final cross-intron complex is formed during spliceosome maturation.86 Because the initial binding of U1 and
U2 plays a decisive role in splicing decisions,86 we model only the initial exon definition step and assume the corresponding binding
patterns determine splicing outcomes, as described below.
In the model pre-mRNA, none of the three exons are bound ("defined") by the spliceosome (white boxes), therefore this state is
denoted "P0_0_0" (Figure S7F) with the notation "_" indicating the presence of an intron. In the model, the pre-mRNA (P0_0_0) is
synthesized at a constant rate s. The spliceosome can bind reversibly to each of the exons with on-rates k1, k2, and k3. For instance,
from P0_0_0 we can obtain P1_0_0, P0_1_0, and P0_0_1 through binding to the first, second, and third exon, respectively. Subse-
quent binding is possible; for example, P1_0_1 can be generated from P1_0_0 with the rate constant k3. In total, there are eight spli-
ceosomal binding states, including the fully bound state (P1_1_1), in which all exons are defined. All binding reactions are assumed to
be reversible, i.e., k4, k5, and k6 are the dissociation rate constants and the reverse of k1, k2, and k3, respectively. For example, in state
P1_1_0, spliceosome dissociation from exon 1 with the rate constant k4 yields the species P0_1_0.
Depending on the exon definition states, splicing decisions are made, and irreversible splicing reactions are possible. For a splicing
event to occur, we consider that both exons flanking a future splice junction must be defined. For instance, skipping of exon 2 is
possible from P1_0_1 and occurs with the rate constant i12. Likewise, splicing of the first intron occurs from the species P1_1_0
and P1_1_1 (rate constant i1), and splicing of the second intron from P0_1_1 and P1_1_1 (rate constant i2). The inclusion isoform
is generated in two steps, that is, from the subsequent removal of introns 1 and 2 in random order: from the binding state
P1_1_1, intron splicing generates two alternative intermediates in which either of the introns is already spliced (P1_11 or P11_1)
and the retained intron can be further spliced in a subsequent reaction. Splicing of the partially defined species P1_1_0 and
P0_1_1 yields the species P11_0 and P0_11; in these, the spliceosome can further reversibly bind exons 3 and 1, respectively,
and undergo a second splicing reaction toward inclusion. In the model, all terminal splice products are subject to degradation (kincl,
degradation rate constant of inclusion; kskip, skipping; kdr1, first intron retention; kdr2, second intron retention). The degradation rate
constant of the full intron retention isoform is the sum of kdr1 and kdr2, reflecting that either intron may contain a destabilizing prema-
ture stop codon. Model species that can be bound or spliced further (P0_0_0, P1_0_0, P0_1_0, P0_0_1, P1_1_0, P1_0_1, P0_1_1,
P1_1_1, P0_11, P1_11, P11_0, P11_1) are not subject to degradation, but they can be exported from the nucleus with the rate con-
stant kret. This reaction reflects that there is a limited time window for splicing to occur, the intermediates otherwise being terminally
frozen in the corresponding intron retention state. The ordinary differential equations of the model are given in Table S4.
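The topology of the unspliced binding states described above can be enumerated in a few lines of Python. This sketch covers only the eight binding states, their reversible binding steps, and the splicing reactions each state permits. It does not reproduce the ordinary differential equations of Table S4, and the function names are hypothetical.

```python
from itertools import product

def name(state):
    """Format a binding state such as (1, 0, 1) as 'P1_0_1'."""
    return "P" + "_".join(str(bit) for bit in state)

def binding_transitions(state):
    """Yield (rate_constant, next_state) for each single-exon binding or dissociation step."""
    for exon in range(3):
        nxt = list(state)
        nxt[exon] = 1 - state[exon]
        # k1..k3 are the on-rates; k4..k6 are the matching off-rates.
        rate = f"k{exon + 1}" if state[exon] == 0 else f"k{exon + 4}"
        yield rate, tuple(nxt)

def splicing_reactions(state):
    """Return the splicing rate constants available from an unspliced binding state."""
    e1, e2, e3 = state
    reactions = []
    if e1 and e2:
        reactions.append("i1")   # intron 1: exons 1 and 2 defined
    if e2 and e3:
        reactions.append("i2")   # intron 2: exons 2 and 3 defined
    if e1 and e3 and not e2:
        reactions.append("i12")  # skipping of exon 2: only from P1_0_1
    return reactions

states = list(product((0, 1), repeat=3))
for s in states:
    print(name(s), splicing_reactions(s), [(r, name(n)) for r, n in binding_transitions(s)])
```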
Topology of the intron definition model
Because a subset of human genes are spliced by an intron definition mechanism, we also considered this scenario in a modified
version of our splicing model. In contrast to the exon definition model, the 5’ and 3’ splice sites of an exon can be bound independently
of one another in the intron definition model. Furthermore, splicing of an intron is possible as soon as both splice sites flanking
this intron are defined. Hence, definition of two splice sites is sufficient for splicing to occur, whereas in the exon definition model four
splice sites need to be defined (3’ and 5’ splice sites of the two flanking exons). For the intron definition model, we use a notation for
binding state similar to that for exon definition. For instance, for consistency, we assigned the state in which no spliceosome compo-
nent is bound as P0_0_0. For spliceosome binding to exons 1 and 3, we again considered a single binding reaction, as only the splice
sites flanking the intron of interest are relevant for splicing. Hence, a transition from "0" to "1" in the first position (e.g., P0_0_0 to
P1_0_0) represents a spliceosome binding state downstream of exon 1 (5’ of the first intron), and "0" to "1" in the third position in-
dicates binding upstream of exon 3 (3’ of the second intron). For exon 2, we treat splice-site binding as two separate events. We use
"0" to denote no binding, "a" for upstream binding (e.g., P0_a_0), "b" for downstream binding (e.g., P0_b_0), and "1" for both U2 and
U1 being simultaneously bound (e.g., P0_1_0). Again, the presence or absence of "_" indicates whether or not the intron is removed.
We adopted the same parameter notation, that is, k1/k4 and k3/k6 to describe binding/dissociation at exons 1 and 3, respectively. The
new parameters k2a/k5a (upstream) and k2b/k5b (downstream) were introduced to represent spliceosome binding/dissociation around
exon 2. There are a total of 16 spliceosomal binding states in the intron definition model, with the following additional states not part of
the exon definition model: P0_a_0, P0_b_0, P1_a_0, P1_b_0, P0_a_1, P0_b_1, P1_a_1, and P1_b_1. If both splice sites flanking a
future splice junction are defined, splicing decisions, implemented as irreversible splicing reactions in the model, can occur. Skipping
of exon 2 is possible from P1_0_1 and occurs with the rate i12. Splicing of the first intron occurs from species P1_a_0, P1_1_0,
P1_a_1, and P1_1_1 (rate i1), and splicing of the second intron occurs from P0_b_1, P0_1_1, P1_b_1, and P1_1_1 (rate i2). The inclusion
isoform is generated in two steps: first, intron 1 or 2 is spliced from P1_1_1, generating P11_1 or P1_11, respectively. Second,
the retained intron can be further spliced in a subsequent reaction. Splicing of the partially defined species P1_a_0, P1_1_0, P0_b_1,
and P0_1_1 yields the species P1a_0, P11_0, P0_b1, and P0_11, respectively. To these, the spliceosome can bind further reversibly
with the association rate constants k1, k2a, k2b, and k3 (depending on the site of binding), and if the species P1_11 or P11_1 are formed,
a second splicing reaction toward inclusion can occur. All terminal splice products are subject to degradation, for which we adopted
the same assumptions and notation as for the exon definition model. Again, model species that can be bound or spliced further
(P0_0_0, P1_0_0, P0_a_0, P0_b_0, P0_1_0, P0_0_1, P1_1_0, P1_a_0, P1_b_0, P1_0_1, P0_a_1, P0_b_1, P0_1_1, P1_a_1,
P1_b_1, P1_1_1) can be exported from the nucleus with the rate constant kret. The ordinary differential equations of the model are
given in Table S4.
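As a consistency check on the state notation, the 16 binding states of the intron definition model and the states from which each splicing reaction can occur can be enumerated as follows. This is a sketch of the notation only, not the implementation used in the study, and the variable and function names are hypothetical.

```python
from itertools import product

# Exon 2 codes: "0" unbound, "a" upstream site only, "b" downstream site only, "1" both.
EXON2_CODES = "0ab1"

states = [f"P{e1}_{e2}_{e3}" for e1, e2, e3 in product("01", EXON2_CODES, "01")]

def splicing_reactions(state):
    """Return the splicing rates available from an unspliced intron-definition state."""
    e1, e2, e3 = state[1:].split("_")
    reactions = []
    if e1 == "1" and e2 in "a1":
        reactions.append("i1")   # both splice sites flanking intron 1 are defined
    if e3 == "1" and e2 in "b1":
        reactions.append("i2")   # both splice sites flanking intron 2 are defined
    if e1 == "1" and e3 == "1" and e2 == "0":
        reactions.append("i12")  # skipping of exon 2: only from P1_0_1
    return reactions

# States that do not exist in the exon definition model.
additional = [s for s in states if "a" in s or "b" in s]

print(len(states), "states;", len(additional), "additional")
print("i1 from:", [s for s in states if "i1" in splicing_reactions(s)])
print("i2 from:", [s for s in states if "i2" in splicing_reactions(s)])
```

The enumeration gives 16 states, eight of which carry an "a" or "b" code, and it recovers the same source states for i1, i2, and i12 as listed in the text.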
Model simulation and analysis
The differential equations were implemented in Matlab 2020b and solved using ode15s. To analyze splicing outcomes, we assumed a
steady state, and performed numerical simulations over long time periods (t = 1,000,000 min) to ensure that the concentrations of the
model species remained constant. Thus, we consider an RNA sequencing experiment, in which gene expression was measured in a
stationary cell population in the absence of any external perturbation. As a measure of splicing outcome, we used the steady-state
concentrations of inclusion and skipping (see also below).
Genome-wide splicing modeling by parameter sampling
The exon definition model consists of 15 kinetic parameters which belong to the following classes of reactions: spliceosome binding
(k1, k2, k3), spliceosome dissociation (k4, k5, k6), splicing catalysis (i1, i2, i12), and others, which are rates of pre-mRNA synthesis
(s), mRNA degradation (kincl, kskip, kdr1, kdr2), and terminal intron retention (kret). The values of these parameters were unknown
and likely greatly differ between exons in the human genome. To mimic the heterogeneity of exons in the human genome and to
assess the robustness of our simulation results, we randomly sampled all kinetic parameters in our model 10,000 times. As a refer-
ence parameter set, all parameter values were set to 1, except for kret, kincl, and kskip, which were set to 0.01 to ensure low levels of
intron retention that are typically observed in RNA sequencing datasets. We sampled each parameter in the model within a ±sevenfold
range around this reference using Latin hypercube sampling (lhsdesign command in Matlab). We performed simulations for each
parameter realization and calculated PSI = inclusion / (inclusion + skipping) as a measure of alternative splicing. We obtained a PSI
distribution between 0 and 1 that closely resembled the experimentally measured genome-wide PSI in control cells. The same procedure
was applied for intron definition, with the only difference being the number of parameters involved (17 in this case). These
kinetic parameters belong to the following classes of reactions: spliceosome binding (k1, k2a, k2b, k3) and spliceosome dissociation
(k4, k5a, k5b, k6); the remainder are identical to those used for exon definition.
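The sampling scheme can be sketched in plain Python. The study used Matlab's lhsdesign; the version below is a simplified re-implementation of Latin hypercube sampling for illustration. Mapping the unit interval onto the sevenfold range on a logarithmic scale is an assumption, since the text does not state how the range was parameterized. The parameter names follow the exon definition model, and the function names are hypothetical.

```python
import random

def latin_hypercube(n_samples, n_params, rng):
    """Return n_samples rows of n_params values in [0, 1), one value per stratum and column."""
    columns = []
    for _ in range(n_params):
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)  # decouple the strata between parameters
        columns.append(column)
    return [[columns[j][i] for j in range(n_params)] for i in range(n_samples)]

def sample_parameters(reference, n_samples, fold=7.0, seed=0):
    """Scale each reference value by a factor between 1/fold and fold (log scale: assumption)."""
    rng = random.Random(seed)
    names = list(reference)
    unit = latin_hypercube(n_samples, len(names), rng)
    return [{n: reference[n] * fold ** (2.0 * u - 1.0) for n, u in zip(names, row)}
            for row in unit]

def psi(inclusion, skipping):
    """Percent spliced-in as defined in the text: inclusion / (inclusion + skipping)."""
    return inclusion / (inclusion + skipping)

# Reference set from the text: all values 1, except kret, kincl, and kskip at 0.01.
reference = {name: 1.0 for name in
             ["k1", "k2", "k3", "k4", "k5", "k6", "i1", "i2", "i12", "s", "kdr1", "kdr2"]}
reference.update({"kret": 0.01, "kincl": 0.01, "kskip": 0.01})

samples = sample_parameters(reference, n_samples=10000)
print(len(samples), "parameter sets;", len(reference), "parameters each")
```

Each sampled parameter set would then be passed to the ODE model to obtain steady-state inclusion and skipping levels, from which `psi` is computed.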
Modeling FUBP1 knockout effects
To reproduce the FUBP1 KO data, we implemented two distinct assumptions about the mechanism of action of FUBP1: that FUBP1
affects late spliceosomal catalysis (i.e., the rate constants i1, i2 and/or i12), or that FUBP1 affects early spliceosomal binding (i.e., the
rate constants k1–k6). For both mechanistic assumptions, we considered that FUBP1 predominantly binds long introns (Figure 6A).
When simulating the effect FUBP1 KO has on splicing catalysis (model 2 in Figure 6A), we assumed that the splicing of short introns is
unaffected, but that KO selectively reduces the splicing rate for the excision of long introns 3.5-fold compared to control. To reflect
different combinations of long and short introns, we considered three scenarios in the FUBP1 KO simulations: (i) for the simulation of
cassette exons flanked by two long introns, we assumed that the FUBP1 KO slows all three splicing reactions in the model, that is, the
excision of intron 1, excision of intron 2 and exon skipping (i1, i2, and i12 are changed). (ii) For exons flanked by one short and one long
intron, it was assumed that the splicing rate of the short intron is unaffected by FUBP1 KO, whereas splicing rates of the long intron
and skipping are reduced. The long intron was either considered to be located upstream of the alternative exon (ii.a: i1 and i12 are
changed) or downstream (ii.b: i2 and i12 are changed). In either case, the skipping reaction was considered as an FUBP1-dependent,
long-range splicing event and was therefore perturbed in the FUBP1 KO simulation (i12 is changed). (iii) The third hypothetical scenario,
in which an alternative exon is flanked by two short introns, was not explicitly considered in our simulations, as the model would
predict no PSI change upon FUBP1 KO in this case. For each parameter sample (hypothetical exon), the KO scenarios i, ii.a, and ii.b
were implemented separately, resulting in three sets of 10,000 KO simulations. For each of these, the PSI changes upon FUBP1 KO
were calculated [ΔPSI = PSI(KO) – PSI(control)], and the corresponding ΔPSI distribution (Figure 6B) agrees well with the experimental
observation in RNA sequencing experiments. In the alternative FUBP1 KO implementation (model 1 in Figure 6A), we
assumed that FUBP1 promotes initial U2 binding to the 3’ splice site. Because the 3’ splice site marks the downstream end of an
intron, we assume that the FUBP1 KO reduces spliceosome binding to exons located downstream of long introns. In our model,
a long intron 1, therefore, results in a reduced exon 2 definition rate upon FUBP1 KO (k2 reduced 1.7-fold compared to control). Likewise,
a long intron 2 diminishes exon 3 definition (k3 reduced 1.7-fold upon FUBP1 KO). These perturbations were implemented alone
(one long and one short intron), or in combination (two long introns), and the corresponding ΔPSI distributions across all 10,000
parameter realizations are shown in Figure 6B. The perturbation in binding parameters (k2, k3) was chosen to be smaller (1.7-fold)
than the effect on splicing parameters (3.5-fold, model described above) to adjust for similar-sized effects on splicing in both implementations.
In contrast to the FUBP1 KO RNA sequencing data, these spliceosome binding simulations predict opposite PSI
changes for short introns being located upstream or downstream of the alternative exon. Hence, a model in which FUBP1 enhances
the catalytic excision of long introns explains the FUBP1 KO data better when compared to a model in which FUBP1 primarily helps to
recruit the pioneering U2 subunit to the 3’ splice site. The same FUBP1 KO simulations were also implemented in the intron definition
scenario. Here, the effect of FUBP1 on spliceosome binding (model 1 in Figure 6A) was assumed to affect the k2a parameter for a long
upstream intron and k3 for long downstream introns. If both introns are long, FUBP1 influences both k2a and k3. The effect of FUBP1
on splicing catalysis (model 2 in Figure 6A) in the intron definition model was implemented in the same way as described above for the
exon definition model. For FUBP1-based mechanisms of action, that is, binding and catalysis effects, very similar results were
observed for the intron and exon definition scenarios (Figure S7G). Hence, the model’s prediction that FUBP1 affects splicing catal-
ysis is robust and does not depend on the mechanism of splicing decision making.
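The knockout perturbations for the exon definition model can be summarized as parameter edits. The sketch below encodes the three catalysis scenarios (i, ii.a, ii.b; 3.5-fold) and the binding perturbations (k2 and/or k3; 1.7-fold) as described in the text. It assumes that each perturbation reduces the affected rate constant by the stated fold. The dictionary layout and function names are hypothetical, and the PSI values would come from simulating the ODE model, which is not reproduced here.

```python
# Parameters perturbed per scenario, following the text.
CATALYSIS_SCENARIOS = {
    "i": ("i1", "i2", "i12"),  # two long introns
    "ii.a": ("i1", "i12"),     # long intron upstream of the alternative exon
    "ii.b": ("i2", "i12"),     # long intron downstream of the alternative exon
}
BINDING_SCENARIOS = {
    "long_intron_1": ("k2",),          # exon 2 definition reduced
    "long_intron_2": ("k3",),          # exon 3 definition reduced
    "both_long": ("k2", "k3"),
}

def apply_ko(params, affected, fold):
    """Return a copy of params with the affected rate constants divided by fold."""
    ko = dict(params)
    for name in affected:
        ko[name] = params[name] / fold
    return ko

def catalysis_ko(params, scenario):
    return apply_ko(params, CATALYSIS_SCENARIOS[scenario], fold=3.5)

def binding_ko(params, scenario):
    return apply_ko(params, BINDING_SCENARIOS[scenario], fold=1.7)

def delta_psi(psi_ko, psi_control):
    """PSI change upon knockout: PSI(KO) minus PSI(control)."""
    return psi_ko - psi_control

control = {"k1": 1.0, "k2": 1.0, "k3": 1.0, "i1": 1.0, "i2": 1.0, "i12": 1.0}
print(catalysis_ko(control, "ii.a"))
print(binding_ko(control, "long_intron_1"))
```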
2.3.1 Supplementary material
Molecular Cell, Volume 83
Supplemental information
FUBP1 is a general splicing factor
facilitating 3’ splice site recognition
and splicing of long introns
Stefanie Ebersberger, Clara Hipp, Miriam M. Mulorz, Andreas Buchbender, Dalmira
Hubrich, Hyun-Seo Kang, Santiago Martínez-Lumbreras, Panajot Kristofori, F.X.
Reymond Sutandy, Lidia Llacsahuanga Allcca, Jonas Schönfeld, Cem Bakisoglu, Anke
Busch, Heike Hänel, Kerstin Tretow, Mareen Welzel, Antonella Di Liddo, Martin M.
Möckel, Kathi Zarnack, Ingo Ebersberger, Stefan Legewie, Katja Luck, Michael
Sattler, and Julian König
Figure S1
[Figure graphic not reproduced. Panel content recoverable from the extracted labels:
(A) Pie charts of the iCLIP binding-site distribution (3' ss, 5' ss, intron, CDS, 5' UTR, 3' UTR, ncRNA, intergenic regions) for FUBP1, U2AF2, SF1, SF3B1, and PTBP1.
(B) Saturation curve of downsampled reads for FUBP1, U2AF2, SF1, SF3B1, and PTBP1: fractions of bound 3' splice sites (%) against splice site usage [log2] (junction read bins).
(C) EMSA gels and binding curves: fraction of bound VPS13D RNA (%) against FUBP1^N-box+KH (aa 1–457) [nM]. Sequences as printed in two rows: VPS13D^short, GGAUUUGUGUCUUUGCUU / CUACUUUUCAUCCCUUCU (KD = 0.28 ± 0.06 μM); VPS13D^short mutated, GGACUCGUGUCCUCGCUU / CUACUCUCCAUCCCCUCU (KD > 6 μM).
(D) 1H–15N HSQC overlays of KH1–4 with KH1, KH2, KH3, and KH4.
(E) ΔδCα − ΔδCβ, hetNOE, T1 [ms], and T2 [ms] against residue number for KH1 to KH4, with secondary structure elements (β1 α1 α2 β2 β' α') marked.]
 
Figure S1. FUBP1 binding at 3' splice sites and RNA binding of KH domains (related to Figure 
1B, 1E and 2B-D) 
(A) Distribution of binding sites across transcript regions for FUBP1 (n = 854,404), U2AF2 (n = 
914,221), SF1 (n = 99,305), SF3B1 (n = 1,694,991), and PTBP1 (n = 127,450) iCLIP in HeLa cells 
(normalized for total transcript length). 3' and 5' splice site (ss) refer to 100 nt upstream and 
downstream of exons, respectively. CDS, coding sequence; UTR, untranslated region.  
(B) Saturation analysis on downsampled iCLIP data (FUBP1, ~57,000,000 crosslink events; SF3B1, 
~68,000,000; U2AF2, 54,000,000; SF1, 58,000,000; PTBP1, 49,000,000), where the iCLIP data for 
each splicing factor have approximately the same sequencing depth. 
(C) Fluorescent electrophoretic mobility shift assay (EMSA) experiment on recombinant 
FUBP1^N-box+KH (aa 1–457, 50 nM–12.8 μM) binding to a shortened 36-nt RNA fragment from VPS13D 
(VPS13D^short, 100 nM) (left). Agarose gel image (top) and quantification (bottom) with fitted curve 
show FUBP1–RNA binding in the nanomolar range (dissociation constant [KD] = 0.28 ± 0.06 μM). 
Agarose gel of a fluorescent EMSA experiment on recombinant FUBP1^N-box+KH (aa 1–457, 50 nM–
12.8 μM) binding to VPS13D^short mutated (100 nM), in which U-to-C mutations in U-rich stretches 
greatly reduce binding (right). 
(D) Overlays of the 1H–15N heteronuclear single quantum coherence (HSQC) spectra of FUBP1 KH1–
4 (black) with single KH domains (KH1, red; KH2, yellow; KH3, green; KH4, blue). Nuclear 
magnetic resonance (NMR) experiments of FUBP1^KH (KH1–4) show excellent spectral quality, 
despite the high molecular weight (~40 kDa), allowing most of the backbone chemical shifts to be 
assigned (310 out of 371 residues).  
(E) 13Cα and 13Cβ secondary chemical shifts and 15N relaxation experiments: {1H}-15N heteronuclear 
nuclear Overhauser effect (NOE), T1, T2, of the four KH domains of FUBP1. Folded KH domains 
exhibit more rigidity (NOE ~ 0.9, T1 ~ 1 s, T2 ~ 60 ms) whereas linker regions are more flexible 
(lower NOE, lower T1, higher T2). 
 
 
Figure S2
[Figure graphic not reproduced. Panel content recoverable from the extracted labels:
(A) SIA (scaffold-independent analysis) workflow: sixteen DNA pools (1. NANNN, 2. NCNNN, 3. NGNNN, 4. NTNNN, ..., 16. NNNNT), NMR titration at molar ratios 1:0 to 1:4, nucleotide preference per position.
(B) CSP [ppm] against residue number for KH1 (+ TTTTG, + UUUUG), KH2 (+ TTTGT, + UUUGU), KH3 (+ TCTGT, + UCUGU), and KH4 (+ TTTTG, + UUUUG).
(C–F) 1H–15N spectra and titration curves (Δδobserved against DNA/protein molar ratio) for FUBP1^KH1 + TTTTG, FUBP1^KH2 + TTTGT, FUBP1^KH3 + TCTGT, and FUBP1^KH4 + TTTTG. KD values printed in these panels: 344.5 ± 43.3 μM, 731.9 ± 78.8 μM, 71.3 ± 10.3 μM, and 403.1 ± 35.5 μM.
(G–I) 1H–15N spectra, CSP [ppm] against residue number, and ITC traces (DP [µW] against time [min]; ΔH [kJ/mol] against molar ratio) for FUBP1^KH12 + TTTGTAAAATTTTG, FUBP1^KH23 + TCTGTAAAATTTGT, and FUBP1^KH34 + TTTTGAAAATCTGT. KD values printed in these panels: 4.71 ± 1.45 µM, 1.15 ± 0.48 µM, and 0.87 ± 0.10 µM.
(J) FUBP1 in vivo iCLIP: relative motif frequency in the top 20% FUBP1 binding sites and motif enrichment (top 20% binding sites vs non-bound regions); labeled 4-mers: AUUU, UUUA, UUUU, UUAU, UAUU, UUUC, UUAA, AAUU, UUGU, UUUG, GUUU, UGUU, UAAU.
(K) Normalized positional motif frequency against position relative to branch point (nt), with the FUBP1 and SF3B1 binding regions marked.
(L) Fraction of introns (%) against number of motifs (0 to >6).
(M) BRET for FUBP1/U2AF2/SF1: cBRET against Acc/Don expression ratio for tested interactions, positive control, and negative control.
(N) 1H–15N spectrum of U2AF2^RRM2 + FUBP1^N-box at molar ratios 1:0 to 1:6, with labeled residues.]
Figure S2. Scaffold-independent analysis and titration curves for the final optimal binding motifs 
for each KH domain (related to Figure 2E-I, 3C, F) 
(A) Schematic workflow of the NMR-based scaffold-independent analysis (SIA) [S1]. SIA reports on 
the nucleic acid binding specificity of a given RNA-binding protein (RBP) at each position of a 
nucleic acid target. Sixteen 5-mer DNA pools with one specific nucleotide fixed at one position, 
otherwise randomized, are individually titrated to each KH domain. The observed changes in 
chemical shift of the selected peaks are averaged and normalized for each DNA pool to obtain a 
score for the nucleotide position and type preference. 
(B) Comparisons of chemical shift perturbations (CSPs) of all four FUBP1 KH domains upon addition 
of the optimal nucleotide motifs as either DNA or RNA (1:1 molar ratio of protein to RNA). This 
shows that DNA and RNA binding are very similar for all four KH domains of FUBP1. 
(C) NMR titration and dissociation constant (KD) calculation for the binding of FUBP1 KH1 (100 µM) 
with TTTTG up to a protein/DNA molar ratio of 1:8. The indicated residues are used for the 
calculation of KD. As expected for interactions of KH domains to nucleic acids, the changes in 
chemical shift are mostly mapped to the α1 and α2 helices and the GXXG loop. 
(D) NMR titration and KD calculation for the binding of FUBP1 KH2 (100 µM) with TTTGT up to a 
protein/DNA molar ratio of 1:16. The marked residues are used for the calculation of KD. As 
expected, the changes in chemical shift are mostly mapped to the α1 and α2 helices and the GXXG 
loop. 
(E) NMR titration and KD calculation for the binding of FUBP1 KH3 (100 µM) with TCTGT up to a 
protein/DNA molar ratio of 1:12. The marked residues are used for the calculation of KD. As 
expected, the changes in chemical shift are mostly mapped to the α1 and α2 helices and the GXXG 
loop. 
(F) NMR titration and KD calculation for the binding of FUBP1 KH4 (100 µM) with TTTTG up to a 
protein/DNA molar ratio of 1:12. The marked residues are used for the calculation of KD. As 
expected, the changes in chemical shift are mostly mapped to the α1 and α2 helices and the GXXG 
loop.  
(G) NMR titration and ITC of FUBP1 KH1–2 binding to a DNA oligonucleotide containing an optimal 
DNA motif for each KH domain derived by SIA linked by an A4 linker (TTTGTAAAATTTTG). 
Consistent with the KD values from ITC, the NMR titration indicates binding in an intermediate 
exchange regime. 
(H) NMR titration and ITC of FUBP1 KH2–3 binding to a DNA oligonucleotide containing an optimal 
DNA motif for each KH domain derived by SIA linked by an A4 linker (TCTGTAAAATTTGT). 
Consistent with the KD values from ITC, the NMR titration indicates binding in an intermediate 
exchange regime. 
(I) NMR titration and ITC of FUBP1 KH3–4 binding to a DNA oligonucleotide containing an optimal 
DNA motif for each KH domain derived by SIA linked by an A4 linker (TTTTGAAAATCTGT). 
Consistent with the KD values from ITC, the NMR titrations indicate binding in an intermediate 
exchange regime. 
(J) Motif enrichment in the in vivo FUBP1 iCLIP data. Disjunct 4-mer frequencies were calculated in 
extended windows (5-nt binding site ± 5 nt) around the top 20% of binding sites based on 
expression-normalized iCLIP signal and in non-bound regions in the same introns excluding a 20-
nt region downstream of the 5' ss and a 150-nt region upstream of the branch point (BP). Enrichment 
for each motif is defined as the distance for each data point to the diagonal in the scatterplot of 
relative motif frequencies of the top 20% vs bottom 20% of binding sites. 
(K) Positional enrichment of FUBP1 binding motifs and control motifs relative to the BP. 
UUU+A/G/C, that is, 4-mers containing UUU interspersed at any position with A/G/C. Control 
motif sets are mononucleotide tracts interspersed by one other nucleotide. 4-mer frequencies were 
calculated position-wise upstream of the BP and compared to the average 4-mer frequencies in an 
intronic control region (a 100-nt-long region 100 nt downstream of the 5' splice site). Shaded 
regions correspond to the main binding regions of FUBP1 (red) and SF3B1 (blue). 
(L) Abundance of FUBP1 binding motifs (UUU+A/G) at 3' ss of human introns. Abundance for other 
mononucleotide motifs (AAA + C/G & AAAA, CCC + G/C & CCCC, GGG + A/C & GGGG) is 
given for the purpose of comparison. 
(M) Total luminescence and fluorescence measurements were used to estimate the amounts of FUBP1 
or the mutants FUBP1^A38D, FUBP1^ΔC, and FUBP1^W586,615R paired with wild-type U2AF2 and SF1 
(orange), BCL2L1-BAD as a positive control pair (green) and pairs that are not known to interact 
with each other as negative controls (gray) in a bioluminescence resonance energy transfer (BRET)-
based assay. Acceptor/donor ratios are similar for all pairs, making the cBRET values more 
comparable to each other. 
(N) NMR titration of U2AF2^RRM2 with FUBP1^N-box up to sixfold molar excess (left). Significantly 
shifted peaks are enlarged. The peaks with a chemical shift perturbation (CSP) ≥ 0.1 are shown in 
red along with corresponding residues on the structure of U2AF2^RRM2 (right) (PDB ID: 8P25). 
 
 
Figure S3
[Figure graphic not reproduced. Panel content recoverable from the extracted labels:
(A) 1H–15N spectra of U2AF2^RRM12 with FUBP1, FUBP1^N74, and FUBP1^ΔN.
(B) FUBP1 construct scheme (residues 1–644; N-box, P-rich region, and A and B boxes labeled) and N-box sequence (residues 25–56): GGVNDAFKDALQRARQIAAKIGGDAGTSLNSN, with the tested FUBP1^N-box mutations marked (A38).
(C) CSP [ppm] against FUBP1 residue number for FUBP1^N-box + U2AF2^RRM12 and FUBP1^N74 + U2AF2^RRM12.
(D) 1H–15N spectrum of FUBP1^N-box + U2AF2^RRM2 at molar ratios 1:0 to 1:6, with labeled residues.
(E) ΔδCα − ΔδCβ (strand/helix) against FUBP1 residue number for FUBP1^N-box with and without U2AF2.
(F) CSP [ppm] against FUBP1 residue number for FUBP1^N-box + U2AF2^RRM12, + U2AF2^linker-RRM2, and + U2AF2^RRM2.
(G) Chemical shift differences [ppm] of A34, Q36, A38, R39, A43, K44, G46, and G47 against the molar ratio [U2AF2^RRM2]/[FUBP1^N-box]; KD = 30.3 ± 5.3 μM.
(H) KD [µM] for FUBP1 constructs (N-box, N74) and U2AF2 constructs (RRM12, linker-RRM2, RRM2); bar labels 76, 71, 76, 79, and 82 μM.
(I, J) 1H–15N spectra of the chimera of U2AF2^linker-RRM2 and FUBP1^N-box overlaid with titrations at molar ratios 1:0 to 1:6.
(K) Structure of the chimera: U2AF2^linker-RRM2 (residues 231–342), GS-linker (GSGGSGSSGSGGSG), and FUBP1^N-box (residues 25–56).]
 
Figure S3. Determination of the minimal interaction interface between FUBP1^N-box and 
U2AF2^RRM2 (related to Figure 3F-H) 
(A) Comparison of a selected region of the one-point NMR titrations (0.5 molar ratio) of U2AF2^RRM2 
(red) with full-length FUBP1 (cyan), FUBP1^N74 (blue), and FUBP1^ΔN (black) showing significant 
chemical shift changes. 
(B) Overview of the FUBP1 construct used for NMR and BRET experiments. Red color marks 
mutations that are tested for effect on binding of FUBP1 with U2AF2. 
(C) Comparison of CSPs upon the titration of a shortened FUBP1 N-terminal construct (FUBP1N-box,
aa 25–56; green) and FUBP1N74 (blue) with U2AF2RRM12.
(D) NMR titration of FUBP1N-box with U2AF2RRM2 up to sixfold molar excess (left). Significantly
shifted peaks are boxed and enlarged. The peaks with a CSP ≥ 0.1 are highlighted in red along with
corresponding residues on the structure of FUBP1N-box (right) (PDB ID: 8P25).
(E) Comparison of the Cα and Cβ chemical shift-derived secondary structure of free FUBP1N-box (blue)
and FUBP1N-box bound to U2AF2RRM2 (orange). The fractional helical conformation for residues
30–45 in the absence of U2AF2 is further increased upon binding to U2AF2.
(F) Comparison of the CSP of FUBP1N-box titrations with U2AF2 constructs of various lengths
(U2AF2RRM12, black; U2AF2linker-RRM2, orange; U2AF2RRM2, light blue).
(G) Calculation of KD for the FUBP1N-box and U2AF2RRM2 interaction derived by NMR titration. The
changes in chemical shift of selected residues in the titration of FUBP1N-box with U2AF2RRM2
(shown in panel D) are plotted against the molar ratio of ligand to titrant.
(H) Comparison of KD values for the interaction of FUBP1N-box (black) and FUBP1N74 (blue) with
U2AF2RRM12 and FUBP1N-box with U2AF2RRM12 (black), U2AF2linker-RRM2 (orange),
and U2AF2RRM2 (light blue), determined by ITC (Table S2). The measurements were performed
in triplicates and data are represented as mean ± SD.
(I) Overlay of the 1H–15N HSQC spectra of the chimeric construct U2AF2linker-RRM2/FUBP1N-box (cyan)
and FUBP1N-box titrated with U2AF2linker-RRM2 (molar ratio of 1:6).
(J) Overlay of the 1H–15N HSQC spectra of the chimeric construct U2AF2linker-RRM2/FUBP1N-box (cyan)
and U2AF2linker-RRM2 titrated with FUBP1N-box (molar ratio of 1:6).
(K) NMR ensemble (10 lowest energy structures) of the chimeric construct U2AF2linker-RRM2
(green)/FUBP1N-box (brown). The end of the flexible linker between RRM1-RRM2 (231–245) is
not shown; the artificial GS-linker between the C terminus of U2AF2 RRM2 and the N-terminal
region of FUBP1N-box is indicated by gray dashed lines (PDB ID: 8P25).
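The KD estimate in panel G rests on fitting chemical shift changes against the molar ratio. The fitting procedure itself is not described in this caption, so the following is only a minimal sketch of a standard 1:1 binding isotherm fit. The function names, the labeled-protein concentration (100 µM), and the synthetic data points are all assumptions made for illustration, not values from the experiments.

```python
import math

def fraction_bound(p_total, l_total, kd):
    """Fraction of labeled protein in a 1:1 complex (all values in the same units)."""
    s = p_total + l_total + kd
    return (s - math.sqrt(s * s - 4.0 * p_total * l_total)) / (2.0 * p_total)

def fit_kd(ratios, csps, p_total, kd_grid):
    """Grid search over KD; for each KD the best plateau CSP has a closed form."""
    best = None
    for kd in kd_grid:
        f = [fraction_bound(p_total, r * p_total, kd) for r in ratios]
        # Linear least squares for the plateau value at fixed KD.
        csp_max = sum(x * y for x, y in zip(f, csps)) / sum(x * x for x in f)
        sse = sum((csp_max * x - y) ** 2 for x, y in zip(f, csps))
        if best is None or sse < best[0]:
            best = (sse, kd, csp_max)
    return best[1], best[2]

# Hypothetical, noise-free titration generated from known parameters.
P_TOTAL = 100.0                      # assumed labeled-protein concentration [µM]
ratios = [0.5, 1, 2, 4, 6]           # molar ratios ligand : labeled protein
true_kd, true_max = 30.0, 0.20
csps = [true_max * fraction_bound(P_TOTAL, r * P_TOTAL, true_kd) for r in ratios]

kd_grid = [0.5 * k for k in range(1, 401)]   # 0.5 to 200 µM in 0.5 µM steps
kd_fit, csp_max_fit = fit_kd(ratios, csps, P_TOTAL, kd_grid)
```

A grid search is used here only to keep the sketch dependency-free; a nonlinear least-squares routine would normally replace it and would also provide an uncertainty estimate.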
 
[Figure S4: image not reproduced. Recoverable content:
Panel A, alignment of the RRM2 domains of PUF60 and U2AF2 (end positions at right):
PUF60  ---EARAFNRIYVASVHQDLSDDDIKSVFEA 248
U2AF2  STVVPDSAHKLFIGGLPNYLNDDQVKELLTS 281
PUF60  FGKIKSCTLARDPTTGKHKGYGFIEYEKAQS 279
U2AF2  FGPLKAFNLVKDSATGLSKGYAFCEYVDINV 312
PUF60  SQDAVSSMNLFDLGGQYLRVGKAVTPPMPLL 310
U2AF2  TDQAIAGLNGMQLGDKKLLVQRASVGAKNAT 343
Panel B, cancer types listed: oligodendroglioma, chronic lymphocytic leukemia, uterine endometrioid
carcinoma, thyroid carcinoma, neuroendocrine tumor, lung adenocarcinoma, colon adenocarcinoma,
melanoma, unknown.
Other labels: NMR titrations of WT, I45F, and G47C FUBP1N-box with U2AF2 RRM2 at molar ratios 1:0 to
1:2; 1H–15N HSQC spectra of the chimera with axes ω2 - 1H (ppm) and ω1 - 15N (ppm); BRET titrations
of FUBP1 + SF1 with negative control; Acc/Don ratio.]
Figure S4. The effects of cancer-related mutations in the FUBP1 N-box on the interaction with
U2AF2 RRM2 (related to Figure 3C, 3I-J)
(A) Sequence alignment (Clustal Omega [S4]) of the RRM2 domains of human PUF60 and U2AF2,
mapping conserved residues (red) and similar residues (orange). Overlay of the structure of PUF60
RRM2 (black) and U2AF2 RRM2 (green) (adapted from [S2] and [S3]; PDB IDs: 2KXH, 6TR0).
(B) FUBP1 N-box mutations identified in different cancer types. Databases (see STAR Methods) were
screened for the occurrence of cancer-related mutations within the region of FUBP1 encoding
the N-box, yielding one insertion, five frameshifts (fs) leading to a premature termination codon
(*) and 20 missense variants.
(C) Cancer-related mutations (labeled and side chains shown on the calculated structure of a chimeric
construct of U2AF2RRM2 and FUBP1N-box, PDB ID: 8P25) within the helical binding region of
FUBP1N-box and located at the interfaces with U2AF2RRM2 were selected for further NMR study.
(D) Comparison of the changes in chemical shift of residue A34 for the titration of FUBP1N-box
wild-type and mutants (L35V, A38D, A43E, K44R, I45F, G47C) upon adding U2AF2RRM2.
(E) Overlay of the 1H–15N HSQC spectra of the A38D mutant of FUBP1N-box (red) with the chimeric
construct U2AF2linker-RRM2/FUBP1N-box with A38D mutation (cyan).
(F) Overlay of the 1H–15N HSQC spectra of the chimeric construct U2AF2linker-RRM2/FUBP1N-box with
A38D mutation (cyan) with the titration of FUBP1N-box with U2AF2linker-RRM2 shows that the mutant
spectrum resembles those of the unbound individual components.
(G) Comparison of the 15N T2 relaxation rates of the chimeric constructs U2AF2linker-RRM2/FUBP1N-box
wild-type and A38D mutant. Increased T2 relaxation rates in the N-box helix of the A38D mutant
chimera compared to the wild-type is consistent with much weaker binding of the mutant to the
U2AF2 RRM2.
(H) BRET titration curves shown for FUBP1 and FUBP1A38D versus SF1. As expected, mutation of the
FUBP1 N-box does not result in significant loss of binding to SF1. Two biological replicates are
shown, each done in technical triplicates. Error bars represent the standard deviation.
(I) Total luminescence (Don) and fluorescence (Acc) ratios were determined for FUBP1 and
FUBP1A38D versus SF1. Acceptor/donor ratios are similar for all pairs making the cBRET values
more comparable to each other.
 
  
[Figure S5: image not reproduced. Recoverable content:
Panel titles: (A) in vitro iCLIP oligo signal correlation; (B) in vitro iCLIP peak signal correlation;
(D) in vitro iCLIP per nucleotide correlation; (E) in vitro iCLIP peak signal correlation.
Panels C and F, axis: U2AF2 peak signal (fold change over U2AF2 alone [log2]).
Panel G, FUBP1 N-box alleles in the RPE1 cell lines:
WT          Allele 1+2  15 AGGGGGGGGGGGVNDAFKDALQRARQIAAKIGGDAGTSLNSN...
Nboxmut     Allele 1+2  15 AGGGGGGGGGGGVNDAAESRKLT---IAAKIGGDAGTSLNSN...
FUBP1 KO    Allele 1    15 AGGGGGGGGGGGVNDAD---CSKNWR*
FUBP1 KO    Allele 2    15 AGGGGGGGGGGGVNDAGPADCSKNWR*
Panel H: Up: 904, Down: 410, n.s.; axes: Mean expression [log2] and Fold change [log2] KO/WT.
Panels I (FUBP1 KO) and J (FUBP1-Nboxmut), axis: Minimum intron length (nt).]
Figure S5. Reproducibility between replicates and changes in U2AF2RRM12 binding from in vitro
iCLIP experiments and expression and splicing changes upon FUBP1 KO (related to Figure 4A-F,
5A-B)
(A) Reproducibility of in vitro iCLIP data with oligonucleotide-derived transcript library. The
correlation matrix shows pairwise Pearson correlation of U2AF2RRM12 crosslink events per
oligonucleotide (n = 1,998) between samples. Experiments were performed with U2AF2RRM12 alone
(50 nM) and with the addition of full-length FUBP1 at 50 or 300 nM.
(B) Reproducibility of in vitro iCLIP data with oligonucleotide-derived transcript library. The
correlation matrix shows pairwise Pearson correlation of total U2AF2RRM12 crosslink events inside
U2AF2 binding sites between samples (1,831 oligonucleotides harbor U2AF2 binding sites
according to U2AF2 in vivo iCLIP). Experiments as in panel A.
(C) Comparative boxplot of normalized U2AF2RRM12 crosslink events per binding site between
conditions (n = 1,504). Experiments as in panel A.
(D) Reproducibility of in vitro iCLIP data with eight long in vitro transcripts [S5]. The correlation
matrix shows pairwise Pearson correlation of U2AF2RRM12 crosslink events per nucleotide over all
in vitro transcripts between samples. Experiments were performed with U2AF2RRM12 alone (50 nM)
and with the addition of full-length FUBP1FL, FUBP1N74, and FUBP1ΔN (all 50 nM).
(E) Reproducibility of in vitro iCLIP data with eight long in vitro transcripts [S5]. Correlation matrix
shows pairwise Pearson correlation of total binding signals (n = 109) between samples.
Experiments as in panel D.
(F) Comparative boxplot of normalized U2AF2RRM12 crosslink events between conditions (n = 109).
Experiments as in panel D.
(G) Zoom-in of the FUBP1N-box sequence, which when targeted with CRISPR/Cas9 results in a
knockout cell line (FUBP1 KO) and a mutant cell line (FUBP1-Nboxmut), in which FUBP1 lacks
the U2AF2 interaction surface.
(H) Log2 fold change versus mean expression for genes upon FUBP1 KO in RPE1 cells.
(I) Minimum adjacent intron length for cassette exons that are more or less included and for
constitutive exons less included in FUBP1-Nboxmut RPE1 cells (n = 123/249/27) compared to
unchanged control exons (n = 4,584) and unchanged constitutive control exons (n = 5,717).
(J) Minimum adjacent intron length for cassette exons that are more or less included in FUBP1-Nboxmut
RPE1 cells (n = 36/45) compared to unchanged control exons (n = 10,678).
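The correlation matrices in panels A, B, D, and E report pairwise Pearson correlations between samples. As an illustration of that calculation only, here is a minimal sketch; the sample names and crosslink counts are hypothetical and do not come from the iCLIP data.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long vectors (non-constant input assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def correlation_matrix(samples):
    """Pairwise Pearson correlation for a dict of sample name -> per-feature counts."""
    names = list(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in names for b in names}

# Hypothetical crosslink events per oligonucleotide for three samples.
samples = {
    "rep1": [1, 2, 3, 4],
    "rep2": [2, 4, 6, 8],
    "rep3": [4, 3, 2, 1],
}
matrix = correlation_matrix(samples)
```

In a real analysis each vector would hold one value per oligonucleotide, binding site, or nucleotide, matching the unit named in the respective panel.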
 
 
[Figure S6: image not reproduced. Recoverable labels: in vivo iCLIP track for the MPDZ minigene
region; minigene variants MPDZ, MPDZΔBS, MPDZΔintron, and MPDZΔintron+ΔBS; WT and N-boxmut cell
lines; metaprofile positions relative to the branch point (BP).]
Figure S6. FUBP1 effects on long introns (related to Figure 5C-H) 
(A) Position and identity of FUBP1 loss-of-function (LoF) mutations in glioma patients with 1p/19q 
deletion-positive background [S6].  
(B) Genome browser view of the region included in the MPDZ minigene displaying the in vivo iCLIP 
data (crosslink events per nucleotide) of FUBP1 (orange). Deletions of introns with/without FUBP1 
binding sites are indicated below with red bars. 
(C) EMSA experiment to demonstrate binding of recombinant FUBP1N-box+KH (aa 1–457, 25–3200 nM)
to a fluorescently labeled 132-nt RNA fragment from MPDZ (100 nM). Agarose gel image (bottom)
and quantification (top) with fitted curve show FUBP1–RNA binding in a nanomolar range (KD =
0.23 ± 0.03 μM).
(D) Capillary electrophoresis of exon inclusion levels upon intron shortening in the MPDZ minigene. 
(E) Metaprofile showing the number of crosslink events of FUBP1 relative to the branch point in 
dependency on 3' splice site strength. iCLIP signals are normalized for expression and then 
averaged per nucleotide over all introns (left). Binding enrichment quantification: Area under the 
curve (AUC) in each intron class compared to the AUC in introns with very low 3' splice site 
strength (right). 
(F) Metaprofile showing the number of crosslink events of FUBP1 relative to the branch point in 
dependency on 5' splice site strength. iCLIP signals are normalized for expression and then 
averaged per nucleotide over all introns (left). Binding enrichment quantification: AUC in each 
intron class compared to the AUC in introns with very low 5' splice site strength (right). 
(G) Metaprofile showing the number of crosslink events of FUBP1 relative to the branch point in 
dependency on Py tract strength. iCLIP signals are normalized for expression and then averaged 
per nucleotide over all introns (left). Binding enrichment quantification: AUC in each intron class 
compared to the AUC in introns with very low Py tract strength (right). 
(H) Metaprofile showing the number of crosslink events of FUBP1 relative to the branch point in 
dependency on BP strength. iCLIP signals are normalized for expression and then averaged per 
nucleotide over all introns (left). Binding enrichment quantification: AUC in each intron class 
compared to the AUC in introns with very weak BP strength (right). 
(I) Fraction of introns with 0, 1, 2, 3, or > 3 motif sets of size 9 of random 4-mers in dependency on 
intron length. Random sets were drawn 100 times and the resulting fractions were then averaged. 
(J) Cumulative distribution of splice site features conditioned on intron length. 
(K) Number of FUBP1-binding motifs upstream of the BP ([−100 nt; −26 nt]) in dependency on 
differential GC content. Differential GC content is the GC content of the exon minus that of the 
first 100 nt of the downstream intron. 
(L) Enrichment of FUBP1 binding upstream of the branch point in dependency on exon/intron GC 
content and exon rank. In the underlying metaprofiles, iCLIP signals are normalized for expression 
and then averaged per nucleotide over all introns. 
(M) Fraction of introns with 0, 1, 2, 3 or > 3 motif sets of size 9 of random 4-mers in dependency on 
differential GC content. Random sets are drawn 100 times and resulting fractions are then averaged. 
(N) Percentage of bound introns across different scopes of Euclidean distance, where 1 denotes the
nuclear center and 5 the periphery. Enrichment is shown compared to the first scope. Based on
data from [S7].
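Panel K defines differential GC content as the GC content of the exon minus that of the first 100 nt of the downstream intron. A minimal sketch of that definition follows; the function names and the toy sequences are assumptions for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def differential_gc(exon, downstream_intron, window=100):
    """Exon GC content minus GC content of the first `window` nt of the downstream intron."""
    return gc_content(exon) - gc_content(downstream_intron[:window])

# Toy sequences: a 50 % GC exon and an intron whose first 100 nt contain no G or C.
exon = "GCAT" * 10
intron = "AT" * 50 + "GC" * 50
diff = differential_gc(exon, intron)
```

Under this definition a value near zero corresponds to similar exon and intron GC content, and a large positive value to an exon that is GC-richer than the start of its downstream intron.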
 
 
[Figure S7: image not reproduced. Recoverable labels: GC architecture (leveled, differential);
position relative to branch point (nt); factors FUBP1, SF1, SF3B1, U2AF65, and PTBP1; intron length
classes [100, 400], (400, 1000], (1000, 2000], (2000, 4000], and (4000, 17000]; median intron length
per taxonomic group; model scheme with synthesis, binding, unbinding, inclusion, skipping, first,
second, and full intron retention, and degradation; Model 1 and Model 2 predictions of the splicing
change upon FUBP1 KO [ΔPSI]; binding enrichment [log2]; normalized iCLIP signal; cBRET versus
Acc/Don expression ratio for U1 proteins + FUBP1 with positive and negative controls; 1H–15N HSQC
spectra of FUBP1A/B + SNRPA and PRPF40BN-term-WW12 + FUBP1P-rich+A/B; Western blot (α-FUBP1) of
GFP-tagged FUBP1 constructs in RPE1 WT and FUBP1 KO cells; inclusion and skipping bands.]
Figure S7. Characterization and modeling of FUBP1 binding behavior (related to Figure 5E, 5I-J,
6A-B, 6G-I)
(A) Metaprofile showing the number of crosslink events of FUBP1 relative to the BP in dependency
on differential GC content. iCLIP signals are normalized for expression and then averaged per
nucleotide over all introns.
(B) Binding enrichment quantification: AUC in each intron class compared to the AUC in introns
with leveled GC content.
(C) Comparison of exon and intron GC content for exons with increasing differential GC content
architecture.
(D) Enrichment of FUBP1 binding upstream of the branch point in dependency on intron length and
differential GC content. Exons were classified into intron length groups and then split by
GC content architecture (left panel) and vice versa (right panel).
(E) Intron length distribution between kingdoms. Analyses were performed for 174 mammals, 274
non-mammalian vertebrates, 277 invertebrates, 410 fungal species, 94 protozoa, and 145 plants.
(F) Detailed scheme of the mathematical model describing exon definition and splicing for a
cassette exon flanked by two constitutive exons. After pre-mRNA synthesis (s), the three exons
(indicated by boxes) can be cooperatively and reversibly bound by the pioneering spliceosome
subunits U1 and U2 (these are not explicitly displayed in the scheme). Colorless (0) and colored
(1) squares represent unbound ("undefined") and bound ("defined") exons, respectively. Red,
green, and blue arrows represent binding to and dissociation from exons 1, 2, and 3, respectively,
where k1–k3 are the corresponding rate constants of binding and k4–k6 the rate constants of
dissociation. Based on the exon definition patterns (highlighted by red ellipse), splicing
decisions towards multiple splice isoforms (inclusion, skipping, first intron retention, second
intron retention, full intron retention) are made, and it is assumed that an intron can be excised
if the two neighboring exons are defined. For instance, skipping of exon 2 is possible from the
state 1_0_1 and occurs with the rate i12. Likewise, splicing of the first intron occurs from the
species P1_1_0 and P1_1_1 (rate i1), and splicing of the second intron from P0_1_1 and P1_1_1
(rate i2). The inclusion isoform is generated in two steps, i.e., from the subsequent removal of
intron 1 and intron 2 in random order. All terminal splice products are subject to degradation
(kincl: degradation rate constant of inclusion, kskip: skipping, kdr1: first intron retention, kdr2: second
intron retention, kdr1+kdr2: full intron retention).
(G) The intron and exon definition models show similar splicing changes upon FUBP1 knockout
(KO). We simulated the splicing changes upon FUBP1 KO based on the assumption that FUBP1
affects the rate of spliceosome binding to the 3’ splice site of long introns (left panel) or the rate
of splicing catalysis across long introns (right panel) as described in detail in the STAR
Methods. To account for the heterogeneity of exons in the human genome, we randomly
sampled the kinetic parameters of the model 10,000 times to generate an ensemble of 10,000 in
silico exons. We then simulated FUBP1 KO for each in silico exon, assuming that FUBP1
selectively enhances the rate of splicing for long introns, and considered three scenarios
reflecting different length configurations of upstream and downstream introns (see STAR
Methods for details). The boxplots show the distributions of ΔPSI = PSI(KO) – PSI(control)
values for exon (red) and intron definition (blue) across all exons.
(H) Total luminescence and fluorescence measurements were used to estimate the amount of FUBP1
paired with the components of U1 complex (orange), BCL2L1–BAD as a positive control pair
(green) and pairs that are not known to interact with each other as negative controls
(gray). Acceptor/donor ratios are similar for all pairs making the cBRET values more
comparable to each other.
(I) 1H–15N HSQC spectra of the titration of FUBP1A/B with SNRPA up to a molar ratio of 1:1.
(J) 1H–15N HSQC spectra of the titration of PRPF40BN-term-WW12 with FUBP1P-rich+A/B up to a molar
ratio of 1:0.5.
(K) Western blot to verify FUBP1 construct expression after transfection of RPE1 WT and FUBP1 
KO cells. 
(L) Capillary electrophoresis of exon inclusion levels of the MPDZ minigene after transfection of 
RPE1 WT and FUBP1 KO cells with different FUBP1 constructs. 
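The exon definition scheme in panel F can be read as a set of binding states with state-dependent splicing reactions. The sketch below enumerates those states under the reading that 1 marks a defined exon, as implied by state names such as 1_0_1 and by the rule that an intron is excised only when both neighboring exons are defined. It covers only the reaction rules named in the caption, not the full kinetic model, and the function and label names are assumptions.

```python
from itertools import product

def allowed_splicing(state):
    """Splicing reactions available from a binding state (e1, e2, e3), 1 = defined exon."""
    e1, e2, e3 = state
    reactions = []
    if e1 and e2:
        reactions.append("splice intron 1 (i1)")   # from 1_1_0 and 1_1_1
    if e2 and e3:
        reactions.append("splice intron 2 (i2)")   # from 0_1_1 and 1_1_1
    if e1 and e3 and not e2:
        reactions.append("skip exon 2 (i12)")      # from 1_0_1 only
    return reactions

# All eight exon definition patterns for three exons.
states = list(product((0, 1), repeat=3))
table = {"{}_{}_{}".format(*s): allowed_splicing(s) for s in states}

for name in sorted(table):
    print(name, table[name] or ["no splicing possible"])
```

States without any available reaction can only change through binding or dissociation (rate constants k1–k6 in the caption).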
 
  
Supplementary Tables 
 
Table S2. Binding affinities and stoichiometries determined by ITC experiments (related to
Figures 2D, 2F, S2G–I, and S3H). Experiments were performed for different FUBP1 N-terminal
constructs (FUBP1N-box, FUBP1N74) with U2AF2 constructs (U2AF2RRM12, U2AF2linker-RRM2 and
U2AF2RRM2) and various FUBP1 KH domain constructs (FUBP1KH12, FUBP1KH23, FUBP1KH34 and
FUBP1KH) with DNA or RNA.
Analyte            Titrant           N sites        KD [μM]          Repeats
U2AF2RRM12         FUBP1N-box        0.98 ± 0.17    75.93 ± 2.70     3
U2AF2RRM12         FUBP1N74          0.85 ± 0.07    70.57 ± 2.41     3
U2AF2linker-RRM2   FUBP1N-box        0.91 ± 0.16    78.97 ± 2.16     3
U2AF2RRM2          FUBP1N-box        0.87 ± 0.16    82.03 ± 2.98     3
FUBP1KH12          TTTGTAAAATTTTG    0.78 ± 0.07    4.71 ± 1.45      3
FUBP1KH23          TCTGTAAAATTTGT    0.76 ± 0.09    1.15 ± 0.48      3
FUBP1KH34          TTTTGAAAATCTGT    0.74 ± 0.04    0.87 ± 0.10      3
VPS13D RNA         FUBP1KH           1.38 ± 0.04    0.428 ± 0.062    3
 
Supplementary References 
[S1] Beuth, Barbara, María Flor García-Mayoral, Ian A. Taylor, and Andres Ramos. 2007. “Scaffold-
Independent Analysis of RNA-Protein Interactions: The Nova-1 KH3-RNA Complex.” Journal of 
the American Chemical Society 129 (33): 10205–10. 
[S2] Kang, Hyun-Seo, Carolina Sánchez-Rico, Stefanie Ebersberger, F. X. Reymond Sutandy, Anke 
Busch, Thomas Welte, Ralf Stehle, et al. 2020. “An Autoinhibitory Intramolecular Interaction 
Proof-Reads RNA Recognition by the Essential Splicing Factor U2AF2.” Proceedings of the 
National Academy of Sciences of the United States of America 117 (13): 7140–49. 
[S3] Cukier, Cyprian D., David Hollingworth, Stephen R. Martin, Geoff Kelly, Irene Díaz-Moreno, 
and Andres Ramos. 2010. “Molecular Basis of FIR-Mediated c-Myc Transcriptional Control.” 
Nature Structural & Molecular Biology 17 (9): 1058–64. 
[S4] Madeira, Fábio, Young Mi Park, Joon Lee, Nicola Buso, Tamer Gur, Nandana Madhusoodanan, 
Prasad Basutkar, et al. 2019. “The EMBL-EBI Search and Sequence Analysis Tools APIs in 
2019.” Nucleic Acids Research 47 (W1): W636–41. 
[S5] Sutandy, F. X. Reymond, Stefanie Ebersberger, Lu Huang, Anke Busch, Maximilian Bach, 
Hyun-Seo Kang, Jörg Fallmann, et al. 2018. “In Vitro iCLIP-Based Modeling Uncovers How the 
Splicing Factor U2AF2 Relies on Regulation by Cofactors.” Genome Research 28 (5): 699–713. 
[S6] Seiler, Michael, Shouyong Peng, Anant A. Agrawal, James Palacino, Teng Teng, Ping Zhu, Peter 
G. Smith, Cancer Genome Atlas Research Network, Silvia Buonamici, and Lihua Yu. 2018. 
“Somatic Mutational Landscape of Splicing Factor Genes and Their Functional Consequences 
across 33 Cancer Types.” Cell Reports 23 (1): 282–96.e4. 
[S7] Tammer, Luna, Ofir Hameiri, Ifat Keydar, Vanessa Rachel Roy, Asaf Ashkenazy-Titelman, 
Noélia Custódio, Itay Sason, et al. 2022. “Gene Architecture Directs Splicing Outcome in 
Separate Nuclear Spatial Regions.” Molecular Cell. Elsevier. 
 
However, I performed the cloning of the ORFs and the site-directed
mutagenesis, and generated the constructs, in low-throughput
(Eppendorf tube) format. Next, I adapted these steps to
medium-throughput (plate) format. The protocol is described in
Appendix 5.1. The final pipeline, followed by the BRET assay, was
tested in the second collaborative project (see Article II).
2.4 Article II: Systematic discovery of protein interaction
interfaces using AlphaFold and experimental validation
Summary
This project focused on benchmarking AlphaFold-Multimer (AF-MM)
and its metrics, and on applying it to identify novel protein
interaction interfaces, followed by experimental validation.
AF-MM is a machine learning-based tool that predicts structures of
protein interactions and complexes. While this tool has been tested
on different PPI interface types, a comprehensive assessment of its
sensitivity and specificity, and of the potential biases of the tool
and its metrics, is lacking. Such benchmarking is essential before
AF-MM can be applied to the prediction of PPI interfaces. Therefore,
we systematically benchmarked the tool's ability to predict
domain-domain and domain-motif interfaces (DMIs). We compared the
predicted models to the solved structures and found that 35 % of the
putative DMIs were predicted correctly including the positions of
sidechains, whereas 34 % had correct backbone predictions.
We also evaluated the metrics of AF-MM for their ability to
distinguish known DMIs from random DMI pairs. We further examined
the effect of sequence length on the tool's performance and found that
long fragments of full-length proteins might worsen the predictions.
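Evaluating a metric for separating known DMIs from random pairs comes down to computing sensitivity and specificity at a score cutoff. The following is a minimal sketch of that calculation only; the scores, the cutoff, and the function name are hypothetical and are not AF-MM outputs from this project.

```python
def sensitivity_specificity(pos_scores, neg_scores, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    true_pos = sum(s >= cutoff for s in pos_scores)
    true_neg = sum(s < cutoff for s in neg_scores)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)

# Hypothetical model confidence scores.
known_dmi_scores = [0.9, 0.8, 0.4, 0.7]    # predictions for known DMIs
random_pair_scores = [0.2, 0.5, 0.6, 0.1]  # predictions for random DMI pairs

sens, spec = sensitivity_specificity(known_dmi_scores, random_pair_scores, cutoff=0.5)
```

Scanning the cutoff over the full score range yields the trade-off between the two quantities for a given metric.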
These findings motivated us to develop a fragmentation approach,
in which overlapping fragments were used to predict novel DMIs in
human PPIs. We applied this strategy to 62 PPIs from the HuRI dataset
in which the proteins are disease-associated. This strategy improved
the sensitivity but decreased the specificity of AF-MM. We then
manually inspected high-scoring models and selected some of them for
experimental validation. Using a plate-based bioluminescence resonance
energy transfer (BRET) assay, known for its sensitivity in detecting
point mutation effects and motif-mediated protein-protein interactions
(PPIs), we tested 28 of the 62 PPIs; BRET signals were significant
for 11 of these 28 PPIs. Using the predicted structures, we selected
key interacting residues that are also conserved, and designed
mutations that could disrupt the predicted interface as well as
deletions of the predicted motif. We further validated seven predicted
interfaces. Moreover, we discovered a novel interface between PEX3 and
PEX16 and proposed a model for their interaction with PEX19. However,
our experimental data also revealed inaccuracies and limitations of AF
predictions, particularly for the FBXO28-STX1B, STX1B-VAMP2,
ESRRG-PSMC5 and TRIM37-PNKP interfaces, which need further study for
interface elucidation.
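The fragmentation idea described above can be sketched as a sliding window over a protein sequence. This is a generic illustration and not necessarily the scheme used in the project: the function name, the window size, and the step are assumptions, and the example sequence is a placeholder.

```python
def overlapping_fragments(sequence, size, step):
    """Return (start, end, fragment) tuples with 1-based inclusive coordinates.

    Consecutive fragments overlap when step < size; a final fragment is added
    if needed so that the C-terminal end of the sequence is always covered.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if len(sequence) <= size:
        return [(1, len(sequence), sequence)]
    starts = list(range(0, len(sequence) - size + 1, step))
    if starts[-1] + size < len(sequence):
        starts.append(len(sequence) - size)
    return [(s + 1, s + size, sequence[s:s + size]) for s in starts]

# Placeholder sequence of ten residues, window of four, step of three.
fragments = overlapping_fragments("ABCDEFGHIJ", size=4, step=3)
```

Each fragment of one protein would then be paired with fragments of its interaction partner and submitted for structure prediction separately.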
In summary, this project provided a thorough assessment of AF-MM
and its metrics, and a protein fragmentation strategy for predicting
novel PPI interfaces, which was successfully applied to proteins likely
associated with neurodevelopmental disorders. Our predictions,
experimentally validated for 6 of 7 novel interfaces, offer molecular
insights while also highlighting the potential limitations of AF-MM
and the need for further advancements to increase prediction accuracy.
So far, this is the largest effort in using AF-MM for PPI interface
prediction coupled with experimental validation of the predicted
interfaces.
Article
Systematic discovery of protein interaction interfaces
using AlphaFold and experimental validation
Chop Yan Lee 1,5, Dalmira Hubrich 1,5, Julia K Varga 2,5, Christian Schäfer 1, Mareen Welzel1,
Eric Schumbera1,4, Milena Djokic1, Joelle M Strom 1, Jonas Schönfeld 1, Johanna L Geist1, Feyza Polat1,
Toby J Gibson 3, Claudia Isabelle Keller Valsecchi1, Manjeet Kumar3, Ora Schueler-Furman 2✉ &
Katja Luck 1✉
Abstract
Structural resolution of protein interactions enables mechanistic
and functional studies as well as interpretation of disease variants.
However, structural data is still missing for most protein
interactions because we lack computational and experimental tools at
scale. This is particularly true for interactions mediated by short
linear motifs occurring in disordered regions of proteins. We find
that AlphaFold-Multimer predicts with high sensitivity but limited
specificity structures of domain-motif interactions when using
small protein fragments as input. Sensitivity decreased
substantially when using long protein fragments or full length proteins.
We delineated a protein fragmentation strategy particularly suited
for the prediction of domain-motif interfaces and applied it to
interactions between human proteins associated with
neurodevelopmental disorders. This enabled the prediction of highly confident
and likely disease-related novel interfaces, which we further
experimentally corroborated for FBXO23-STX1B, STX1B-VAMP2,
ESRRG-PSMC5, PEX3-PEX19, PEX3-PEX16, and SNRPB-GIGYF1
providing novel molecular insights for diverse biological
processes. Our work highlights exciting perspectives, but also reveals
clear limitations and the need for future developments to maximize
the power of Alphafold-Multimer for interface predictions.
Keywords AlphaFold; Protein Interaction Interface Prediction; Linear
Motifs; Benchmarking; Experimental Validation
Subject Categories Computational Biology; Structural Biology
https://doi.org/10.1038/s44320-023-00005-6
Received 3 August 2023; Revised 4 December 2023;
Accepted 5 December 2023
Published online: 15 January 2024
Introduction
Protein-protein interactions (PPIs) are essential for the proper
functioning of essentially all cellular processes. The last decade has
seen tremendous progress in the systematic mapping of human
protein interactions enabling gene function prediction and the
study of genotype-to-phenotype relationships (Luck et al, 2020;
Drew et al, 2017; Huttlin et al, 2021). However, to understand the
molecular function of individual PPIs, co-existence or mutual
exclusivity of partner proteins in protein complexes, and the effect
of mutations on protein function, structural information on how
these proteins interact with each other is required. Unfortunately, a
structure at atomic resolution is only available for ~4% of known
human PPIs (Luck et al, 2020). Modular proteins interact with each
other using a variety of different functional elements such as stably
folded domains, intrinsically disordered polypeptide regions, short
linear motifs (hereafter referred to as motifs), or coiled-coil helices
forming domain-domain, domain-motif, disorder-disorder, or
coiled-coil interfaces for example. Resources such as 3did (Mosca
et al, 2014) or the ELM database (ELM DB) (Kumar et al, 2022)
collect observed contacts between domain types and between
domains and motifs, respectively. Such interface type collections
can be used to predict occurrences of known interface types in
protein interactions (Weatheritt et al, 2012; Mosca et al, 2013).
However, it is reasonable to expect that many more protein
interface types remain to be discovered. This is likely particularly
true for motif-mediated PPIs, which are anticipated to number in
the hundreds of thousands or millions (Tompa et al, 2014). Motifs
are short stretches of amino acids in disordered regions of proteins
that usually adopt a more rigid structure upon binding to folded
domains in interaction partners (Davey et al, 2012).
Motif-mediated interactions are of moderate binding affinity and thus,
are particularly suited to mediate dynamic cell regulatory and
signaling events (Van Roey et al, 2012). However, due to the
transient nature of their interactions and the disorderliness of
motif-containing proteins, this mode of binding is also expected to
be highly understudied. Systematically generated human protein
interactome maps (Luck et al, 2020; Huttlin et al, 2021) are likely a
treasure trove for the discovery of novel interface types, yet no good
experimental or computational methods exist to systematically map
or predict protein interaction interfaces at scale.
1Institute of Molecular Biology (IMB) gGmbH, 55128 Mainz, Germany. 2Department of Microbiology and Molecular Genetics, Institute for Biomedical Research Israel-Canada,
Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel. 3Structural and Computational Biology Unit, European Molecular Biology Laboratory,
Heidelberg 69117, Germany. 4Present address: Computational Biology and Data Mining Group Biozentrum I, 55128 Mainz, Germany. 5These authors contributed equally: Chop
Yan Lee, Dalmira Hubrich, Julia K Varga. ✉E-mail: ora.furman-schueler@mail.huji.ac.il; k.luck@imb-mainz.de
© The Author(s) Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 75
Molecular Systems Biology Chop Yan Lee et al
The release of the neural network-based software AlphaFold (AF) was not only a breakthrough for the prediction of monomeric structures of proteins (Jumper et al, 2021) but multiple studies published shortly thereafter also suggested the ability of AF to predict structures of pairwise protein interactions and complexes. Sensitivities of around 70% were reported using benchmark datasets of structurally resolved protein interactions originally developed to evaluate docking methods (Akdel et al, 2022; Bryant et al, 2022; Johansson-Åkhe et al, 2021; preprint:Evans et al, 2021). Other studies focused on structures of domain-motif interfaces to specifically evaluate AF’s ability to predict structures for this mode of binding, reporting similar success rates (Akdel et al, 2022; Johansson-Åkhe et al, 2021; Tsaban et al, 2022). Only a few studies have also evaluated AF’s specificity for the prediction of interface structures using controls such as random protein pairs or mutation of motifs to poly-alanine stretches (Akdel et al, 2022; Johansson-Åkhe et al, 2021; Tsaban et al, 2022). Different benchmarking studies used different versions of AF and reported on different metrics for their ability to distinguish good from bad structural models (Bryant et al, 2022; O’Reilly et al, 2023; Tsaban et al, 2022; preprint:Evans et al, 2021; Teufel et al, 2023). We generally lack a comprehensive assessment of the latest AF releases and metrics across different types of PPI interfaces for their sensitivity, specificity, and potential biases for the prediction of complex structures.

In a landmark study, researchers applied AF to 65,000 human PPIs derived from a yeast two-hybrid-based interactome map (hereafter referred to as HuRI) and highly confident co-complex associations to structurally annotate the human interactome with AF-derived models. High confidence models were obtained for about 3000 PPIs (Burke et al, 2023). The authors noted a smaller fraction of highly confident structural models obtained for PPIs from the HuRI dataset compared to the co-complex dataset and reported that proteins in HuRI contain more intrinsic disorder and are less conserved compared to proteins from co-complex datasets. AF model confidence scores also increased for PPIs with proteins that are less disordered and more conserved, indicating that AF predictions work less well for PPIs mediated by interfaces involving disordered regions such as domain-motif interfaces, which likely dominate the human interactome (Tompa et al, 2014). However, AF benchmarking studies reported similarly high success rates for domain-motif interfaces compared to general docking benchmark datasets (Tsaban et al, 2022; Akdel et al, 2022). These discrepancies in sensitivities could be a result of two possible factors. First, they might point to differences in AF performance if small interacting fragments are used for interface prediction, as done in the benchmark studies, versus full length sequences used for structure prediction in Burke et al (2023). Second, these discrepancies could also point to difficulties of AF to predict structures of interface types involving disordered regions that have not been solved before, of which there are likely many in HuRI. It remains to be addressed to what extent these two possible factors contribute to the challenges encountered specifically for domain-motif interface modeling.

Determination of accuracies of novel predicted interface structures by AF ultimately requires experimentation. AF interface predictions for individual PPIs have occasionally been experimentally corroborated (Mishra et al, 2023; Bronkhorst et al, 2023). A more systematic experimental confirmation of AF interface models has been conducted using crosslinking mass spectrometry (XL-MS) (Burke et al, 2023; O’Reilly et al, 2023). While in-cell XL-MS is a very elegant approach to obtain experimental information on PPI interfaces in unperturbed settings, it is still a method that is only accessible to few experts in the field. Other experimental approaches are needed, which can, ideally at high throughput, confirm predicted interfaces for PPIs. In this study, we thoroughly benchmarked the two most recent versions of AlphaFold-Multimer (hereafter referred to as AF) for their ability to predict domain-domain and domain-motif interfaces (DDIs and DMIs). We found that prediction accuracies drop when using longer protein fragments or full length proteins for interface predictions and developed a strategy particularly suited for the prediction of novel domain-motif interfaces in human PPIs. We applied this strategy to 62 PPIs from HuRI that connect disease-associated proteins and experimentally assessed the obtained interface predictions for seven PPIs using a plate-based bioluminescence resonance energy transfer (BRET) assay (Trepte et al, 2018) combined with site-directed mutagenesis. We identify novel interface types and report on important limitations and sources of errors in AF-derived structural models, which pave the way for future improvements in the field.

Results

Evaluating AlphaFold’s accuracy for predicting domain-motif interfaces

To thoroughly assess the ability of AF to predict structures of binary protein complexes that are formed by a DMI, we extracted information on annotated DMI structures from the ELM DB (Kumar et al, 2022). We selected one representative structure per motif class (136 structures in total), manually defined the minimal domain and motif boundaries, and submitted the corresponding protein sequence fragments for interface prediction to AF (Fig. 1A; Dataset EV1). The domain sequences from this benchmark dataset mostly shared 20–30% sequence identity (Appendix Fig. S1A). To evaluate the accuracy of the predicted structural models, we superimposed the actual structure and predicted model on their domains and, based on this superimposition, computed the all-atom RMSD between the motif of the predicted model and the actual structure (Fig. 1A). We found that 35% of the structural models were so accurately predicted that even the side chains of the motif were correctly positioned, while for another 32% the backbone but not the side chains of the motif were accurately predicted. For 26% of the structures the motif was modeled into the correct pocket, but in a wrong conformation, while, for the remainder of the structures, AF failed to identify the right pocket (Fig. 1A; Dataset EV1). A similar performance was obtained when using the DockQ metric (Appendix Fig. S1B,C; Dataset EV1). This performance is unaltered when using or switching off AF’s template function (Fig. S1D,E). The use of DMI structures annotated by the ELM DB enables us to explore potential differences in AF’s performance regarding motif properties. We find no significant differences in average model accuracy between different categories of motif classes (two-sided Mann–Whitney test on all pairwise combinations, n: DEG = 10, DOC = 21, LIG = 94, TRG = 9, MOD = 2, α = 0.05, test statistics of all pairwise combinations between 15 and 852, Appendix Fig. S1F), although the variance in model accuracy appears to differ between the motif classes. Similarly, we found no significant difference in prediction accuracy when
[Figure 1 image not reproduced; recoverable panel text follows. (A) Workflow: 136 solved DMI structures (motifs with unsolved residues or PTMs excluded), annotation of minimal interacting regions, AF prediction, superimposition on the domain, RMSD calculation on the motif. Outcome: correct side chains 48 (35%), correct backbone 43 (32%), correct pocket 35 (26%), wrong pocket 10 (7%). (B) Area under the curve (0.0 to 1.0) for three random reference sets (1 mutation in motif, 2 mutations in motif, randomly paired DMI); metrics: model confidence, domain chain interface pLDDT, motif chain interface pLDDT, average interface pLDDT, pDockQ, iPAE, residue-residue contact, atom-atom contact. (C) MOD_SUMO_rev_2: UBE2I & PPIL4, PDB 1KPS, motif sequence EEIKAEKEAKTQAILLEM. (D) CLV_C14_Caspase3-7: CASP3 & ARHGDIB, PDB 5IAN, motif sequence EDDDDELDSKLNYKP. (F) Labeled residues: L257, S225, Y306, L138.]
Figure 1. Benchmarking and application of AF for DMI interface prediction using minimal interacting fragments.
(A) Schematic illustrating the assembly of the DMI positive reference dataset and evaluation of AF prediction accuracies by superimposition of the solved and modeled
structures. Blue and cyan indicate the domain and motif in the native structure, respectively. Orange and yellow indicate the domain and motif in the modeled structure,
respectively. Proportion of structures of DMIs predicted by AF to different levels of accuracy is shown on the right. (B) Area under the Receiver Operating Characteristics
Curve (AUROC) for different metrics using the DMI benchmark dataset as positive reference and the following different random reference sets: Left, 1 mutation introduced
in conserved motif position; middle, 2 mutations introduced in conserved motif positions; right, random reshuffling of domain-motif pairs. Gray horizontal line indicates the
AUROC of a random predictor. (C) Superimposition of AF structural model for motif class MOD_SUMO_rev_2 (orange) with homologous solved structure (PDB:1KPS)
from motif class MOD_SUMO_for_1 (blue). The motif sequence used for prediction is indicated at the bottom, colored by pLDDT (dark blue=highest pLDDT). (D)
Superimposition of AF structural model for motif class CLV_C14_Caspase3-7 (orange) with homologous structure (PDB:5IAN) solved with a peptide-like inhibitor (blue).
The motif sequence used for prediction is indicated at the bottom, colored by pLDDT (dark blue=highest pLDDT). (E) AF prediction of a LIG_HCF-1_HBM_1 motif in
CREBZF (orange) binding to the beta-propeller Kelch domain of HCFC1 (gray). Mutated domain residues for experimental testing are colored in green. (F) Close up on the
interface shown between CREBZF and HCFC1 from (E). Coloring is the same as in (E). Key conserved motif residues are drawn as sticks. Mutated residues in the domain
and motif for experimental testing are labeled. (G) BRET titration curves are shown for wildtype interactions and mutant constructs for CREBZF-HCFC1 pairs for two
biological replicates, each with three technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and
luminescence measurements, respectively.
stratifying by the secondary structure elements adopted by the motifs (two-sided Mann–Whitney test on all pairwise combinations, n: helix = 42, strand = 7, loop = 87, α = 0.05, test statistics of all pairwise combinations between 184 and 2029, Appendix Fig. S1G), nor by how hydrophobic, symmetric, or degenerate the motif sequence is (absolute Pearson r < 0.08, α = 0.05, Appendix Fig. S1H–J). AF models display significantly more differences to structures solved by other methods, i.e., NMR, than X-ray crystallography (two-sided Mann–Whitney test, n: X-ray = 115, Others = 21, p < 0.01, test statistics = 811, Appendix Fig. S1K) possibly because
NMR structures better represent structural dynamics that AF cannot capture, since it was trained to predict the crystallized forms of proteins.

The all-atom motif RMSD significantly anti-correlates with various AF-derived metrics (Pearson r = −0.55, p-value < 0.05, Appendix Fig. S1L,M; Dataset EV1) suggesting that these metrics are indicative of good versus bad structural models and can be used for de novo interface predictions. To evaluate AF’s ability to identify highly confident structural models of DMIs, we generated three different random DMI datasets. First, we randomly paired domain and motif sequences from the positive reference dataset taking into account that no motif sequence was paired with a domain sequence from the domain type that the motif is known to interact with. Second and third, we mutated one and two key motif residues, respectively, to residues of opposite chemico-physical properties. Based on the conservation of these key motif residues, we assume that the mutations would be disruptive to binding, at least when experimentally tested using minimal interacting protein fragments. Receiver operating characteristic (ROC) and precision-recall (PR) curves using the positive and random datasets (Fig. 1B; Appendix Fig. S2A,B; Dataset EV2) show that the domain interface residue pLDDT (for all metric definitions, see Methods) or the number of atoms or residues predicted to be in contact with each other, discriminated poorly between all reference datasets (AUC around 0.64). Furthermore, we observed that all tested metrics failed to discriminate interacting from non-interacting interfaces when mutating one motif residue (max AUC 0.66). However, the AF-derived metrics model confidence (preprint:Evans et al, 2021), average interface residue pLDDT, average motif interface residue pLDDT, pDockQ (Bryant et al, 2022), and iPAE (Teufel et al, 2023) discriminated well between both reference datasets when randomizing domain-motif pairs or introducing two motif mutations (max AUC 0.86, ROC statistics and ideal cutoffs can be found in Dataset EV2). We also evaluated whether the top 5 reported models by AF tend to be more similar to each other when corresponding to a correct structural model (Pozzati et al, 2022) and found that this feature has moderate predictive power (Appendix Fig. S2C).

Application of AlphaFold for providing structural models for motif classes without available structural data

After evaluating the accuracy of AF to predict DMIs using minimal interacting regions, we aimed to use this setup for the prediction of structural models for motif classes in the ELM DB for which no structure of a complex has been solved yet. We identified 125 such motif classes based on ELM DB annotations. Of those, we selected all domain-motif instances where both the motif and the domain were derived from human or mouse proteins and submitted the corresponding domain and motif sequences for structure prediction to AF (Dataset EV3). Using a motif chain pLDDT cutoff of >70, we obtained confident structural models for 21 motif classes. We manually inspected the structural models and noticed that even though these ELM classes have no annotations with structures, solved structures for an exact ELM instance or a very likely new instance for the ELM class are available for 11 out of the 21 cases. For most others, a close homolog structure had been solved, e.g., for LIG_MYND_3 and LIG_MYND_1, a structure solved by NMR for a LIG_MYND_2 interaction is available (Appendix Fig. S2D,E). For MOD_SUMO_rev_2, a structure of a reversed motif is available (and annotated as such in the MOD_SUMO_for_1 class). Here it is interesting to see how very dissimilar binding modes (flexible for MOD_SUMO_for_1, helical for MOD_SUMO_rev_2) are still able to place the important binding residues in the same pockets (Fig. 1C). For CLV_C14_Caspase3-7, the structure of the caspase bound to peptide-like inhibitors has been solved (e.g. PDB:1F1J, PDB:5IAN, PDB:6KMZ), and structures of more distant caspases bound to a cleaved peptide substrate are also available. For proteases, one great advantage of AF is the ability to model both the catalytically active enzyme and an uncleaved substrate, which is practically impossible to solve experimentally (Fig. 1D).

Finally, for LIG_HCF-1_HBM_1 we were not able to identify a homologous structure in the PDB, hence, our AF-derived structural models for this motif class are likely novel. Motifs of this class are bound by the N-terminal beta-propeller Kelch domain of HCFC1 consisting of six Kelch repeats. Kelch domains have been shown to bind to motifs at a number of different sites, and thus, without prior knowledge, it is difficult to determine where the HCFC1-binding motif (HBM) would bind. HCFC1 is a transcription factor that associates with other transcription factors (Lu et al, 1997), splice factors (Ajuh et al, 2002), and cell cycle regulators (Freiman and Herr, 1997; Machida et al, 2009). We generated AF models of high confidence for the HCFC1 Kelch domain interacting with multiple motif instances that are annotated in the ELM DB. All complexes show the tyrosine of the motif docked into a deep pocket at the bottom/top of the Kelch domain (Fig. 1E,F; Appendix Fig. S2F–H), with slight variations in how the tyrosine is exactly positioned in the pocket (Fig. S2F–H). Based on clone availability we selected the structural model between HCFC1 and CREBZF for experimental validation. For this purpose, we used a BRET protein interaction assay that is based on transient overexpression of two proteins in HEK293 cells (Trepte et al, 2018). Both proteins are expressed as fusion constructs either to the NanoLuc luciferase (the donor) or mCitrine (the acceptor). Interaction of both proteins results in a BRET from the oxidized substrate of the donor to the acceptor molecule, if both are close enough to each other for the BRET to occur (see Methods for details). We observed significant binding and BRET saturation when assaying wildtype CREBZF and HCFC1 proteins (Fig. 1G; Appendix Fig. S2I,J). Mutation of the [DE]H.Y motif tyrosine to alanine (Y306A) or mutation of two residues in the Kelch domain pocket (L257F, L138F), which are modeled to be in contact with the motif tyrosine or histidine residue (Fig. 1F), strongly reduced BRET signals indicating weakening or loss of binding (Fig. 1G; Appendix Fig. S2I,J). A pathogenic mutation (S225N, source ClinVar (Henrie et al, 2018)) close to the pocket slightly reduced expression levels of HCFC1 but did not result in loss of binding (Fig. 1F,G; Appendix Fig. S2I,J). Our experiments suggest that a potential pathogenic mechanism of this mutation is not mediated via perturbed binding of partners to the Kelch repeat domain pocket of HCFC1 that we identified in this study. Unfortunately, no assertion criteria for the annotation of this mutation to be pathogenic are provided by ClinVar, meaning that the mutation is either not pathogenic after all or its pathogenicity is mediated via another perturbed function not tested in this study. Collectively, these experimental results support the structural models of the HCFC1 Kelch domain pocket-motif interaction and overall provide highly confident structural models for multiple motif classes of the ELM DB without available structural information (Dataset EV4).
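The [DE]H.Y consensus mentioned above for the HCFC1-binding motif is an ELM-style regular expression, so the effect of the tyrosine-to-alanine substitution on the sequence pattern can be illustrated with a few lines of Python. This is only a sketch: the peptide below is hypothetical (it is not the CREBZF sequence), the helper names are invented for illustration, and a pattern match says nothing about binding on its own.

```python
import re

# ELM-style consensus quoted in the text: [DE]H.Y
# ([DE] = Asp or Glu, H = His, . = any residue, Y = Tyr).
HBM_PATTERN = re.compile(r"[DE]H.Y")


def find_motif(sequence: str):
    """Return (1-based start, matched residues) for every pattern hit."""
    return [(m.start() + 1, m.group()) for m in HBM_PATTERN.finditer(sequence)]


def mutate(sequence: str, position: int, new_residue: str) -> str:
    """Substitute one residue at a 1-based position."""
    return sequence[: position - 1] + new_residue + sequence[position:]


# Hypothetical peptide for illustration only (NOT the CREBZF sequence).
wildtype = "GSAQDHSYSLPG"
hits = find_motif(wildtype)
print(hits)

# Tyr -> Ala at the motif tyrosine, analogous in spirit to Y306A.
tyr_pos = hits[0][0] + 3
mutant = mutate(wildtype, tyr_pos, "A")
print(find_motif(mutant))
```

In this toy case the wildtype peptide contains one match and the mutant contains none, which mirrors why a substitution at the conserved tyrosine is expected to disrupt the motif.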
Figure 2. Effect of protein fragment extensions on the accuracy of AF predictions.
(A) Workflow established to assess changes in AF performance upon protein fragment extension. Blue and cyan indicate the domain and motif in the native structure,
respectively. Orange and yellow indicate the domain and motif in the modeled structure, respectively. (B) Heatmap showing the fold change in motif RMSD before and
after extension where positive values indicate improved predictions from extension and negative values indicate worse prediction outcomes upon extension. (C) Heatmap
of the average model confidence for combinations of different motif and domain sequence extensions. (D) Optimal cutoffs derived for different metrics from ROC analysis
benchmarking AF on different motif and domain extensions from the reference dataset used in (A) and random pairings of domain and motif sequences. pLDDT-related metrics
were divided by 100 for visualization purposes. (E, F) Superimposition of the structural model of the minimal (left, orange) or extended (right, yellow) motif sequence with
the solved structure (motif in blue) for two different motif classes as indicated on the top of each panel. The motif sequence from the solved structure is indicated at the
bottom. Motif residues are underlined, motif residues not resolved in the structure have a gray background. Sticks indicate the motif residues, domain surfaces are shown
in gray based on experimental structures. (G) Superimposition of the structural model of the minimal (orange) and extended (yellow) motif sequence with the solved
structure (motif in blue) for a motif instance from the motif class LIG_BIR_III. Motif sequence indicated as in (E). (H) Area under the Receiver Operating Characteristics
Curve (AUROC) for different metrics using the DDI benchmark dataset as positive reference and randomly shuffled domain-domain pairs as random reference. Gray
horizontal line indicates the AUROC of a random predictor.
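The AUROC values and metric cutoffs referred to in the figure legends can be illustrated with a small, self-contained sketch. It computes the AUROC as the fraction of positive/random pairs in which the positive scores higher, and picks a cutoff by maximizing Youden's J (TPR minus FPR). Both the cutoff criterion and the scores are assumptions for illustration: the text does not state how the optimal cutoffs were derived, and the numbers below are made up rather than taken from the datasets.

```python
from typing import List, Tuple


def auroc(pos: List[float], neg: List[float]) -> float:
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_cutoff(pos: List[float], neg: List[float]) -> Tuple[float, float]:
    """Return (cutoff, J) maximizing Youden's J = TPR - FPR for the rule score >= cutoff."""
    best = (float("inf"), 0.0)
    for c in sorted(set(pos + neg)):
        tpr = sum(s >= c for s in pos) / len(pos)
        fpr = sum(s >= c for s in neg) / len(neg)
        if tpr - fpr > best[1]:
            best = (c, tpr - fpr)
    return best


# Made-up scores on a 0-100 scale, for illustration only (not data from the study).
positives = [92.0, 85.0, 78.0, 71.0, 55.0]
negatives = [74.0, 52.0, 48.0, 41.0, 30.0]

print(auroc(positives, negatives))
print(best_cutoff(positives, negatives))
```

A random predictor gives an AUROC of 0.5 under this definition, which corresponds to the gray horizontal line described in the legends.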
Evaluation of AlphaFold’s ability to predict interfaces in full length proteins

Most PPIs known to date have been identified using full length protein sequences in systematic interactome mapping efforts. For the vast majority of these PPIs, no fragment or interface information is available. Thus, the question emerges how AF would perform on DMI predictions when longer protein sequences or full length proteins are submitted. To answer this question we selected 31 DMI structures from the positive reference dataset used above and generated random domain-motif pairs of those as negative control. The selected structures were sampled from different prediction accuracy categories (Fig. 1A; Dataset EV5). We then gradually extended the motif and domain sequences by first adding flanking disordered regions, then neighboring folded domains before using the full length sequences (Fig. 2A). Comparison of the motif RMSD computed for extended versus minimal domain-motif pairs from the positive reference dataset revealed that the addition of flanking disordered regions on the motif or domain side sometimes slightly improved prediction accuracies while the addition of neighboring structured domains or the use of full length sequences led to a significant worsening of model accuracies (Fig. 2B; Dataset EV5). Interestingly, despite the fact that, for smaller extensions, model accuracies remained the same or slightly improved as determined by motif RMSD, AF-derived metrics such as the model confidence or average motif
interface residue pLDDT gradually dropped with increasing fragment length (Fig. 2C; Appendix Fig. S3A–C). ROC plots of predictions for a benchmark consisting of the positive and random domain-motif pairs revealed that, upon extension, the optimal cutoff of model confidence and iPAE considerably changed as well (Fig. 2D; Appendix Figs. S3D,E, S4A; Dataset EV6). This means that different model confidence or iPAE cutoffs are to be used depending on the length of the submitted protein sequences, which is rather impractical and thus disfavors both metrics for DMI predictions. The average motif interface residue pLDDT metric appeared to be more robust with respect to fragment length. Based on these results we chose this as the main metric and a cutoff of 70 to discriminate good from bad AF-generated DMI models regardless of fragment length.

Extending motif sequences for interface prediction with AlphaFold reveals important motif sequence context

Various studies have highlighted that flanking sequences of motifs can influence binding affinities and specificities (Luck et al, 2012; Bugge et al, 2020). Motif annotations in the ELM DB usually refer to the core sequence of the motif, often because information on putative roles of flanking sequences is missing. In the previous section, we observed that some motif extensions notably improved AF prediction accuracies. In the hope that these cases would point to motifs with important sequence context, we manually inspected eight predictions for which the motif RMSD decreased by more than 1 Å when extending the minimal motif sequence once to the left and right by the length of the motif (extension step 1 in Fig. 2A; Appendix Fig. S4B).

By doing so interesting patterns emerged: The most prevalent contribution to increased prediction accuracies is the stabilization of the secondary structure of the motif contributed by both sidechain and backbone atoms in the flanking regions, as shown for the interaction involving the motif LIG_CAP-Gly_2 (Fig. 2E; Appendix Fig. S4C). For the LIG_NBox_RRM_1 motif, AF placed a part of the domain into the binding pocket rather than the motif, although the motif had the correct helical conformation. Elongation of the motif extended this helix, thereby increasing the interaction surface and eventually pushing out the domain’s tail from the pocket (Fig. 2F). This fits with other reports where AF has been shown to predict preferential binding of competing motifs (Chang and Perez, 2023). For the LIG_HOMEOBOX class prediction, the motif is positioned in the wrong pocket unless flanking regions are included (Appendix Fig. S4C). For DOC_MAPK_JIP1_4, motif extension results in an extended motif conformation and consequently in a structural model with lower overall RMSD (Appendix Fig. S4C). For the LIG_GYF class, most models converge into an inverse orientation of the backbone except for one of the extended motifs, which lies in the binding pocket in the correct orientation (Appendix Fig. S4C). In summary, these analyses point to motif classes whose sequence boundaries could be refined.

Interestingly, for a motif instance from the LIG_BIR_III_2 class, slight motif extensions actually led to a substantial decrease in prediction accuracy. In this case, the motif is located at a neo-N-terminus that is only revealed after cleavage of the protein by a caspase (Fig. 2G). When the motif is extended in the context of the full length protein, the residues now upstream of the previous neo-N-terminus likely impede binding of the motif into the pocket due to steric clashes. AF predicts the extended motif to bind in reversed orientation and it is mostly pushed out of the pocket. This highlights the importance of not only incorporating sequence context but also knowledge about the biological context, wherever possible, into AF modeling and model interpretation.

Evaluating AlphaFold’s performance for the prediction of domain-domain interfaces

Folded domains can not only interact with motifs but also with other folded domains forming so-called domain-domain interfaces (DDIs). To enable simultaneous prediction of DDIs and DMIs in a given protein interaction, we set out to evaluate AlphaFold’s performance on DDI predictions using a reference dataset of 48 DDI structures that we manually curated out of random selections of domain-domain contact pairs extracted from 3did (Mosca et al, 2014). As a negative dataset, we randomized the pairing of these domains. Using ROC and PR statistics we found that AlphaFold performed slightly worse on this DDI benchmark dataset compared to its performance on DMIs (max AUC 0.73 vs. 0.86) (Fig. 2H; Appendix Fig. S4D–F; Dataset EV7) but still showed significant discriminative power. Interestingly, the best performing metric for DDI predictions was the average interface pLDDT score with an optimal cutoff of 75, which ranked fourth for DMI predictions.

Comparison of AlphaFold v2.2 with v2.3

During the course of our work, AF multimer version 2.3 was released. To determine whether the new release improved DMI and DDI prediction accuracies, we repeated all benchmarking with AF v2.3 and found that motif RMSDs and other AF-derived metrics on average improved compared to AF v2.2 when using minimal interacting fragments (Appendix Fig. S5A–D; Dataset EV1, two-sided Wilcoxon signed-rank test on motif all-atom RMSD: n = 136, W = 2413, p < 0.0001). AF v2.3 still showed a decrease in prediction accuracy when using extended protein fragments but this decrease was less pronounced compared to the corresponding decrease for v2.2 (Appendix Fig. S5E,F; Dataset EV5). Despite these improvements on the sensitivity side of AF, when benchmarked against random datasets, overall prediction accuracies only slightly improved compared to v2.2 (Appendix Fig. S5G,H; Appendix Fig. S6A–C; Dataset EV2, EV6, EV7, EV8).

Application of AlphaFold for the discovery of novel interfaces in protein interactions without any a priori interface information

Since the use of larger or full length protein sequences leads to a poor sensitivity for DMI predictions by AF, we devised the following strategy for the use of AF for interface predictions for known protein interactions: Using AF models of the full length monomeric structures of both interacting proteins, we decided on boundaries between structured domains and disordered regions based on manual inspection (see Methods). We then fragmented the disordered regions by designing overlapping fragments varying in length from ten residues up to the length of the respective disordered region (Fig. 3A). We then paired disordered with ordered, and ordered with ordered fragments for interface prediction by AF (Fig. 3A). To assess to which extent this
80 Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 © The Author(s)
Chop Yan Lee et al Molecular Systems Biology
[Figure 3 graphic, panels A–F, not reproduced. Recoverable labels: (A) proteins A and B; Fragments; AlphaFold; Highest scoring and repeatedly identified interface. (B) Longest extensions; motif classes on the x-axis. (C, D) AF prediction result based on inspection and solved structures: Correct, Likely correct, Questionable, Likely wrong, Wrong, No result obtained, No prediction performed. BRET detection: Not tested; Tested, no interaction; Tested, interaction. Not mutated; Mutated. (E) x-axis: Number of PPIs; pDockQ score from Burke et al.: No score, <0.23, 0.23−0.5, >0.5.]
Figure 3. AF prediction and experiments on PPIs connecting NDD proteins.
(A) Schematic of the fragmentation approach applied on a pair of interacting proteins, A and B. Proteins are fragmented into folded and disordered regions based on
manual inspection. Disordered regions are further fragmented. All disordered and folded fragments of one protein are paired with the folded regions of the other protein
and vice versa for AF prediction. (B) Accuracy measured in motif RMSD compared to native structures for models obtained from fragmenting proteins from 20 DMIs from
the positive reference dataset and comparison to model accuracy obtained when using (near) full length proteins for structure prediction (red crosses). Only models that
meet the cutoff for identifying high confident models are shown. Six DMIs did not result in any such model. The gray horizontal line indicates the RMSD cutoff used to
identify accurate models (see methods for details). (C) AF prediction outcome on 67 HuRI PPIs connecting NDD proteins. (D) PPI networks illustrating AF prediction
outcomes and experimental retesting of PPIs in BRET assay. (E) Number of PPIs connecting NDD proteins with structural models at indicated pDockQ cutoffs from (Burke
et al, 2023) grouped based on AF prediction outcomes using the fragmentation approach as shown in (C). (F) cBRET, total luminescence, and fluorescence for 28 PPIs
connecting NDD proteins that were tested in the BRET assay. Luminescence and fluorescence measurements indicate expression levels of NL and mCit fusion proteins,
respectively. Black horizontal lines indicate expression level and PPI detection cutoffs. The gray vertical line separates the detected (left) from undetected PPIs. Protein
pairs in bold indicate those selected for interface validation via site-directed mutagenesis. Error bars indicate STD of three technical replicates. Source data are available
online for this figure.
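The fragmentation scheme in panel (A) can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the study: the tiling rule (window lengths doubling from ten residues, windows overlapping by half) and the names `tile_region` and `pair_fragments` are assumptions, since the text only states that overlapping fragments range from ten residues up to the full length of the disordered region. Domain boundaries are taken as given, as they were set by manual inspection.

```python
from itertools import product


def tile_region(start, end, min_len=10):
    """Tile a disordered region (1-based, inclusive) into overlapping fragments.

    Assumed rule (hypothetical): window length doubles from min_len up to the
    region length, and windows of one length overlap by half.
    """
    region_len = end - start + 1
    if region_len <= min_len:
        return [(start, end)]
    fragments = []
    length = min_len
    while length < region_len:
        step = max(1, length // 2)
        pos = start
        while pos + length - 1 <= end:
            fragments.append((pos, pos + length - 1))
            pos += step
        # Add a final window so the C-terminal end is always covered.
        last = (end - length + 1, end)
        if fragments[-1] != last:
            fragments.append(last)
        length *= 2
    fragments.append((start, end))  # the full disordered region itself
    return fragments


def pair_fragments(protein_a, protein_b):
    """Pair ordered-ordered and disordered-ordered fragments of two proteins.

    Each protein is a dict with "ordered" and "disordered" lists of
    (start, end) regions. Disordered-disordered pairs are not generated.
    """
    def split(protein):
        disordered = [f for region in protein["disordered"]
                      for f in tile_region(*region)]
        return protein["ordered"], disordered

    ord_a, dis_a = split(protein_a)
    ord_b, dis_b = split(protein_b)
    pairs = list(product(ord_a, ord_b))    # ordered A with ordered B
    pairs += list(product(dis_a, ord_b))   # disordered A with ordered B
    pairs += list(product(ord_a, dis_b))   # ordered A with disordered B
    return pairs


# Toy domain architectures (invented for illustration only).
protein_a = {"ordered": [(1, 100)], "disordered": [(101, 140)]}
protein_b = {"ordered": [(1, 50), (80, 200)], "disordered": [(51, 79)]}
fragment_pairs = pair_fragments(protein_a, protein_b)
```

Even for these two small toy proteins the number of fragment pairs grows quickly, which illustrates why one prediction run per protein pair becomes many runs under this strategy.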
fragmentation approach would lead to an increase in sensitivity but also in false model predictions, we selected 20 out of the 31 DMI structures that were previously used to investigate the effect of fragment extension on prediction accuracies. We attempted model prediction with the full length sequences of these 20 DMI pairs and obtained a model for two, of which only one met the motif interface pLDDT cutoff and corresponded to an accurate prediction (TRG_AP2beta_CARGO_1 in Fig. 3B; Dataset EV9, see Methods for details). We then switched to using fragment extension step 5 for motifs and/or 2 for domains (Fig. 2A) and obtained accurate models for an additional 5 of the 20 DMI pairs. Applying the full fragmentation approach to all 20 DMI pairs resulted in accurate model prediction for an additional 6 DMI pairs (Fig. 3B), representing an increase in sensitivity from 5% (full length) to 60% (fragments). We then shuffled the 20 DMI pairs to generate 20 random DMI pairs for which we performed the fragmentation approach. As expected from an earlier estimated 20% false positive rate (FPR) (Appendix Fig. S4A), 19 of the 20 random protein pairs had at least one fragment pair that produced a model above the motif interface pLDDT cutoff (Appendix Fig. S6D; Dataset EV9)
indicating that predictions done using this fragmentation approach can substantially increase sensitivity while also producing a considerable number of false models using the established scoring metrics. This needs to be taken into account when modeling new interactions with this fragmentation strategy, as covered in the following section.

We selected PPIs from HuRI that connect proteins associated with neurodevelopmental disorders (NDDs) and subjected these to our AF fragmentation pipeline to predict putative DMIs and DDIs. For 51 out of 62 PPIs we obtained at least one structural model of significant confidence (Fig. 3C,D). In retrospect, manual inspection of the predictions obtained for these PPIs revealed that, for 9 PPIs, a solved structure of the interface was already available. Reassuringly, six out of these were accurately predicted by AF. For the remainder of the PPIs, 12, 16, and 14 resulted in a likely correct, questionable, or likely wrong prediction, respectively, based on manual inspection of the models (Fig. 3C,D; Dataset EV10). Likely wrong predictions were scored as such based on docking of the protein partner into nucleic acid or metal ion binding or catalytically active sites. We also considered structural models as likely wrong if different protein fragments of the partner were predicted with similarly high scores to bind to the same pocket on the domain. More detailed information can be found in Methods and Appendix Text S1. Of note, for 8 of the 12 PPIs with a likely correct prediction, AF predictions performed using the full length proteins (Burke et al, 2023) did not result in a high confidence prediction (Fig. 3E). 28 of the 62 PPIs were in our hands amenable to experimental testing using the BRET assay introduced earlier (see Methods for details). Significant BRET signals were observed for 11 of these 28 PPIs (Fig. 3F). Of those, 7 PPIs were selected for validating the predicted interfaces (Fig. 3D,F). The remaining four PPIs were not further considered because for three of them a structure already exists (CSNK2B-CSNK2A1, PNKP-XRCC4, UBA5-GABARAPL2) and for the fourth interaction (KCTD7-CUL3) we classified the predicted interface as likely wrong. Next, we will first describe failures in validating predicted interfaces followed by the successes.

For the interaction between PNKP and TRIM37, we obtained high-confidence structural models involving two different interfaces. AF predicted the PNKP FHA domain to bind to several disordered stretches in TRIM37 (Fig. 4A) that are overall negatively charged. These short regions were predicted to bind to a pocket on the FHA domain that is known to bind phosphorylated threonines (Durocher et al, 2000), which led us to conclude that these predictions were likely wrong. AF also predicted the MATH domain of TRIM37 to bind to two separate disordered putative motifs located between the FHA domain and phosphatase domain in PNKP (Fig. 4A–C). However, none of the mutants aimed at disrupting the predicted interfaces (Fig. 4B) involving the MATH domain showed a decrease in BRET signal compared to wildtype (Fig. 4D; Appendix Fig. S7A), indicating that TRIM37 and PNKP do not interact with each other via this interface.

AF predicted with high confidence binding of PSMC5 to the hormone receptor domain of ESRRG via two distinct motifs (Fig. 4E–G) with similarity to LxxLL motifs known to bind this type of domain (LIG_NRBOX in ELM DB). We reproducibly found that none of the motif mutations in PSMC5 decreased binding to ESRRG compared to wildtype while both domain pocket mutations led to a remarkable reduction in BRET signal (Fig. 4H; Appendix Fig. S7B,C), indicating that PSMC5 might bind to ESRRG via this pocket but not with the predicted motifs.

AF predicted a coiled-coil interface between STX1B and VAMP2 of moderate confidence (Fig. 5A,B). STX1B is a close homolog to STX1A, which binds in a 4-helix bundle to VAMP2 together with SNAP25 in a 1:1:2 stoichiometry, respectively, as observed by crystallography (PDB:1N7S (Ernst and Brunger, 2003)). This structure together with our predictions suggests that STX1B might bind VAMP2 in a similar way. Indeed, removal of the single helical SNARE domain in STX1B led to complete loss of binding to VAMP2 (Fig. 5C; Appendix Fig. S8A,B). Interestingly, FBXO28 was predicted by AF to bind to STX1B via a similar coiled-coil interface involving an extended helix in FBXO28 and the SNARE domain in STX1B (Fig. 5A,D). Here, deletion of the SNARE domain in STX1B or of the extended helix in FBXO28 reproducibly reduced, but did not abolish the interaction between STX1B and FBXO28 (Fig. 5E; Appendix Fig. S8C,D). We identified three pathogenic or likely pathogenic mutations in the SNARE domain of STX1B in ClinVar, of which V216E and G226R are associated with generalized epilepsy with febrile seizures plus, type 9. Testing all three mutations in the BRET assay, we observed a drastic decrease in binding for STX1B V216E to FBXO28 (Fig. 5F; Appendix Fig. S8C,D). However, the measured effects of the mutations on the FBXO28-STX1B interaction do not correlate with their location at the predicted interface. V216E, for example, is not predicted to be in contact with residues of FBXO28 (Fig. 5D). This indicates that the actual predicted orientation of the two extended helices with respect to each other is likely incorrect.

The fact that the deletion of the extended helix in FBXO28 or the SNARE domain in STX1B reduced but did not abrogate binding of both proteins to each other (Fig. 5E) suggests that a secondary interface might exist. Indeed, AF predicted additional interfaces between FBXO28 and STX1B involving folded and disordered regions in both proteins (interfaces i and ii in Fig. 5A). Mutations designed to disrupt these interfaces partially confirmed the involvement of some of these regions in binding as assayed with BRET (Appendix Fig. S8E–H). In addition, the pathogenic mutation R348L in FBXO28 predicted to be at interface ii seemed to increase binding to STX1B (Appendix Fig. S8I–L). In summary, our experimental data indicate that multiple regions of FBXO28 and STX1B may be involved in the binding but the exact structural details of this interaction remain to be elucidated. In the following two sections, we will describe in more detail successful interface validations for interactions involving PEX3, PEX19, and PEX16 as well as SNRPB and GIGYF1.

PEX3, PEX19, and PEX16

The interaction interface between PEX19 and PEX3 has been structurally resolved before and consists of an interaction between an N-terminal motif in PEX19 that binds to the cytosolic alpha-helical domain of PEX3 (PDB:3MK4, (Schmidt et al, 2010)). Using corresponding protein fragments, AF predicted a structural model that is highly similar to the solved structure (Fig. 5G; Appendix Fig. S9A,B). We introduced mutations in the PEX19 motif and PEX3 pocket (Appendix Fig. S9A) and found that F29K in the motif weakened but clearly maintained BRET binding signals, indicating the existence of a secondary binding site between both proteins (Fig. 5H; Appendix Fig. S9C,D). Indeed, AF predictions with other
[Figure 4 graphic, panels A–H, not reproduced. Recoverable labels: (A) PNKP domains: FHA, Phosphatase, Kinase; TRIM37 domains: RING, BB, Helix, MATH; interfaces i and ii. (B, C) labeled residues N376, F328, S114, P112; motif sequences RTPESQP and TPLVSQDEKRDAELPKKRM. (E) ESRRG domains: ZnF, Hormone receptor; PSMC5 domains: CC, OB, AAA domain + lid; interfaces iii and iv. (F, G) labeled residues M453, I280, I401, L134, M138; motif sequences DPLVSLMMVE and MSIKKLWK. (H) Motif iii mutated; Motif iv mutated.]
Figure 4. Verification of interface predictions for TRIM37-PNKP and ESRRG-PSMC5.
(A) Schematic of the domain architecture of PNKP and TRIM37 with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT for the
respective interface. Roman numbering refers to structural models in (B) and (C). (B) Structural model of interface i shown in (A) with labeled residues that were mutated.
(C) Structural model of interface ii shown in (A). (D) BRET titration curves are shown for wildtype interaction and mutants for two biological replicates, each with three
technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements,
respectively. The BRET trajectory could not be fitted because of an unusual saturation behavior (see methods for details). (E) Schematic of the domain architecture of
ESRRG and PSMC5 with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT for the respective interface. Roman numbering refers to
structural models in (F) and (G). (F) Structural model of interface iii shown in (E) with labeled residues that were mutated. (G) Structural model of interface iv shown in
(E). (H) BRET titration curves are shown for wildtype interaction and mutants of ESRRG-PSMC5 pairs for two biological replicates, each with three technical replicates.
Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. In panels
(B), (C), (F), and (G) motif sequences are indicated at the bottom. Gray letters indicate residues not predicted to bind. Source data are available online for this figure.
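The titration curves described in (D) and (H) plot BRET against the acceptor-to-donor expression ratio. A common way to summarize such curves is a one-site saturation model, BRET(x) = BRETmax · x / (BRET50 + x). The sketch below fits that model by least squares in plain Python. Whether this matches the fitting procedure in Methods is an assumption, and the function name `fit_saturation`, the grid search, and the synthetic data are illustrative only.

```python
def fit_saturation(ratios, bret, k_grid=None):
    """Least-squares fit of bret = bret_max * x / (bret50 + x).

    For a fixed bret50 the model is linear in bret_max, so bret_max has a
    closed-form solution; bret50 is found by a simple grid search.
    The default grid (hypothetical choice) spans up to twice the largest ratio.
    """
    if k_grid is None:
        x_max = max(ratios)
        k_grid = [x_max * i / 1000 for i in range(1, 2001)]
    best = None
    for k in k_grid:
        f = [x / (k + x) for x in ratios]          # saturation term per point
        denom = sum(v * v for v in f)
        if denom == 0:
            continue
        bmax = sum(y * v for y, v in zip(bret, f)) / denom
        sse = sum((y - bmax * v) ** 2 for y, v in zip(bret, f))
        if best is None or sse < best[0]:
            best = (sse, bmax, k)
    _, bret_max, bret50 = best
    return bret_max, bret50


# Synthetic, noise-free example (invented values, not data from this study).
ratios = [0.1, 0.2, 0.5, 1.0, 2.0, 4.0, 8.0]
signal = [0.3 * x / (1.0 + x) for x in ratios]
bret_max, bret50 = fit_saturation(ratios, signal)
```

A curve that does not saturate within the measured range, as noted for panel (D), would leave BRET50 poorly constrained in such a fit, which is one reason to report it as not fitted.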
[Figure 5 graphic, panels A–L, not reproduced. Recoverable labels: (A) STX1B domains: syntaxin domain, SNARE domain; FBXO28 domains: Fbox, helix bundle, extended helix; VAMP2: synaptobrevin; interfaces i–iv. (D) labeled residues G226, V216, S239. (G) PEX3 domains: TM, Helical domain; PEX19: PMP-binding; PEX16: TM domain; interfaces v–vii. (I) PEX3, PEX19, PEX16; helices H1–H6. (J) labeled residues W189, R54, K169, E272.]
disordered fragments of PEX19 paired with the PEX3 domain resulted in highly confident models for interfaces involving a binding pocket on PEX3 that is distal to the pocket where the N-terminal PEX19 motif is known to bind. When using a protein fragment that spans the full disordered N-terminal region of PEX19 (1–170), AF predicts the known PEX3-binding motif and helix 4 and 5 to dock into the primary and secondary pocket, respectively (Fig. 5G,I), supporting simultaneous interaction via both interfaces.

While the interaction between PEX3 and PEX16 has been described before, little is known about how both proteins interact with each other. The monomeric AF model of PEX16 shows a helical fold, which could in its entirety be transmembrane (TM).
Figure 5. Verification of interface predictions for STX1B-FBXO28, STX1B-VAMP2, PEX3-PEX19, and PEX3-PEX16.
(A) Schematic of the domain architecture of STX1B, FBXO28, and VAMP2 with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT
(for order-disorder fragment pairs) or average interface pLDDT (for ordered-ordered fragment pairs) for the respective interface. Roman numbering refers to structural
models in (B), (D), Appendix Fig. S8E, and Appendix Fig. S8I. (B) Structural model of interface iv shown in (A). In panel (B) and (D), the chains are color-coded according
to the colors of the domains in (A). (C) BRET titration curves are shown for wildtype interactions and deletion constructs for two biological replicates, each with three
technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements,
respectively. (D) Structural model of interface iii shown in (A) with tested pathogenic mutations labeled and colored in green. (E, F) BRET titration curves are shown for
wildtype interactions and deletion constructs for two biological replicates, each with three technical replicates. Protein acceptor over protein donor expression levels are
plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. (G) Schematic of the domain architecture of PEX3, PEX19, and PEX16
with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT for the respective interface. Roman numbering refers to structural models
in (I), (J), and Appendix Fig. S9A. Region vi covers residues 1–170, which includes the previously reported N-terminal motif as well as three putative motifs suggested by
the AF models. (H) BRET titration curves are shown for wildtype interaction and mutants of PEX3-PEX19 pairs for three technical replicates. Protein acceptor over protein
donor expression levels are plotted on the x-axis determined from fluorescence and luminescence measurements, respectively. The left plot displays mutants aimed at
disrupting binding between PEX3-PEX19 while the right plot displays mutants aimed at disrupting the PEX3-PEX16 PPI, for which binding between PEX3-PEX19 should not be
altered. (I) Superimposition of structural models of interface vi (PEX3-PEX19) and vii (PEX3-PEX16) on the PEX3 domain. Note that modeling smaller fragments of PEX19
generates alternative interactions with the binding sites. (J) Structural model of interface vii shown in (G). (K) BRET values with subtracted bleedthrough for PEX3-PEX16
wildtype and various mutated constructs. Three technical replicates are shown. (L) Proposed model for how the trimeric complex of PEX3, PEX19, and PEX16 might
assemble at the peroxisomal membrane. Source data are available online for this figure.
Between the putative TM helix 4 and 5 there is a large loop (132–214), which was predicted by AF with very high confidence to bind to a third pocket on the PEX3 domain, opposite to both binding sites mentioned earlier for PEX19 (Fig. 5G,I,J). Of note, different fragments of this loop as well as the entire PEX16 were repeatedly predicted to bind in similar modes to PEX3, further increasing the confidence in this prediction. Encouraged by these results, we submitted all three full length PEX sequences for complex prediction to AF and obtained a model that supports simultaneous binding of PEX16 and PEX19 to PEX3 (Appendix Fig. S9E). We individually mutated two residues in the PEX16 loop, deleted the loop in its entirety (del162-192), and mutated two residues on PEX3 (highlighted in Fig. 5J). Unfortunately, higher expression levels of PEX16 seem to trigger degradation of PEX3 (Appendix Fig. S9F), which we did not observe for the same constructs when co-expressed with PEX19 (Appendix Fig. S9G). As a consequence, we could not obtain titration curves and BRET50 estimates but obtained reliable BRET signals for lower PEX3-PEX16 DNA transfection ratios, showing that the deletion as well as both PEX3 mutants significantly decreased binding to PEX16 (Fig. 5K; Appendix Fig. S9H). Of note, these PEX3 mutants (R54S and E272R) did not alter binding to PEX19, showing that the overall structural integrity of PEX3 was not perturbed by these mutations (Fig. 5H; Appendix Fig. S9D).

PEX3 and PEX19 are peroxin proteins that regulate peroxisome homeostasis. PEX16 is believed to serve as an integral membrane-bound receptor for PEX3 (Matsuzaki and Fujiki, 2008) while PEX3 is thought to serve as a docking site for PEX19 (Fujiki et al, 2006). PEX19 in turn is a cytosolic carrier for peroxisomal membrane proteins to the peroxisome (Fujiki et al, 2006). Combining results from previously published functional studies with the structural and experimental results obtained in this study, a model for a trimeric complex between PEX3, PEX19, and PEX16 emerges (Fig. 5L) where PEX16 fully inserts into the peroxisome membrane via a fold that consists of seven helices (residues 19–286) with its N-terminal end being cytosolic and its C-terminal end protruding into the peroxisome. The extended loop between TM helix 4 and 5 reaches into the cytosol and docks onto PEX3, which is further anchored into the peroxisomal membrane via its N-terminal TM helix (residues 13–45). PEX19 docks onto PEX3, opposite to where PEX16 is bound, via two interaction surfaces: one corresponding to the known PEX3-binding motif in PEX19 and a second one corresponding to a novel motif (residues 99–146) docking at a hitherto unknown second binding site on PEX3 for PEX19. This model explains how PEX3 is anchored to the peroxisomal membrane via PEX16 and how PEX3 can bind PEX19 very tightly, which can then deliver PMPs to the peroxisome. Mutations in any of the three PEX proteins are associated with severe developmental phenotypes referred to as peroxisome biogenesis disorders (Fujiki et al, 2022). The vast majority of the around 150 mutations annotated for the three proteins are uncharacterized (Henrie et al, 2018), dozens of which fall into the predicted interfaces. The structural models obtained from this work can inform future studies aimed at characterizing the effects of these mutations.

SNRPB and GIGYF1

AF predicted two different types of interfaces with high confidence for the interaction between SNRPB and GIGYF1. The first interface involves the LSM domain of SNRPB, which was predicted to bind to various fragments in the long disordered regions of GIGYF1 (Fig. 6A). These regions do not display any common sequence pattern. The structure of SNRPB has been resolved as part of the Sm ring complex that binds small nuclear RNA (PDB:4WZJ, (Leung et al, 2011)), showing that the surface on the LSM domain predicted to bind to disordered fragments of GIGYF1 is actually engaged in binding LSM domains of other Sm proteins within the complex (Fig. 6B). We thus conclude that these predictions are likely wrong. The second type of interface predicted by AF involves the GYF domain in GIGYF1 and multiple short disordered fragments in the C-terminal region of SNRPB, which repeatedly carry the sequence PPPGM(R) (Fig. 6A,C). We designed various deletion constructs of SNRPB that would gradually remove more and more of the repeated proline-rich motif. We observed, using the BRET assay, that these deletion constructs gradually decreased binding to GIGYF1 (Fig. 6D; Appendix Fig. S10A,B). We also mutated the GYF domain pocket and found that W498E but not L508F would decrease binding to SNRPB (Fig. 6D,E; Appendix Fig. S10A–D). To further corroborate these findings we performed a co-immunoprecipitation experiment, where endogenous GIGYF1 interacted with HA-tagged full length SNRPB (Fig. 6F). This
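[Figure 6 graphic, panels A–G, not reproduced. Recoverable labels: (A) SNRPB domain: LSM; GIGYF1 domain: GYF; interfaces i and ii. (B) interface ii and PDB 4WZJ. (C) interface i; labeled residues W498, L508; motif sequence PPPGMRPPRP. (F) immunoblot lanes 5% Input and HA-IP; detected proteins HA (SNRPB), GIGYF1, GAPDH (control); size markers in kDa. (G) motif sequence RPPPGLTN (7RUQ).]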
Figure 6. Verification of interface predictions for SNRPB-GIGYF1.
(A) Schematic of the domain architecture of SNRPB and GIGYF1 with indication of top predicted interfaces. Numbers in blue indicate the motif interface pLDDT for the
respective interface. Roman numbering refers to structural models in (B) and (C). (B) Structural model of interface ii shown in (A) (left) and in comparison a solved
structure (PDB:4WZJ) of the Sm ring complex (right) bound to RNA (orange). The LSM domain of SNRPB is shown in cyan. The position of the predicted motif (left) or
neighboring LSM domain of SNRPD3 (right) are indicated in gold. Black circles indicate the predicted interface in the model and corresponding interface in the complex on
the LSM domain of SNRPB. (C) Structural model of interface i shown in (A) with tested domain mutations labeled and colored green. The motif sequence is indicated at the
bottom. (D, E) BRET titration curves are shown for wildtype interactions, deletion constructs of SNRPB, and single point mutants in GIGYF1 for two biological replicates,
each with three technical replicates. Protein acceptor over protein donor expression levels are plotted on the x-axis determined from fluorescence and luminescence
measurements, respectively. (F) Cropped immunoblot of input (5%) and HA antibody immunoprecipitation (IP) performed in parental HEK cells (empty, untagged
negative control), Snrpb(full-length, 1-231)-2xHA-mNeonGreen, Snrpb(1-190)-2xHA-mNeonGreen expressed from a single locus in Flp-In™ T-REx™ 293 Cell Lines. The HA
antibody was used for detecting the immunoprecipitated Snrpb-proteins, endogenous GIGYF1 was detected with GIGYF1 antibody, GAPDH serves as a loading and
negative-IP control. The experiment was performed twice with equivalent outcome, one representative experiment is shown. (G) Solved structure (PDB:7RUQ) of the GYF
domain of GIGYF1 bound to a proline-rich motif in TNRC6C. The sequence of the motif in TNRC6C is indicated. Source data are available online for this figure.
interaction appeared less pronounced upon truncation of the C-terminal proline-containing region of SNRPB (Fig. 6F). This further suggests that both proteins interact with each other in cells and that this interaction is stabilized by the predicted interface.

During the course of these studies, a structure was published (PDB:7RUQ, Sobti et al, 2023) showing binding of the GYF domain of GIGYF1 to a motif of sequence PPPGL of the protein TNRC6C, confirming the binding mode predicted by AF where a hydrophobic residue (M or L) inserts into a hydrophobic pocket and where the proline residues contact the surrounding domain surface (Fig. 6C,G). Interestingly, this hydrophobic pocket does not exist in the previously solved structure of the GYF domain of CD2BP2 binding to a proline-rich peptide that is flanked by positively charged residues establishing important contacts with the domain (PDB:1L2Z, (Freund et al, 2002)). This structure formed the basis for the definition of the LIG_GYF motif class in the ELM DB. The recently resolved structure of the GYF domain of GIGYF1 together with our structural models and experimental validations argue for an extension of the existing motif definition or definition of a new motif subclass.

Discussion

AF has revolutionized the field of structural bioinformatics and has sparked much excitement about its potential to predict structures of
interacting proteins and bringing us closer to a structurally resolved protein interactome. However, from existing studies it largely remained unclear whether AF’s performance depends on the type of interfaces and the length of submitted protein chains for interface prediction, which metrics perform best in identifying likely correct structural models of interfaces, how specific AF predictions are, and to which extent highly confident structural models can be experimentally corroborated. In this study, we showed that AF performs similarly well for interfaces between folded domains and interfaces formed between a folded domain and a short linear motif. Using minimal interacting regions for interface prediction we reached sensitivities of up to 80%, similar to previously published work (Tsaban et al, 2022; Johansson-Åkhe et al, 2021). We thoroughly investigated AF’s FPR using random domain-motif pairs and found it to be around 20%. However, asking AF to discriminate binders from non-binders when motif sequences carried one disruptive mutation, we found that prediction accuracies were close to random. This points to an important limitation in AF’s ability to predict binding specificities and is in line with previous reports on AF’s inability to predict the effect of mutations (Buel and Walters, 2022). Comparison of different metrics to discriminate good from bad structural models using either minimal interacting fragments or extensions revealed the average interface pLDDT for DDI models and the motif interface pLDDT for DMI models to be the most robust and best performing metrics. However, when manually inspecting AF predictions we found it useful to also consider AF’s model confidence, suggesting that in the future a combination of different metrics might be even more powerful to discriminate good from bad structural models. The alignment depth has been previously reported to somewhat influence model accuracy (Bryant et al, 2022). While this feature was not investigated here, it might serve as a pre-filter to identify PPIs of high conservation for which structural modeling will likely be more successful. Interestingly, the number of residues or atoms predicted to be in contact with each other was poorly predictive, in contrast to a previous report (Bryant et al, 2022), confirming our observations that the tested AF versions in this study will always put both chains in contact with each other to create atomic contacts, and from visual inspection alone it is very challenging to tell good and bad structural models apart. Of note, observed differences in AF performance across studies likely originate both from using different benchmark datasets and different AF versions. Our study is unique in that it assesses

disordered regions generally decrease AF prediction accuracies as also reported in a recent preprint (preprint:Bret et al, 2023). Furthermore, optimal cutoffs for various metrics such as the model confidence decreased when using longer protein fragments, making them less robust for interface prediction with AF. When evaluating performance differences for longer and shorter protein fragments we identified three DMI pairs involving the motif classes DEG_APCC_KENBOX_2, LIG_Pex14_3, and LIG_GYF, for which, during fragment extension, a second known motif occurrence was added to the fragment. This second motif was selected by AF during interface prediction, displacing the original motif and leading to a high RMSD score. We removed these instances from the dataset when evaluating AF’s performance on fragment extension but they point to biologically correct variability in AF prediction outcomes due to existing multivalency of many DMIs in protein interactions. Other work suggested that AF is able to select the stronger binder among two motif occurrences (Chang and Perez, 2023), which might at least in some cases guide AF motif selections. However, in other cases this motif preference might also hinder discovery of multivalency in PPIs. For example, the use of smaller protein fragments for the protein pair SNRPB and GIGYF1 enabled the discovery of a proline-rich repeat motif in SNRPB.

In comparison to predictions made using full length proteins (Burke et al, 2023) we found that protein fragmentation increased the probability of obtaining a high confidence interface prediction, especially for cases involving proteins with long disordered regions such as GIGYF1. For smaller and more globular proteins like the PEX proteins studied above, full length predictions can identify the right binding sites but these can be further substantiated by running additional predictions with smaller fragments. The fragmentation approach increases the number of prediction runs per protein pair from one to a couple hundred, depending on the length and modularity of both proteins. The vast majority of these fragment pairs should not interact. With an FPR of 20%, this means that more actual non-interacting than truly interacting fragment pairs will result in a high confidence prediction. A big challenge is thus to identify likely correct interface predictions among the many false ones. This is also illustrated by the prediction results that we obtained for the seven protein pairs that we followed up experimentally. Clearly, AF’s general limited specificity contributes to these false predictions. We observed that additional sources of error can arise from exposed intramolecular binding sites resulting
multiple metrics on two different classes of interfaces, DMIs and from fragmentation, incorrectly designed boundaries of folded
DDIs, using two different AF versions. More work is needed to regions, and docking of protein fragments into enzymatic pockets
develop benchmark datasets of coiled-coil and disorder-disorder of metabolic enzymes or sites for metal ion, DNA, or RNA binding.
interfaces to also evaluate AF’s performance for these modes of It seems that AF is overall well suited to find binding pockets on
binding. Of note, our benchmark datasets almost exclusively folded domains. However, our work also clearly demonstrates that
consisted of structures that AF has seen in the training process. AF is able to correctly dock the matching partner structure into
Interestingly, benchmark studies done with unseen structures these pockets without the need for a pre-existence of both partner
reported similar sensitivities (preprint:Bret et al, 2023) indicating structures in the bound conformation contrary to other state-of-
that AF is not strongly biased towards structures it has seen before. the-art docking algorithms. AF’s high sensitivity with respect to
We extensively explored the influence of protein fragment intramolecular binding sites and wrongly fragmented folded
length on AF’s performance and found that slight extensions of regions will make it particularly hard to fully automate the
minimal motif sequences can improve prediction accuracies. fragment design process. Despite these challenges we found that
Inspection of individual cases revealed novel information on recurrent interface predictions from overlapping fragments can
important motif sequence context that was so far missing in help gain confidence in predictions, as also highlighted in a recent
corresponding motif entries at the ELM DB. However, longer study (Bronkhorst et al, 2023), since we rarely observed this
disordered fragments or fragments containing ordered and large recurrence for likely wrong predictions.
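The argument that a 20% FPR yields more false than true high confidence predictions can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the function name, the number of fragment pairs (200), and the number of truly interacting fragment pairs (2) are assumptions chosen for the example, while the 80% sensitivity and 20% FPR are the approximate values reported in the text.

```python
def expected_confident_predictions(n_pairs, n_true, sensitivity, fpr):
    """Expected counts of true and false high-confidence fragment-pair predictions.

    n_pairs     -- total number of fragment pairs submitted for one protein pair
    n_true      -- number of fragment pairs that truly contain the interface
    sensitivity -- fraction of true pairs that receive a high-confidence model
    fpr         -- fraction of non-interacting pairs that receive a high-confidence model
    """
    n_false = n_pairs - n_true
    expected_tp = n_true * sensitivity
    expected_fp = n_false * fpr
    total = expected_tp + expected_fp
    # Fraction of high-confidence predictions that are expected to be correct.
    precision = expected_tp / total if total > 0 else 0.0
    return expected_tp, expected_fp, precision


# Hypothetical protein pair: 200 fragment pairs, 2 of which truly interact.
tp, fp, precision = expected_confident_predictions(
    n_pairs=200, n_true=2, sensitivity=0.8, fpr=0.2
)
print(f"expected true positives:  {tp:.1f}")
print(f"expected false positives: {fp:.1f}")
print(f"expected precision:       {precision:.3f}")
```

Under these assumed numbers, false high-confidence predictions outnumber correct ones by more than an order of magnitude, which is why additional evidence such as recurrence across overlapping fragments is useful.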
© The Author(s) Molecular Systems Biology Volume 20 | Issue 2 | February 2024 | 75 –97 87
Molecular Systems Biology Chop Yan Lee et al
Given the reported uncertainties in AF predictions, even for high confidence cutoffs, experimental validation is essential. The BRET assay used here has been shown in previous studies to be sensitive enough to quantify weakening of binding introduced by point mutations and to detect motif-mediated PPIs (Ebersberger et al, 2023; Trepte et al, 2018; Mo et al, 2022). Using the BRET assay, we were able to detect 11 out of 28 PPIs from the HuRI dataset. This retest rate is actually higher compared to retest rates of gold standard PPI datasets used in the past to benchmark various binary PPI assays including this BRET assay, attesting the overall detectability of PPIs from HuRI (Braun et al, 2009; Trepte et al, 2018; Choi et al, 2019). The NL and mCit fusions used in the BRET assay allowed us to monitor the expression levels of wildtype and mutant constructs, which is important to rule out loss of binding because of a destabilization of the protein. However, we cannot exclude the possibility that some expressed mutants might still be partially unfolded or mislocalized and thus, some loss of binding detected in our study could be unspecific and not the result of a specific perturbation of the predicted interface. Furthermore, preservation of binding observed for some other mutants at the predicted interface might result from the mutations not being disruptive enough and thus, do not necessarily disprove the predicted interface.
Despite these limitations, we were able to assess the validity of seven interface predictions using experimentation. We discovered a likely novel DMI type that mediates binding between PEX3 and PEX16, and proposed a model for how PEX3, PEX16, and PEX19 form a trimeric complex at the peroxisomal membrane. We also validated a variation of the LIG_GYF motif class in SNRPB that mediates binding to GIGYF1 thereby potentially connecting mRNA splicing with posttranscriptional control mechanisms. These results confirm in principle that AF is able to predict novel interface types and that it can be used to extend existing interface type definitions. However, our experimental results also highlight clear limitations of AF predictions. Our data suggests that FBXO28 and STX1B as well as STX1B and VAMP2 interact via coiled-coil interfaces but likely at higher stoichiometries and different conformations than predicted. We confirmed the binding pocket in ESRRG but not the predicted interfaces in PSMC5 and we could not substantiate interface predictions for TRIM37 and PNKP. Highly confident interface predictions were obtained for seven additional PPIs that await experimental validation. In summary, we provided experimental evidence and structural information for PPIs whose disruption is likely associated with neurodevelopmental disorders. This information can be explored in future studies aimed at delineating potential molecular mechanisms causing disease. Our study furthermore laid out clear limitations, perspectives, and future needs in AI-based structure prediction to bring us closer to a fully structurally annotated human protein interactome.

Methods

Selection of structures for DMI benchmark dataset
To gather a list of ELM classes with structural evidence and annotate their minimal interacting fragments, we downloaded a dataset of solved structures of all ELM classes from ELM DB on 08.10.2021 (ELM class version 1.4) for instances that are annotated as true positives (Kumar et al, 2022). The structures were subject to a series of manual inspections to check their validity for further analysis. First, since AlphaFold can only model the 20 standard amino acids, we excluded any structures with post-translational modifications in the motif. Second, structures that do not resolve all of the residues in a motif as curated by ELM DB were excluded. Third, we restrict our studies to only binary interactions, so DMIs that require more than two proteins to form the binding interface were excluded. Likewise, DMIs with only intramolecular interaction evidence were excluded. We manually annotated the boundaries of the domains by visual inspection of the structures. After this filtering, we identified 136 structures from distinct ELM classes that formed our DMI benchmark dataset (Dataset EV2).

Sequence identity of the domains in the DMI benchmark dataset
We took all the binding domains in the DMI benchmark dataset and computed their pairwise sequence identity from a global alignment without gap penalties. Matching residues were given a score of 1, otherwise 0. The sum of these scores was divided by the length of the longer sequence to compute the sequence identity.

Selection of structures for the DDI benchmark dataset
We randomly selected 80 pairs of Pfam domain types that were described in the 3did resource (Mosca et al, 2014) to be in contact with each other in solved structures in the Protein Data Bank (PDB). We manually inspected all PDB entries listed to contain contacts between instances of a given Pfam domain pair until we found one that we considered a genuine domain-domain interaction. These decisions were primarily based on the number of atomic contacts observed and the validity that two folded domains were interacting with each other. Out of the 80 selected Pfam domain pairs, we identified 48 DDI types and 48 corresponding approved DDI structural instances that we selected for the DDI benchmark dataset. The sequences of the minimal interacting domain regions were manually annotated by visual inspection of the structures and used for prediction. A more detailed description of the curation procedure and information on the pairs will be soon published elsewhere (Geist et al, in preparation).

Generation of random reference sets with minimal interacting regions

Mutating motif sequences
Key conserved residues of the motifs in the DMI benchmark dataset were identified computationally using the regular expression of the corresponding ELM class in the ELM DB and SLiMSearch (Krystkowiak and Davey, 2017). The defined positions are any positions in the regular expression that are not wildcards. To mutate the key residues to the ones with opposite physico-chemical properties, we substituted one or two key residues with the ones that are of the largest Miyata distance (Miyata et al, 1979) (Dataset EV2).
Randomizing pairings of known domain-motif interfaces
To simulate non-binding domain-motif pairs, we randomized the pairings of known domain motif interfaces. As some domain types can bind to motifs from distinct ELM classes, we manually checked
that the randomized pairings did not coincide with actual domain-motif interface types (Dataset EV2).

This resulted in 563 and 540 predictions from the positive reference set extensions for AF v2.2 and v2.3, respectively.
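The randomization of domain-motif pairings, including the check that no randomized pair coincides with a known interface type, can be sketched as a simple rejection-sampling procedure. This is only an illustration of the idea: the check described in the text was done manually, and the function name, its parameters, and the placeholder motif names below are assumptions made for the example.

```python
import random


def randomize_pairings(pairs, known_interface_types=None, seed=0, max_tries=1000):
    """Reshuffle motifs across domains so that no shuffled pair is a known interface type.

    pairs                 -- list of (domain_type, motif_class) tuples from the positive set
    known_interface_types -- set of (domain_type, motif_class) tuples to avoid;
                             defaults to the input pairs themselves
    """
    known = set(pairs) if known_interface_types is None else set(known_interface_types)
    domains = [domain for domain, _ in pairs]
    motifs = [motif for _, motif in pairs]
    rng = random.Random(seed)
    for _ in range(max_tries):
        shuffled = motifs[:]
        rng.shuffle(shuffled)
        candidate = list(zip(domains, shuffled))
        # Reject the whole shuffle if any pair matches a known interface type.
        if all(pair not in known for pair in candidate):
            return candidate
    raise RuntimeError("no valid randomization found within max_tries")


# Placeholder motif names; the SH3 domain is listed twice to mimic a domain
# type that binds motifs from two distinct classes.
positive_pairs = [
    ("SH3", "motif_A"),
    ("SH3", "motif_B"),
    ("PDZ", "motif_C"),
    ("WW", "motif_D"),
    ("FHA", "motif_E"),
]
for domain, motif in randomize_pairings(positive_pairs, seed=1):
    print(domain, motif)
```

Passing an explicit `known_interface_types` set allows known domain-motif combinations that are not part of the positive set to be excluded as well.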
Randomizing pairings of known domain-domain interfaces
The pairings between known domain-domain interfaces were randomized to form the random reference set for DDIs.

Generation of positive DMI reference set with fragment extensions
Among the 136 solved structures that we selected previously, we further filtered for structures that consist of only human proteins. To test the potential effect of extension on DMIs that were predicted with different accuracies in their minimal forms, we selected 12 DMI types from the correct sidechain category, 8 DMI types from the correct backbone category and 11 DMI types from the correct pocket category as determined using the motif RMSD calculation. In total, 31 DMI types were selected for extension. Three additional DMI types were originally selected but later on discarded because they contained secondary motif occurrences complicating data analysis. The extensions were done on the canonical sequence of the proteins used to solve the structure. Motif extension 1 extended the motif sequence at both N and C termini by n residues where n is the length of the known motif. Motif extension 2 further extended the motif sequence by another n residues at both termini. Motif extension 3 and 4 each extended the motif sequence by 2n residues at both termini. Motif extension 5 extended the motif sequence by including neighboring domains and motif extension 6 used the full-length protein sequence. On the domain side, domain extension 1 extended the domain sequence to include the disordered regions N- and C-terminally of the binding domain until it reached neighboring domain(s) boundaries. Domain extension 2 included the sequence region of the neighboring domains and domain extension 3 used the full-length protein sequence. In cases where the known motif or binding domain is at the C terminus, we extended the motif or domain sequence on only the N terminus and vice versa. There were some cases where the last extension steps, motif extension 6 and domain extension 3, extended the protein minimally (<20 residues N or C terminal to the previous extension step). These cases were excluded from the analysis. The dataset of extended DMIs is in Dataset EV5. In total, 709 fragment pairs were submitted to AlphaFold. From these, 632 and 616 were successfully modeled by AF v2.2 and v2.3, respectively.

Selection of reference datasets for comparison of AF v2.2 with v2.3
All predictions for the minimal DMIs and the random DMIs involving minimal fragments were successfully modeled by both versions of AF. Some extensions from the positive reference set were not successfully modeled by AF v2.2 and v2.3 due to failure from HHblits. To compare AF v2.2 with v2.3, we used only predictions that were successfully modeled by both versions of AF. This resulted in 616 predictions from the extensions of the positive reference set.

Evaluation of AF sensitivity and specificity when using the fragmentation approach
Among the 34 DMIs selected for extension, we further selected 20 DMIs and retrieved the PPIs mediating these DMIs as the PRS and randomized their pairing to form random domain-motif protein pairs as the RRS. The 20 PPIs from the PRS and the 20 protein pairs from the RRS were subjected to the fragmentation approach, generating 8943 fragment pairs and 11,045 fragment pairs for the PRS and RRS, respectively. All fragment pairs from the PRS and all but one fragment pair from the RRS resulted in an AlphaFold model. Models were deemed highly confident, if the disordered fragment had a motif interface pLDDT of ≥70 or, in case of ordered-ordered models, the average interface pLDDT scored ≥70. To evaluate the sensitivity of the fragmentation approach, we considered all models that met the above mentioned cutoffs and which contained the motif and domain sequence. We superimposed the models onto the corresponding native structures using the minimal domain and computed the RMSD between the minimal motif residues in the native and modeled structure. A model was deemed accurate if the motif RMSD was ≤5 Å. At this cutoff the backbone of the native and modeled motif are well aligned but not necessarily their side chains (see also RMSD subsection below). We repeated the same procedure for each DMI protein pair using full length sequences as input into AF for modeling. In 18 cases AF did not return a model when using full length sequences. Here, we used the largest protein fragments instead for which AF returned a model. Information on the protein pairs, prediction results, and statistics is available in Dataset EV9.
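The confidence and accuracy cutoffs used for the fragmentation approach (interface pLDDT ≥70 and motif RMSD ≤5 Å) can be expressed as two small predicates. The sketch below is a hedged illustration, not the analysis code of the study: the dictionary keys, the function names, and in particular the pair-level definition of sensitivity (a protein pair counts as recovered if at least one of its fragment-pair models is both highly confident and accurate) are assumptions made for the example.

```python
def is_highly_confident(model, plddt_cutoff=70.0):
    """Apply the pLDDT cutoff: motif interface pLDDT for models with a disordered
    fragment, average interface pLDDT for ordered-ordered models."""
    if model["has_disordered_fragment"]:
        return model["motif_interface_plddt"] >= plddt_cutoff
    return model["average_interface_plddt"] >= plddt_cutoff


def is_accurate(model, rmsd_cutoff=5.0):
    """A model counts as accurate if its motif RMSD (in Å) is at most the cutoff."""
    return model["motif_rmsd"] <= rmsd_cutoff


def pair_level_sensitivity(models_by_pair):
    """Fraction of protein pairs with at least one confident and accurate model.

    This pair-level definition is an assumption made for illustration.
    """
    if not models_by_pair:
        raise ValueError("no protein pairs given")
    recovered = sum(
        1
        for models in models_by_pair.values()
        if any(is_highly_confident(m) and is_accurate(m) for m in models)
    )
    return recovered / len(models_by_pair)


# Toy data with made-up values for two hypothetical protein pairs.
toy_models = {
    "pair_1": [
        {"has_disordered_fragment": True, "motif_interface_plddt": 82.0,
         "average_interface_plddt": 75.0, "motif_rmsd": 1.8},
    ],
    "pair_2": [
        {"has_disordered_fragment": True, "motif_interface_plddt": 90.0,
         "average_interface_plddt": 85.0, "motif_rmsd": 12.0},  # confident but inaccurate
        {"has_disordered_fragment": True, "motif_interface_plddt": 60.0,
         "average_interface_plddt": 65.0, "motif_rmsd": 2.0},   # accurate but not confident
    ],
}
print(pair_level_sensitivity(toy_models))
```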
Generation of random DMI reference set with fragment extensions
To generate a random reference set using the extensions, we randomized the pairings of the 34 DMI types that we selected for extensions and paired their extensions for prediction. Motif extension 6 and domain extension 3 were excluded from the pairing. The dataset of DMIs with random pairings and their extensions can be found in Dataset EV6. In total, 612 predictions were generated, among which 566 and 522 predictions were successfully modeled by AF v2.2 and v2.3, respectively. Since motif extension 6 and domain extension 3 were excluded from the random reference set using the extensions, we also excluded them from the positive reference set extensions during ROC analysis.

AlphaFold versions and runs
We used local installations of AlphaFold Multimer version 2.2.0 and 2.3.0 (preprint:Evans et al, 2021) for all protein complex predictions with the following parameters:
--max_template_date=2020-05-14
--db_preset=full_dbs
--use_gpu_relax=False
For every AlphaFold run, five models were predicted with single seed per model by setting the following parameter:
--num_multimer_predictions_per_model=1
The databases queried during AlphaFold predictions were specified following the instructions from the github page of AlphaFold
(https://github.com/deepmind/alphafold#running-alphafold):
For running AlphaFold Multimer v2.2, the following databases were queried:
--bfd_database_path=bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt
--mgnify_database_path=alphafold_v220_databases/mgy_clusters_2018_12.fa
--obsolete_pdbs_path=alphafold_v220_databases/pdb_mmcif/obsolete.dat
--pdb_seqres_database_path=alphafold_v220_databases/pdb_seqres/pdb_seqres.txt
--template_mmcif_dir=alphafold_v220_databases/pdb_mmcif/mmcif_files
--uniprot_database_path=alphafold_v220_databases/uniprot/uniprot.fasta
--uniclust30_database_path=alphafold_v220_databases/uniclust30/uniclust30_2018_08/uniclust30_2018_08
--uniref90_database_path=alphafold_v220_databases/uniref90/uniref90.fasta
For running AlphaFold Multimer v2.3, the following databases were queried:
--bfd_database_path=alphafold_v230_databases/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt
--mgnify_database_path=alphafold_v230_databases/mgnify/mgy_clusters_2022_05.fa
--obsolete_pdbs_path=alphafold_v230_databases/pdb_mmcif/obsolete.dat
--pdb_seqres_database_path=alphafold_v230_databases/pdb_seqres/pdb_seqres.txt
--template_mmcif_dir=alphafold_v230_databases/pdb_mmcif/mmcif_files
--uniprot_database_path=alphafold_v230_databases/uniprot/uniprot.fasta
--uniref30_database_path=alphafold_v230_databases/uniref30/UniRef30_2021_03
--uniref90_database_path=alphafold_v230_databases/uniref90/uniref90.fasta
To test the effect of template use on prediction accuracy, the following parameter setting was used to switch off the use of templates during the prediction:
--max_template_date=1950-01-01
For the fragmentation approach, the multiple sequence alignments (MSAs) of a given protein fragment can be reused in subsequent runs where the same fragment is involved. The MSAs were first moved to the prediction output folder and the following parameter was added to enable the reuse of MSAs.
--use_precomputed_msas=True
For efficient computing, we segregated the MSA generation part by using only the CPUs and the model fitting part using the GPUs.

Calculation of metrics for structural models

Motif RMSD
We used the software PyMOL (TM) Molecular Graphics System, Version 2.5.0. Copyright (c) Schrodinger, LLC., for the superimposition of AlphaFold models with corresponding solved structures. First, we used the align command to align the domain chain in AlphaFold models with the domain chain in the solved structure. Then, we used the rms_cur command to calculate the all-atom RMSD between the motif chain in AlphaFold models and the motif chain in the solved structure. To ensure that the RMSD calculation was done based on all atom identifiers and without any outlier rejection refinement, the arguments of the rms_cur command, matchmaker and cycles, were set to 0. Prediction accuracy categories were defined based on motif RMSD cutoffs: RMSD ≤ 2 Å for correct sidechain, between 2 Å and 5 Å for correct backbone, between 5 Å and 15 Å for correct pocket and >15 Å for wrong pocket.

DockQ
The calculation of DockQ scores of AlphaFold models was done in reference to their solved structures using the code available on the github repository of DockQ (https://github.com/bjornwallner/DockQ; Basu and Wallner, 2016). DockQ classification was done using the cutoffs provided by DockQ (DockQ: <0.23 for incorrect, between 0.23 and 0.49 for acceptable, between 0.49 and 0.80 for medium and ≥0.80 for high).

pDockQ
The calculation of pDockQ of AlphaFold models was done by adapting the code available on the github repository from the Elofsson lab (https://gitlab.com/ElofssonLab/FoldDock/-/blob/main/src/pdockq.py, (Bryant et al, 2022)). The pDockQ score is created by fitting a sigmoidal curve to the DockQ scores of a series of AlphaFold predicted models. The score takes into account the number of interface contacts as well as their pLDDT scores. Of note, the calculation of pDockQ score takes Cβs (Cα for glycine) from different chains within 8 Å from each other as interface contacts which is different from our interface definition (see the subsection below Domain chain and motif chain interface pLDDT and average interface pLDDT).

iPAE
The calculation of iPAE of AlphaFold models was done by adapting code available on the github repository https://github.com/fteufel/alphafold-peptide-receptors/tree/main (Teufel et al, 2023). The iPAE is the median predicted aligned error at the interface. The authors consider residues in contact if their distance is below 0.35 nm (3.5 Å). The iPAE score could not be calculated for models generated by AlphaFold Multimer version 2.3.0 due to JAX dependency of the pickle files generated by AlphaFold Multimer version 2.3.0.

Model confidence
The model confidence of AlphaFold models was extracted from the ranking_debug json file. The model confidence is a weighted combination of pTM and ipTM to account for both intra- and interchain confidence:
$$\text{model confidence} = 0.8 \times \text{ipTM} + 0.2 \times \text{pTM}$$

Domain chain and motif chain interface pLDDT and average interface pLDDT
Since AlphaFold conveniently stores the pLDDT confidence measure for each residue in the B-factor field of the output PDB files, the pLDDT of residues at the interface was parsed from the output PDB files of AlphaFold. Residues at the interface are defined as those that have at least one heavy atom that is less than 5 Å away from any heavy atom of the other chain (calculated using the
PyMOL API). The pLDDT of the residues at the interface from the domain chain and motif chain was averaged to compute the domain chain and motif chain interface pLDDT, respectively. The pLDDT of all the residues from both chains was averaged to compute the average interface pLDDT.

Residue-residue and atom-atom contacts
Following the interface definition above, the number of unique residue-residue and atom-atom contacts were also quantified as measurements to assess AlphaFold models.

Mean DockQ between predicted models
The top five models generated by AF, determined based on their model confidence, were considered for computing this metric. To quantify the similarity among the models, we computed DockQ scores between all possible pairs of models by taking the higher ranked model as the “template” model and lower ranked model as the “predicted” model. The mean of these DockQ scores is taken as the similarity among the models in a given prediction. This calculation was done for AF models of minimal DMIs and their randomizations for ROC analysis. The data were stored in Dataset EV2.

Quantification of motif properties

Motif hydropathy score and symmetry score
By referring to the Kyte-Doolittle hydrophobicity scale (Kyte & Doolittle, 1982), the hydropathy scores of the amino acids in a given motif were summed and averaged to compute the average hydropathy of the motif. The average motif symmetry score was computed by taking the sum of the absolute difference of hydropathy scores between motif position n and motif length - n + 1 and division of this sum by half of the motif length:
$$\text{Peptide symmetry score} = \frac{\sum_{n=1}^{a} \left| H_n - H_{x-n+1} \right|}{a}$$
where x is the length of the motif and a is the floor division of x by 2.

Motif probability
The motif probability reflects the degeneracy of a given motif class as quantified by its regular expression that is annotated in the ELM DB. The motif probability was retrieved from the ELM DB version 1.4.

Secondary structure elements of motifs
We extracted the secondary structure elements of motifs using the PyMOL API. In cases where the motif adopts partial secondary structure, such as loop-helix-loop or loop-strand-loop, they are treated as helical or strand, respectively.

Selection of motif classes from ELM DB without annotated structural instances and prediction with AF
By querying the ELM DB for all ELM classes, we retrieved a list of ELM classes and the number of instances with a structure solved (column #instances_in_PDB). We filtered for ELM classes with 0 instances_in_PDB and selected 205 instances out of the filtered ELM classes for AF prediction. The ELM instances were extended at both N and C termini by n residues where n is the length of the ELM instance, according to the benchmarking results. The minimal binding domains of the ELM instances were detected in the interaction partner using Pfam HMMs (Mistry et al, 2021). As the domain boundaries detected by Pfam HMMs could be inaccurate, we also extended the domain sequence at the N and C terminus by 20 residues to ensure that the whole folded region was covered. The predictions were performed using AF version 2.3.0. To select a subset of these motif classes, where we can do experimental testing, we also used the InParanoid resource (Persson & Sonnhammer, 2023) to map ELM instances where both proteins are from mouse to their human orthologs. To verify that they indeed do not have structural homologues in the PDB, we both used the SIFTS mapping (Dana et al, 2019) between the Pfam domain in ELM and the PDB and also looked at the ELM classes that were listed as homologs on the ELM website.

Evaluation of effect of fragment extensions on AF prediction accuracies
We superimposed the AF models generated with DMI extensions onto the corresponding solved DMI structures to quantify AF prediction accuracy using motif RMSD calculations. To this end, we aligned the two structures on their minimal binding domains and calculated the all-atom RMSD between the minimal motif in the extension AF model and the minimal motif in the solved structure. To determine potential differences in DMI prediction accuracy when using minimal versus extended protein fragments, we computed the log2 fold change of the all-atom motif RMSD before and after extension.
$$\text{Fold change in prediction accuracy} = \log_2\left(\frac{\text{all atom RMSD motif}_{\text{minimal DMI}}}{\text{all atom RMSD motif}_{\text{extended DMI}}}\right)$$

Fragment design and fragment pairing for fragmentation approach
We first inspected the monomeric structural models from the AlphaFold database (Varadi et al, 2022; Jumper et al, 2021) of both interacting proteins to determine the boundaries of their ordered and coiled-coil regions, which were also treated as “ordered”. All regions that were not annotated as ordered were annotated as disordered. In some cases, an extended loop with low pLDDT can be found within an ordered region. As they can also potentially carry a motif or mediate interactions in another way, these regions were also annotated as disordered in addition to their annotation as being part of a larger ordered region. The disordered regions of the proteins were fragmented into fragment sizes of 10, 20 and 30 residues. To allow AF to sample continuous sequences, we also generated another set of fragments of same sizes that overlap with the previous fragments by sliding the sequence by half the size of the fragment. The unfragmented disordered regions, as well as their fragments, from one protein were then paired with the ordered regions from its interacting partner and vice versa for prediction. The ordered regions from both proteins were also paired for prediction. We decided to manually define boundaries between ordered and disordered regions because testing available code developed for this purpose, like clustering using the PAE matrix,
turned out to be too inaccurate. We observed that erroneous removal of residues close to the domain borders that are still contributing to the folding of a structured domain, can heavily mislead AF predictions.

Selection of NDD proteins
A list of NDD genes was assembled using whole exome and whole genome sequencing studies of cohorts of NDD patients from Gene4Denovo (Zhao et al, 2020) and Deciphering Developmental Disorders (DDD) study (Firth et al, 2011), respectively. From Gene4Denovo, we selected genes linked to autism-spectrum disorders (ASD), intellectual disability (ID), epilepsy (EE), undiagnosed developmental disorders (UDD) and NDDs in general. Genes with non-coding mutations as well as genes with a false discovery rate (FDR) >= 0.05 were excluded. Similarly, in the DDD study, genes associated with developmental disorders with a neurological component, as well as genes found to be mutated in at least three children with NDDs (labeled as confirmed genes) were retained. The final list included 984 NDD-risk genes. We filtered the HuRI network (Luck et al, 2020) for interactions mediated exclusively by proteins from this NDD gene list resulting in 67 PPIs excluding self-interactions. Since our fragmentation approach generates many fragments, we did not consider PPIs involving proteins that are more than 1500 amino acids in length, resulting in a final list of 62 PPIs that were subjected to AF modeling.

Manual inspection of interface predictions for NDD-NDD PPIs and selection for experimental validation
Paired fragments from NDD-NDD PPIs were predicted using AF version 2.2 and the prediction results are stored in Dataset EV10. Based on our benchmarking results, we started by manually inspecting all NDD-NDD PPIs that obtained at least one structural model with either a motif chain interface pLDDT of ≥70 for the disordered fragment or with an average interface pLDDT ≥ 70 for structural models with predicted ordered-ordered interfaces (DDIs). However, during the course of these manual inspections, we found that additionally using a model confidence of ≥0.7 for ordered-ordered fragment pairs helped discriminate good from bad structural models. We inspected the ranked_0 models for all fragment pairs that met the above cutoffs but also inspected models scoring somewhat below these cutoffs. For every NDD-NDD PPI we used Interactome3D (Mosca et al, 2013) and PDB database searches (https://www.rcsb.org/ (Berman et al, 2000)) to identify whether a structure already existed for this PPI. In our evaluation of

Based on clone availability, we selected 49 of the 62 PPIs for experimental validation of the predicted interfaces using the BRET assay. For 30 of the 49 selected PPIs for experimental testing we obtained sequence-confirmed clones with luciferase and mCitrine fusions. For 28 of these PPIs both partners were expressed in our experimental system as determined by total luminescence and fluorescence measurements (Fig. 3D,F).

Software used
We used the software PyMOL (TM) Molecular Graphics System, Version 2.5.0. Copyright (c) Schrodinger, LLC., for the visualization and superimposition of AlphaFold models.
All code was written in Python3 and analyses were done using Jupyter notebooks. We used the Python libraries Biopython (Cock et al, 2009) for sequence similarity computation, pandas (McKinney, 2010) for data analysis, and Matplotlib (Hunter, 2007) and seaborn (Waskom, 2021) for data visualization. ROC and PR statistics were calculated using the Python package scikit-learn (Pedregosa et al, 2012).

Cell line culture and maintenance
HEK293 cells were purchased from DSMZ (catalog number ACC305). These cells were grown and maintained in DMEM (Thermo Fisher), supplemented with 10% FBS (PAN-Biotech), 2 mM glutamine (Thermo Fisher) and 1% penicillin–streptomycin (Thermo Fisher). Cells were incubated at 37 °C with 5% CO2. Subcultivation was performed with 1 ml of 0.05% trypsin every 2–3 days for up to 40 passages. For each passage 1–2 × 10^6 cells were seeded in T25 flasks (Sarstedt). Then, new cells were thawed from stocks containing 2 × 10^6 cells in 1 ml of growth medium, supplemented with 10% DMSO (Sigma). Every 3 months cells were checked for mycoplasma contamination using a PCR test (Dataset EV11). The cell line was purchased from DSMZ four years ago, expanded, aliquoted, and frozen. A new aliquot is thawed after every 40 passages. No further authentication of the cell line has been done.

Plasmid construction

Standard controls
The donor and acceptor vectors pcDNA3.1-cmyc-NL-GW (Addgene plasmid ID #113446), pcDNA3.1-GW-NL-cmyc (Addgene plasmid ID #113447), pcDNA3.1 GW-His3C-mCit, pcDNA3.1 mCit-His3C-GW as well as controls pcDNA3.1-NL-cmyc (Addgene plasmid ID #113442), pcDNA3.1-PA-mCit (Addgene plasmid ID #113443) were kindly provided by the
the structural models we also considered if a certain interface was Wanker Group (Max-Delbrück-Centrum für Molekulare Medizin,
recurrently predicted for different overlapping fragments because Germany) (Dataset EV12). By default we cloned all ORFs of
this usually hints at increased confidences for the correctness of the interest into N-terminal NL and mCit fusion destination vectors
interface prediction. We furthermore explored the number and and occasionally also transferred ORFs into C-terminal fusion
kind of residue-residue contacts predicted by AF by visual vectors if N-terminal fusions did not result in sufficient BRET
inspection of the structural models using PyMol. We searched for signals but the interaction was of high interest to this study and
functional annotations and existing structures for the monomers predicted interfaces were closer to the C-terminus. Trepte et al have
using the PDB, ProViz (Jehl et al, 2016), SMART (Letunic et al, shown that testing protein pairs in different configurations
2021), and the scientific literature to identify enzymatic pockets or increases detection rates while maintaining low false detection
binding interfaces for DNA, RNA, or metal ions. Observations and rates and that BRET signals are higher if fusions are close to the
justifications for the final evaluation of the predictions for every actual interaction interface (Trepte et al, 2018; preprint:Trepte et al,
NDD-NDD PPI are provided in Appendix Supplementary Text S1. 2021; preprint:Trepte et al, 2023).
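
The network filtering described above for the NDD-NDD PPI selection (both partners in the NDD gene list, no self-interactions, no protein longer than 1500 amino acids) can be illustrated with a minimal sketch. This is not the authors' code: the function name, the gene identifiers, and the toy lengths are hypothetical, and the real analysis operated on the HuRI network and the 984-gene list.

```python
from typing import Dict, Iterable, List, Set, Tuple

MAX_PROTEIN_LENGTH = 1500  # amino acids; PPIs with longer proteins were not modeled


def select_ppis_for_modeling(
    ppis: Iterable[Tuple[str, str]],
    ndd_genes: Set[str],
    protein_lengths: Dict[str, int],
    max_length: int = MAX_PROTEIN_LENGTH,
) -> List[Tuple[str, str]]:
    """Keep PPIs whose partners are both NDD genes, distinct, and short enough."""
    selected = []
    for gene_a, gene_b in ppis:
        if gene_a == gene_b:
            continue  # exclude self-interactions
        if gene_a not in ndd_genes or gene_b not in ndd_genes:
            continue  # interaction must be mediated exclusively by NDD proteins
        if protein_lengths[gene_a] > max_length or protein_lengths[gene_b] > max_length:
            continue  # very long proteins would generate too many fragments
        selected.append((gene_a, gene_b))
    return selected


# Toy data with hypothetical gene names (not taken from the study).
toy_ppis = [("GENE1", "GENE2"), ("GENE1", "GENE1"), ("GENE2", "GENE3"), ("GENE1", "GENE4")]
toy_ndd_genes = {"GENE1", "GENE2", "GENE4"}
toy_lengths = {"GENE1": 300, "GENE2": 900, "GENE3": 450, "GENE4": 2100}

selected_ppis = select_ppis_for_modeling(toy_ppis, toy_ndd_genes, toy_lengths)
print(selected_ppis)
```

In the toy data only the GENE1-GENE2 pair passes all three filters.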
GATEWAY cloning procedure
Full-length wild-type human open reading frames (ORFs) cloned in GATEWAY entry vectors from
the ORFeome Collaboration (ORFeome Collaboration, 2016) are stored as bacterial glycerol
stocks.

1. The ORFs were inoculated in 96-well plates (Corning), with each well containing 200 µl of
LB medium and 100 µg/ml ampicillin. The plate was incubated at 37 °C and left to shake
overnight at 190 rpm.
2. In a 96-well PCR plate (Brand) 10 ng of each selected ORF was used per 50 µl PCR reaction
(denaturation at 98 °C for 10 s, annealing at 55 °C for 30 s and extension at 72 °C for 3 min,
30 cycles of amplification) using Phusion high-fidelity polymerase (NEB) and primers annealing
to the backbone of the plasmid (forward: 5′TTGTAAAACGACGGCCAGTC and reverse:
5′GCCAGGAAACAGCTATGACC).
3. The PCR products (6 µl per well) were confirmed through 96-well E-gel with SYBR (Thermo
Fisher, Catalog no G720801) using 25 µl of loading buffer (Thermo Fisher) and 20 µl of E-Gel
96 High range DNA marker (Thermo Fisher).
4. In a 96-well PCR plate 1 µl of each amplified PCR product together with 200 ng of
above-mentioned destination vectors were directly used per 10 µl LR reaction using 4x LR
clonase (Invitrogen), thereby generating expression vectors.
5. The full 10 µl of LR reaction was transformed into chemically competent DH5α cells (30 µl)
in a 96-well PCR plate, then recovered in 80 µl of pre-warmed SOC medium at 37 °C for 1 h
without shaking.
6. 70 µl of transformed bacteria was plated on 48-well square agar plates and incubated at
37 °C overnight.
7. Afterwards, colonies were selected and inoculated into a 96 deep-well plate containing 2 ml
of LB medium and 100 µg/ml ampicillin. The plate was then incubated at 37 °C with continuous
shaking at 700 rpm in the incumixer for 24 h.
8. The amplified vectors were extracted from the inoculated culture using Plasmid Plus 96-well
Miniprep kit (Qiagen). The concentration of each vector was measured with a Nanophotometer and
diluted to 100 ng/µl. Next, 600 ng of insert was used for full-length sequencing using the
backbone primers (tag-specific NanoLuc forward: 5′GAACGGCAACAAAATTATCGAC, mCitrine forward:
5′AGCAGAATACGCCCATCG and reverse: 5′GGCAACTAGAAGGCACAGTC) and ORF-specific primers (Dataset
EV11) to fully cover the ORFs where it was needed (Dataset EV12). All sequence-confirmed ORF
sequences used in this study are available in Dataset EV13.

Site-directed mutagenesis
The primers were manually designed using the following criteria:

1. For a point mutation the primers should overlap the site of mutation. The overlap should be
15–20 nucleotides (nt).
2. For a deletion the primers should be designed to exclude the deletion site, but still
overlap, and the overlap should be as mentioned in step 1.
3. Primer length should be in the range of 32–36 nt.
4. GC content should be between 40 and 60%.
5. Difference in melting temperature of primers should not exceed 5 °C.
6. The primer ideally should start and end with guanine or cytosine.
7. The designed oligos were grouped by annealing temperature for the next step.
8. In a 96-well PCR plate 10 ng of DNA template together with oligos were used per 50 µl of
PCR reaction (denaturation at 98 °C for 2 min, annealing for 15 s and extension at 72 °C for
5 min, 25 cycles of amplification) using Phusion high-fidelity polymerase (NEB).
9. 1 µl of DpnI (NEB) was added to the plate with PCR products and incubated at 37 °C for 1 h.
The reaction was stopped at 65 °C for 20 min.
10. The PCR products (6 µl per well) were confirmed through 96-well E-gel with SYBR (Thermo
Fisher, Catalog no G720801) using 25 µl of loading buffer (Thermo Fisher) and 20 µl of E-Gel
96 High range DNA marker (Thermo Fisher).
11. 3 µl of digested PCR product was transformed into chemically competent DH5α cells (30 µl)
in a 96-well PCR plate, then recovered in 80 µl of pre-warmed SOC medium at 37 °C for 1 h
without shaking.
12. 70 µl of transformed bacteria was plated on 48-well square agar plates and incubated at
37 °C overnight.
13. Afterwards, colonies were selected and inoculated into a 96 deep-well plate containing
2 ml of LB medium and 100 µg/ml ampicillin. The plate was then incubated at 37 °C with
continuous shaking at 700 rpm in the incumixer for 24 h.
14. The amplified vectors were extracted from the inoculated culture with Plasmid Plus 96-well
Miniprep kit (Qiagen). The concentration was measured with a Nanophotometer and diluted to
100 ng/µl. Next, 600 ng of insert was used for full-length sequencing using primers covering
the mutation and ORF-specific primers (Dataset EV11) to fully cover the ORF length (Dataset
EV12).

BRET assay

Transfection
HEK293 cells were grown and maintained in high-glucose (4.5 g/l) DMEM (Thermo Fisher) for BRET
assays. Medium was supplemented with 10% fetal bovine serum (PAN-Biotech) and 1%
Penicillin/Streptomycin. Cells were grown at 37 °C, 5% CO2, and 85% RH. Cells were subcultured
every 2–3 days and transfected with Lipofectamine 2000 transfection reagent (Invitrogen) in
Opti-MEM medium (Thermo Fisher) using the reverse transfection method according to the
manufacturer’s instructions. For transfections, cells were seeded at a density of 4.0 × 10^4
cells per well in a white 96-well microtiter plate (Greiner) in phenol-red-free, high-glucose
DMEM medium (Thermo Fisher) supplemented with 5% fetal bovine serum (Thermo Fisher).
Transfections were performed with a total DNA amount of 200 ng per well. If the amount of
expression plasmid was below 200 ng per well, pcDNA3.1(+) was used as carrier DNA to reach the
total DNA amount of 200 ng. All protein pairs were tested in both N-terminal fusion
orientations (NL-A with mCit-B and NL-B with mCit-A). The following proteins were also tested
as C-terminal fusions: CSNK2B-NL, ESRRG-NL, CUL3-NL, PEX3-NL, PEX19-NL, PSMC5-NL, PEX3-mCit,
PEX19-mCit, PEX16-mCit, RORB-mCit, ESRRG-mCit, PAX6-mCit, CSNK2B-mCit, PSMC5-mCit, KCTD7-mCit
(Dataset EV12).
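
As a small worked example of the carrier DNA rule in the Transfection paragraph (expression plasmids topped up with pcDNA3.1(+) to 200 ng total DNA per well), the following sketch computes the carrier amount for a given donor and acceptor amount. The function name is hypothetical, and returning zero when the expression plasmids already reach the total is an assumption of this sketch, not something stated in the text.

```python
TOTAL_DNA_NG = 200.0  # total DNA per well, as stated in the Transfection paragraph


def carrier_dna_ng(donor_ng: float, acceptor_ng: float, total_ng: float = TOTAL_DNA_NG) -> float:
    """Amount of carrier DNA (ng) needed to bring a well up to the total DNA amount."""
    if donor_ng < 0 or acceptor_ng < 0:
        raise ValueError("DNA amounts must be non-negative")
    # Assumption: no carrier is added once the expression plasmids reach the total.
    return max(0.0, total_ng - (donor_ng + acceptor_ng))


# Donor:acceptor amounts (ng) used for interaction testing in this study.
for donor, acceptor in [(2, 50), (8, 25), (8, 50)]:
    print(f"donor {donor} ng + acceptor {acceptor} ng -> carrier {carrier_dna_ng(donor, acceptor)} ng")
```

For the default 2:50 ng donor:acceptor setup this gives 148 ng of carrier DNA per well.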
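
The corrected BRET (cBRET) calculation and the binding criteria described in the Measurement and "Determination of binding events in BRET assay" subsections of this BRET assay section can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, and the raw BRET ratio is assumed to be long-wavelength over short-wavelength luminescence, following the general description in Trepte et al (2018). The thresholds (cBRET ≥ 0.05, fluorescence ≥ 500, total luminescence ≥ 50,000) are the ones given in the text.

```python
from dataclasses import dataclass


@dataclass
class WellReadout:
    fluorescence: float        # mCitrine fluorescence (FL), proxy for acceptor expression
    total_luminescence: float  # total luminescence, proxy for NL (donor) expression
    short_wl: float            # short-wavelength luminescence (BLUE1 filter)
    long_wl: float             # long-wavelength luminescence (GREEN1 filter)

    @property
    def bret_ratio(self) -> float:
        # Assumption: raw BRET ratio = long-wavelength / short-wavelength luminescence.
        return self.long_wl / self.short_wl


def corrected_bret(test: WellReadout, nl_stop_control: WellReadout,
                   mcit_stop_control: WellReadout) -> float:
    """Subtract the larger of the two control BRET ratios from the test pair."""
    return test.bret_ratio - max(nl_stop_control.bret_ratio, mcit_stop_control.bret_ratio)


def is_binding_event(test: WellReadout, cbret: float, min_cbret: float = 0.05,
                     min_fluorescence: float = 500.0,
                     min_luminescence: float = 50_000.0) -> bool:
    """Apply the cBRET and expression-level thresholds stated in the text."""
    return (cbret >= min_cbret
            and test.fluorescence >= min_fluorescence
            and test.total_luminescence >= min_luminescence)


# Toy readouts (made-up numbers, for illustration only).
test_pair = WellReadout(fluorescence=1200, total_luminescence=80_000, short_wl=1000, long_wl=400)
nl_stop = WellReadout(fluorescence=1100, total_luminescence=90_000, short_wl=1000, long_wl=300)
mcit_stop = WellReadout(fluorescence=50, total_luminescence=85_000, short_wl=1000, long_wl=250)

cbret = corrected_bret(test_pair, nl_stop, mcit_stop)
print(round(cbret, 3), is_binding_event(test_pair, cbret))
```

Taking the maximum over both control pairs means a pair is only scored as binding if its signal exceeds the stronger of the two background sources.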
Measurement
The plate was incubated for 2 days at 37 °C, 5% CO2, and 85% RH before measurements. All
measurements were done with the Infinite M200 Pro microplate reader (Tecan). First, 100 µl of
the medium was aspirated from each well. The mCitrine fluorescence (FL) was measured in intact
cells (excitation/emission 513 nm/548 nm) using a gain of 100. On rare occasions, the plate
reader recorded an overflow with these settings (i.e. for GIGYF1 constructs). In these cases,
we repeated the measurement with optimal gain settings and used a fluorescein control to
normalize fluorescence signals measured with different gain settings. For this purpose,
Fluorescein was obtained from Sigma-Aldrich (Catalog No 46955-250MG-F) and used without
further purification. A stock solution of Fluorescein (1 mg/ml in ethanol) was prepared by
dissolving 1.3 mg Fluorescein in 1.3 ml absolute ethanol. 100 µl of a 20 µg/ml solution of
Fluorescein was added to an empty well immediately before starting the fluorescence
measurements. The 20 µg/ml solution of Fluorescein was obtained by preparing a 1:50 dilution
of the stock solution in water. After measuring the fluorescence, coelenterazine-h (PJK
Biotech GmbH) was added to a final concentration of 5 µM. The cells were briefly shaken for
15 s and incubated for 15 min inside the plate reader at 37 °C. After incubation, total
luminescence was measured first, followed by short-wavelength (WL) and long-wavelength
luminescence (LU) measurements using the BLUE1 (370–480 nm) and the GREEN1 (520–570 nm)
filters at 1000 ms integration time. Corrected BRET ratios were calculated as described in
Trepte et al (2018). Briefly, for every transfected protein pair NL-A and mCit-B, the
following two control pairs were measured: NL-Stop with mCit-B and NL-A with mCit-Stop. The
maximal BRET from both control pairs was subtracted from the actual test pair to correct for
donor bleedthrough, unspecific binding to the tags, and background signal.

Determination of binding events in BRET assay
To determine whether a protein pair interacted in the BRET assay or not, we used
donor:acceptor DNA transfection ratios of 2:50 ng in all cases except for PEX3-PEX16, where we
used 8:25 ng, and PEX3-PEX19, where we used 8:50 ng, due to low expression levels of PEX3 and
a degradation effect of higher PEX16 protein levels on PEX3 expression levels. We required
cBRETs determined at these transfection ratios to be ≥0.05, fluorescence measurements
representing mCitrine fusion expression levels to be ≥500 units, and total luminescence
measurements representing NL fusion expression levels to be ≥50,000.

Saturation assay
For donor saturation experiments various donor DNA amounts (1, 2, 4 and 8 ng) encoding
NL-fused proteins were co-transfected with increasing amounts of acceptor DNA (12.5, 25, 50,
100, 200 ng) encoding mCitrine-fused proteins. Fluorescence, total luminescence, and BRET
measurements were done as described before. BRET measurements were corrected for bleedthrough
using NL-Stop transfections. Fluorescence and total luminescence measurements were corrected
for background signal using transfections with pcDNA3.1(+) and subsequently used to estimate
amounts of expressed proteins and to plot acceptor/donor ratios on the x-axis of titration
plots.

Fitting of titration curves
Titration curves were fitted using the leastsq function from the scipy.optimize Python package
(Virtanen et al, 2020) using the model BRET = ((A/D) * BRETmax)/(BRET50 + (A/D)) described in
Drinovec et al (2012), which assumes a 1:1 binding mode, to obtain estimates for the BRETmax
and BRET50. Standard errors of the BRET50 estimates were obtained from the variance-covariance
matrix, calculated by multiplying the fractional covariance matrix (output by the leastsq
function) by the residual variance. Measuring BRET signals in intact cells for increasing
acceptor/donor protein expression ratios results in an eventual saturation of the signal.
Fitting this curve allows extraction of the maximal BRET that can be reached and the BRET50,
which is the acceptor/donor ratio at which half of the maximal BRET is obtained. The BRET50 is
indicative of binding affinity, in analogy to the IC50; however, its accurate estimation
requires saturation of the BRET to be observed in the experimental system, which cannot always
be achieved because of limited amounts of DNA that cells can be transfected with.
Alternatively, if mutations are unlikely to change the overall structure of the fusion
constructs and do not alter expression levels compared to wildtype, single point BRET
measurements at acceptor/donor ratios prior to BRET saturation are also indicative of changes
in binding strength. The BRET titration curves that we obtained for the PNKP-TRIM37
interaction clearly deviated from the assumed 1:1 binding mode because at higher
acceptor:donor ratios we observed a sudden increase in BRET again, contrary to an expected
saturation. The model could thus not be fitted to the titration data.

Antibodies

Purified anti-HA.11 Epitope Tag, Clone: [16B12], Mouse, Monoclonal (Biolegend, BLD-901502),
1:2000.
Purified anti-GIGYF1, Rabbit, Polyclonal (BETHYL laboratories, Cat. #A304-132A-1), 1:1000.
GAPDH Loading Control Monoclonal Antibody (GA1R), HRP-coupled (Thermo Fisher Cat.
MA515738HRP), 1:3000.

Co-immunoprecipitation and western blot

Snrpb (full-length) and a C-terminal truncation mutant (amino acids 1-190) were cloned from
mouse cDNA and ligated into the pFRT-TO destination plasmid using AscI and PacI restriction
sites. The constructs additionally contain C-terminal 2xHA and mNeonGreen tags. Flp-In™
T-REx™ 293 Cell Lines (Thermo Fisher, catalog number: R78007) expressing Snrpb endogenously
from a single locus were generated according to the manufacturer’s instructions. In brief,
pFRT-TO and pOG44 plasmids were co-transfected and hygromycin-resistant colonies were grown,
picked and expanded. The Snrpb transgene expression was validated by western blot, RT-qPCR,
and immunofluorescence, which showed that ectopic Snrpb-HA was expressed at levels highly
similar to the endogenous Snrpb protein.
For the co-immunoprecipitation experiments, 8 × 10^6 cells were seeded in a 10 cm dish. The
following day, expression of Snrpb-HA was induced by adding 0.1 μg/mL Doxycycline (D9891,
Sigma Aldrich) to the culture medium. Parental cells not expressing any HA-tagged transgene
were used as a negative control for immunoprecipitation. The next morning the cells were
harvested by scraping in culture media, followed by centrifugation and a
single wash in ice-cold PBS. The whole cell extract was prepared by 15 min incubation on ice
with 0.3 mL of lysis buffer (200 mM NaCl, 50 mM HEPES, pH 7.6, 0.1% IGEPAL, 10 mM MgCl2, 10%
Glycerol, Protease Inhibitor Cocktail (P8340, Sigma Aldrich), Phosphatase Inhibitor (P5726,
Sigma Aldrich)) followed by 2 cycles of sonication in a Bioruptor Plus (30 s on, 30 s off) and
centrifugation for 20 min at 16,000 × g. The extract was quantified by a Bradford assay and
1 mg was used for immunoprecipitation, for which the NaCl concentration was adjusted to
100 mM final concentration by diluting with an equal volume of Lysis Buffer containing 0 mM
NaCl. 0.05 mg was set aside as input control (5%). 0.02 mL of Thermo Scientific™ Pierce™
Anti-HA Magnetic Beads (Thermo Fisher Cat. 13464229) were incubated with 1 mg protein extract
for 1 h at 4 °C on a rotating wheel. The beads were washed three times before eluting the
immunoprecipitated proteins with 0.02 mL of 1 x NuPAGE™ LDS Sample Buffer by incubating at
42 °C for 10 min while shaking at 800 rpm. Another 0.01 mL was used for a second elution, and
the eluates were combined to a total of 30 μL, transferred to a fresh tube, and supplemented
with 3 μL of 1 M DTT. Input and immunoprecipitated eluates were then separated on a 10%
Tris-Glycine SDS PAGE using 1xMOPS buffer, immunoblotted on 0.45 μm PVDF membranes
(Tris-Glycine Transfer Buffer, 10% Methanol, 300 mA, 1 hour), and blocked with 5% milk in
TBS-0.2% Tween for 30 min at RT. Primary antibodies were incubated overnight at 4 °C on a
rocker followed by washes and incubation with secondary HRP-labeled antibodies (1 h at RT in
5% milk, TBS-0.2% Tween). Blots were developed using Pierce™ ECL Western Blotting Substrate
(Thermo Fisher Cat. 32209) or SuperSignal West Femto Maximum Sensitivity Substrate Kit (Thermo
Fisher Cat. 34095) and imaged on a ChemiDoc MP V3 (Bio-Rad). The cell line was authenticated
via X-Gal staining, qPCR and Sanger Sequencing.

Data availability

The datasets and computer code produced in this study are available in the following
databases:
- Interaction data: submitted to the IMEx (http://www.imexconsortium.org) consortium through
IntAct (Del Toro et al, 2022) and assigned the identifier IM-29904.
- Computer scripts for data processing and analysis: available at GitHub under
https://github.com/KatjaLuckLab/AlphaFold_manuscript.

Expanded view data, supplementary information, appendices are available for this paper at
https://doi.org/10.1038/s44320-023-00005-6.

Peer review information

A peer review file is available at https://doi.org/10.1038/s44320-023-00005-6

References

Ajuh P, Chusainow J, Ryder U, Lamond AI (2002) A novel function for human factor C1 (HCF-1), a host protein required for herpes simplex virus infection, in pre-mRNA splicing. EMBO J 21:6590–6602
Akdel M, Pires DEV, Pardo EP, Jänes J, Zalevsky AO, Mészáros B, Bryant P, Good LL, Laskowski RA, Pozzati G et al (2022) A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol 29:1056–1067
Basu S, Wallner B (2016) DockQ: a quality measure for protein-protein docking models. PLoS ONE 11:e0161879
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28:235–242
Braun P, Tasan M, Dreze M, Barrios-Rodiles M, Lemmens I, Yu H, Sahalie JM, Murray RR, Roncari L, de Smet AS, Venkatesan K, Rual JF, Vandenhaute J, Cusick ME, Pawson T, Hill DE, Tavernier J, Wrana JL, Roth FP, Vidal M (2009) An experimentally derived confidence score for binary protein-protein interactions. Nature Methods 6:91–97
Bret H, Andreani J, Guerois R (2023) From interaction networks to interfaces: Scanning intrinsically disordered regions using AlphaFold2. Preprint at BioRxiv https://doi.org/10.1101/2023.05.25.542287
Bronkhorst AW, Lee CY, Möckel MM, Ruegenberg S, de Jesus Domingues AM, Sadouki S, Piccinno R, Sumiyoshi T, Siomi MC, Stelzl L, Luck K, Ketting RF (2023) An extended Tudor domain within Vreteno interconnects Gtsf1L and Ago3 for piRNA biogenesis in Bombyx mori. EMBO J 42(24):e114072 https://doi.org/10.15252/embj.2023114072
Bryant P, Pozzati G, Elofsson A (2022) Improved prediction of protein-protein interactions using AlphaFold2. Nat Commun 13:1265
Buel GR, Walters KJ (2022) Can AlphaFold2 predict the impact of missense mutations on structure? Nat Struct Mol Biol 29:1–2
Bugge K, Brakti I, Fernandes CB, Dreier JE, Lundsgaard JE, Olsen JG, Skriver K, Kragelund BB (2020) Interactions by disorder - a matter of context. Front Mol Biosci 7:110
Burke DF, Bryant P, Barrio-Hernandez I, Memon D, Pozzati G, Shenoy A, Zhu W, Dunham AS, Albanese P, Keller A et al (2023) Towards a structurally resolved human protein interaction network. Nat Struct Mol Biol 30:216–225
Chang L, Perez A (2023) Ranking peptide binders by affinity with AlphaFold. Angew Chem Int Ed 62:e202213362
Choi SG, Olivet J, Cassonnet P, Vidalain PO, Luck K, Lambourne L, Spirohn K, Lemmens I, Dos Santos M, Demeret C, Jones L, Rangarajan S, Bian W, Coutant EP, Janin YL, van der Werf S, Trepte P, Wanker EE, De Las Rivas J, Tavernier J, Twizere JC, Hao T, Hill DE, Vidal M, Calderwood MA, Jacob Y (2019) Maximizing binary interactome mapping with a minimal number of assays. Nature Communications 10:3907
Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B et al (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25:1422–1423
Dana JM, Gutmanas A, Tyagi N, Qi G, O’Donovan C, Martin M, Velankar S (2019) SIFTS: updated structure integration with function, taxonomy and sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins. Nucleic Acids Res 47:D482–D489
Davey NE, Van Roey K, Weatheritt RJ, Toedt G, Uyar B, Altenberg B, Budd A, Diella F, Dinkel H, Gibson TJ (2012) Attributes of short linear motifs. Mol Biosyst 8:268–281
Del Toro N, Shrivastava A, Ragueneau E, Meldal B, Combe C, Barrera E et al (2022) The IntAct database: efficient access to fine-grained molecular interaction data. Nucleic Acids Res 50(D1):D648–53
Drew K, Lee C, Huizar RL, Tu F, Borgeson B, McWhite CD, Ma Y, Wallingford JB, Marcotte EM (2017) Integration of over 9000 mass spectrometry experiments builds a global map of human protein complexes. Molecular Systems Biology 13:932
Drinovec L, Kubale V, Nøhr Larsen J, Vrecl M (2012) Mathematical models for quantitative assessment of bioluminescence resonance energy transfer:
application to seven transmembrane receptors oligomerization. Front Endocrinol 3:104
Durocher D, Taylor IA, Sarbassova D, Haire LF, Westcott SL, Jackson SP, Smerdon SJ, Yaffe MB (2000) The Molecular Basis of FHA Domain:Phosphopeptide Binding Specificity and Implications for Phospho-Dependent Signaling Mechanisms. Molecular Cell 6:1169–1182
Ebersberger S, Hipp C, Mulorz MM, Buchbender A, Hubrich D, Kang HS, Martínez-Lumbreras S, Kristofori P, Sutandy FXR, Llacsahuanga Allcca L, Schönfeld J, Bakisoglu C, Busch A, Hänel H, Tretow K, Welzel M, Di Liddo A, Möckel MM, Zarnack K, Ebersberger I, Legewie S, Luck K, Sattler M, König J (2023) FUBP1 is a general splicing factor facilitating 3′ splice site recognition and splicing of long introns. Molecular Cell 83:2653–2672
Ernst JA, Brunger AT (2003) High Resolution Structure Stability and Synaptotagmin Binding of a Truncated Neuronal SNARE Complex. Journal of Biological Chemistry 278:8630–8636
Evans R, O’Neill M, Pritzel A, Antropova N, Senior AW, Green T, Žídek A, Bates R, Blackwell S, Yim J et al (2021) Protein complex prediction with AlphaFold-Multimer. Preprint at BioRxiv https://doi.org/10.1101/2021.10.04.463034
Firth HV, Wright CF, DDD Study (2011) The deciphering developmental disorders (DDD) study. Dev Med Child Neurol 53:702–703
Freiman RN, Herr W (1997) Viral mimicry: common mode of association with HCF by VP16 and the cellular protein LZIP. Genes Dev 11:3122–3127
Freund C, Kühne R, Yang H, Park S, Reinherz EL, Wagner G (2002) Dynamic interaction of CD2 with the GYF and the SH3 domain of compartmentalized effector molecules. EMBO J 21:5985–5995
Fujiki Y, Matsuzono Y, Matsuzaki T, Fransen M (2006) Import of peroxisomal membrane proteins: the interplay of Pex3p- and Pex19p-mediated interactions. Biochim Biophys Acta 1763:1639–1646
Fujiki Y, Okumoto K, Honsho M, Abe Y (2022) Molecular insights into peroxisome homeostasis and peroxisome biogenesis disorders. Biochim Biophys Acta Mol Cell Res 1869:119330
Henrie A, Hemphill SE, Ruiz-Schultz N, Cushman B, DiStefano MT, Azzariti D, Harrison SM, Rehm HL, Eilbeck K (2018) ClinVar Miner: demonstrating utility of a Web-based tool for viewing and filtering ClinVar data. Hum Mutat 39:1051–1060
Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9:90–95
Huttlin EL, Bruckner RJ, Navarrete-Perea J, Cannon JR, Baltier K, Gebreab F, Gygi MP, Thornock A, Zarraga G, Tam S et al (2021) Dual proteome-scale networks reveal cell-specific remodeling of the human interactome. Cell 184:3022–3040.e28
Jehl P, Manguy J, Shields DC, Higgins DG, Davey NE (2016) ProViz-a web-based visualization tool to investigate the functional and evolutionary features of protein sequences. Nucleic Acids Res 44:W11–5
Johansson-Åkhe I, Mirabello C, Wallner B (2021) Interpeprank: assessment of docked peptide conformations by a deep graph network. Front Bioinform 1:763102
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589
Krystkowiak I, Davey NE (2017) SLiMSearch: a framework for proteome-wide discovery and annotation of functional modules in intrinsically disordered regions. Nucleic Acids Res 45:W464–W469
Kumar M, Michael S, Alvarado-Valverde J, Mészáros B, Sámano-Sánchez H, Zeke A, Dobson L, Lazar T, Örd M, Nagpal A et al (2022) The Eukaryotic Linear Motif resource: 2022 release. Nucleic Acids Res 50:D497–D508
Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157:105–132
Letunic I, Khedkar S, Bork P (2021) SMART: recent updates, new developments and status in 2020. Nucleic Acids Res 49:D458–D460
Leung AKW, Nagai K, Li J (2011) Structure of the spliceosomal U4 snRNP core domain and its implication for snRNP biogenesis. Nature 473:536–539
Lu R, Yang P, O’Hare P, Misra V (1997) Luman, a new member of the CREB/ATF family, binds to herpes simplex virus VP16-associated host cellular factor. Mol Cell Biol 17:5117–5126
Luck K, Charbonnier S, Travé G (2012) The emerging contribution of sequence context to the specificity of protein interactions mediated by PDZ domains. FEBS Lett 586:2648–2661
Luck K, Kim D-K, Lambourne L, Spirohn K, Begg BE, Bian W, Brignall R, Cafarelli T, Campos-Laborie FJ, Charloteaux B et al (2020) A reference map of the human binary protein interactome. Nature 580:402–408
Machida YJ, Machida Y, Vashisht AA, Wohlschlegel JA, Dutta A (2009) The deubiquitinating enzyme BAP1 regulates cell growth via interaction with HCF-1. J Biol Chem 284:34179–34188
Matsuzaki T, Fujiki Y (2008) The peroxisomal membrane protein import receptor Pex3p is directly transported to peroxisomes by a novel Pex19p- and Pex16p-dependent pathway. J Cell Biol 183:1275–1286
McKinney W (2010) Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference pp 56–61. SciPy
Mishra M, Jiang H, Wei Q (2023) New insights on the differential interaction of sulfiredoxin with members of the peroxiredoxin family revealed by protein-protein docking and experimental studies. Eur J Pharmacol 954:175873
Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ et al (2021) Pfam: the protein families database in 2021. Nucleic Acids Res 49:D412–D419
Miyata T, Miyazawa S, Yasunaga T (1979) Two types of amino acid substitutions in protein evolution. J Mol Evol 12:219–236
Mo X, Niu Q, Ivanov AA, Tsang YH, Tang C, Shu C, Li Q, Qian K, Wahafu A, Doyle SP, Cicka D, Yang X, Fan D, Reyna MA, Cooper LAD, Moreno CS, Zhou W, Owonikoko TK, Lonial S, Khuri FR, Du Y, Ramalingam SS, Mills GB, Fu H (2022) Systematic discovery of mutation-directed neo-protein-protein interactions in cancer. Cell 185:1974–1985
Mosca R, Céol A, Aloy P (2013) Interactome3D: adding structural details to protein networks. Nat Methods 10:47–53
Mosca R, Céol A, Stein A, Olivella R, Aloy P (2014) 3did: a catalog of domain-based interactions of known three-dimensional structure. Nucleic Acids Res 42:D374–9
O’Reilly FJ, Graziadei A, Forbrig C, Bremenkamp R, Charles K, Lenz S, Elfmann C, Fischer L, Stülke J, Rappsilber J (2023) Protein complexes in cells by AI-assisted structural proteomics. Mol Syst Biol 19:e11544
ORFeome Collaboration (2016) The ORFeome Collaboration: a genome-scale human ORF-clone resource. Nat Methods 13:191–192
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Müller A, Nothman J, Louppe G et al (2012) Scikit-learn: Machine Learning in Python. arXiv
Persson E, Sonnhammer ELL (2023) InParanoiDB 9: ortholog groups for protein domains and full-length proteins. J Mol Biol 435:168001
Pozzati G, Zhu W, Bassot C, Lamb J, Kundrotas P, Elofsson A (2022) Limits and potential of combined folding and docking. Bioinformatics 38:954–961
Schmidt F, Treiber N, Zocher G, Bjelic S, Steinmetz MO, Kalbacher H, Stehle T, Dodt G (2010) Insights into peroxisome function from the structure of PEX3 in complex with a soluble fragment of PEX19. J Biol Chem 285:25410–25417
Sobti M, Mead BJ, Stewart AG, Igreja C, Christie M (2023) Molecular basis for GIGYF–TNRC6 complex assembly. RNA 29:724–734
Teufel F, Refsgaard JC, Kasimova MA, Deibler K, Madsen CT, Stahlhut C, Grønborg M, Winther O, Madsen D (2023) Deorphanizing peptides using structure prediction. J Chem Inf Model 63:2651–2655
Tompa P, Davey NE, Gibson TJ, Babu MM (2014) A million peptide motifs for the molecular biologist. Mol Cell 55:161–169
Trepte P, Kruse S, Kostova S, Hoffmann S, Buntru A, Tempelmeier A, Secker C, Diez L, Schulz A, Klockmeier K et al (2018) LuTHy: a double-readout bioluminescence-based two-hybrid technology for quantitative mapping of protein-protein interactions in mammalian cells. Mol Syst Biol 14:e8071
Trepte P, Secker C, Choi SG, Olivet J, Ramos ES, Cassonnet P, Golusik S, Zenkner M, Beetz S, Sperling M et al (2021) A quantitative mapping approach to identify direct interactions within complexomes. Preprint at BioRxiv https://doi.org/10.1101/2021.08.25.457734
Trepte P, Secker C, Kostova S, Maseko SB, Choi SG, Blavier J, Minia I, Ramos ES, Cassonnet P, Golusik S et al (2023) AI-guided pipeline for protein-protein interaction drug discovery identifies a SARS-CoV-2 inhibitor. Preprint at BioRxiv https://doi.org/10.1101/2023.06.14.544560
Tsaban T, Varga JK, Avraham O, Ben-Aharon Z, Khramushin A, Schueler-Furman O (2022) Harnessing protein folding neural networks for peptide-protein docking. Nat Commun 13:176
Van Roey K, Gibson TJ, Davey NE (2012) Motif switches: decision-making in cell regulation. Curr Opin Struct Biol 22:378–385
Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A et al (2022) AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 50:D439–D444
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272
Waskom M (2021) seaborn: statistical data visualization. JOSS 6:3021
Weatheritt RJ, Jehl P, Dinkel H, Gibson TJ (2012) iELM-a web server to explore short linear motif-mediated interactions. Nucleic Acids Res 40:W364–W369
Zhao G, Li K, Li B, Wang Z, Fang Z, Wang X, Zhang Y, Luo T, Zhou Q, Wang L et al (2020) Gene4Denovo: an integrated database and analytic platform for de novo mutations in humans. Nucleic Acids Res 48:D913–D926

Acknowledgements
We thank all members of the Luck, Gibson, and Schueler-Furman labs as well as Julian König and Anton Khmelinskii for helpful discussions and input. We thank Izabella Krystkowiak and Norman Davey for helping us access the SLiMSearch resource with an API. We thank Fridolin Kielisch for advice on statistical analysis as well as the media lab and protein production core facilities of IMB. Support from IMB’s IT department and especially help from Christian Dietrich for local installations of AlphaFold is gratefully acknowledged. The GPU cluster on which part of the AlphaFold predictions were performed was funded by the Ministry of Science and Health (MWG), Rhineland Palatinate (funding ID: TB-Nr.:3658/19). We are very thankful for support from EMBL IT Services and the HPC resources for running AlphaFold predictions for this project. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-IDs LU 2568/1-1 and SFB1551 Project No 464588647 awarded to KL. JS acknowledges a PhD stipend from IMB’s collaborative research initiative. JKV was supported by the European Union’s Horizon 2020 UBIMOTIF programme (860517). This work was supported, in whole or in part, by the Israel Science Foundation, founded by the Israel Academy of Science and Humanities (grant number 301/2021 to OS-F).

Author contributions
Chop Yan Lee: Data curation; Formal analysis; Investigation; Visualization; Methodology; Writing – original draft; Project administration; Writing – review and editing. Dalmira Hubrich: Data curation; Formal analysis; Investigation; Visualization; Methodology; Writing – original draft; Writing – review and editing. Julia K Varga: Data curation; Formal analysis; Investigation; Visualization; Writing – original draft; Writing – review and editing. Christian Schäfer: Data curation; Investigation; Methodology. Mareen Welzel: Investigation. Eric Schumbera: Methodology. Milena Djokic: Data curation. Joelle M Strom: Formal analysis; Investigation; Visualization. Jonas Schönfeld: Investigation. Johanna L Geist: Investigation. Feyza Polat: Investigation. Toby J Gibson: Resources; Supervision; Writing – review and editing. Claudia Isabelle Keller Valsecchi: Supervision; Funding acquisition; Investigation; Writing – review and editing. Manjeet Kumar: Resources; Formal analysis; Methodology; Writing – review and editing. Ora Schueler-Furman: Conceptualization; Supervision; Funding acquisition; Writing – original draft; Writing – review and editing. Katja Luck: Conceptualization; Data curation; Formal analysis; Supervision; Funding acquisition; Investigation; Visualization; Methodology; Writing – original draft; Project administration; Writing – review and editing.

Disclosure and competing interest statement
The authors declare no competing interests.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the data associated with this article, unless otherwise stated in a credit line to the data, but does not extend to the graphical or creative elements of illustrations, charts, or figures. This waiver removes legal barriers to the re-use and mining of research data. According to standard scholarly practice, it is recommended to provide appropriate citation and attribution whenever technically possible.

© The Author(s) 2024
2.4.1 Supplementary material
Appendix 
 
Systematic discovery of protein interaction interfaces using 
AlphaFold and experimental validation 
 
Chop Yan Lee1,†, Dalmira Hubrich1,†, Julia K. Varga2,†, Christian Schäfer1, Mareen 
Welzel1, Eric Schumbera3, Milena Đokić1, Joelle M. Strom1, Jonas Schönfeld1, 
Johanna L. Geist1, Feyza Polat1, Toby J. Gibson4, Claudia Isabelle Keller Valsecchi1, 
Manjeet Kumar4, Ora Schueler-Furman2,*, Katja Luck1,** 
 
Affiliations 
1 Institute of Molecular Biology (IMB) gGmbH, 55128 Mainz, Germany. 
2 Department of Microbiology and Molecular Genetics, Institute for Biomedical Research 
Israel-Canada, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 
9112001, Israel. 
3 Institute of Molecular Biology (IMB) gGmbH, 55128 Mainz, Germany. Current address: 
Computational Biology and Data Mining Group Biozentrum I 55128 Mainz, Germany. 
4 Structural and Computational Biology Unit, European Molecular Biology Laboratory, 
Heidelberg, 69117, Germany. 
 
*Corresponding author. Tel: +972-2-675-7094, E-mail: ora.furman-schueler@mail.huji.ac.il 
**Corresponding author. Tel: +49-(0)6131-3921440, E-mail: k.luck@imb-mainz.de 
†These authors contributed equally to this work. 
 
Table of contents 
Appendix Text S1. Summary of observations from the manual inspection of AlphaFold 
models generated from fragmentation approach on PPIs connecting NDD proteins. 
Appendix Figure S1. Benchmarking of AF on DMI interfaces using minimal interacting 
regions. 
Appendix Figure S2. Benchmarking and application of AF for DMI interface prediction using 
minimal interacting fragments. 
Appendix Figure S3. Effect of protein fragment extensions on the accuracy of AF 
predictions. 
Appendix Figure S4. Effect of protein fragment extensions on the accuracy of AF 
predictions. 
Appendix Figure S5. Comparison of AF v2.2 and v2.3 prediction performance. 
Appendix Figure S6. Performance of different metrics derived from structural models when 
benchmarking AF v2.3 for DMI predictions. 
Appendix Figure S7. Expression and BRET50 plots for TRIM37-PNKP and ESRRG-
PSMC5. 
Appendix Figure S8. Structural models, expression, and BRET50 plots for STX1B-FBXO28 
and STX1B-VAMP2. 
Appendix Figure S9. Structural models, expression, and BRET50 plots for PEX3-PEX19 
and PEX3-PEX16. 
Appendix Figure S10. Expression and BRET50 plots for SNRPB-GIGYF1. 
 
 
Appendix Text S1. Summary of observations from the manual inspection of AlphaFold 
models generated from fragmentation approach on PPIs connecting NDD proteins. 
 
run14: PLP1-MFF 
Top prediction involves an ordered region from PLP1 and a disordered fragment from MFF, 
with a model confidence of 0.75. Looking at the predicted model, the peptide is tilted at an 
angle to the bundle of helices of PLP1, unlike a typical coiled-coil interaction. There is also no 
trend of increasing confidence with shorter fragments. The interface does not look very convincing. 
While the disordered region in MFF is likely to be a functional motif, the 4-helix bundle domain 
in PLP1 that AF models it to bind to is known to be a transmembrane domain, so the binding 
site is actually buried inside the membrane. AF is also not very confident about the domain 
structure, especially for the parts that are at the membrane surface or outside of it. The 
prediction is likely wrong. 
 
run17: PAX6-CSNK2A1 
CSNK2A1 is a widely active kinase involved in many processes. Overlapping fragments from 
PAX6 show a trend of increasing confidence the shorter the fragment. CSNK2A1 is predicted to 
bind with its kinase domain (it doesn’t really have anything other than the kinase domain) to a 
peptide in PAX6 that looks like a good linear motif, i.e. it is conserved, not part of a 
folded domain as predicted by AF, and predicted by AF to form an alpha helix. The motif, though, 
overlaps with a putative NLS. The PAX6 motif is clearly predicted to bind to a pocket at the 
bottom of the N-lobe of the kinase domain, away from the catalytic site. Digging 
deeper, I found a structure, 1JWH, which shows that this is the pocket bound by CSNK2B, 
the regulatory subunit that interacts with the catalytic subunit to form an active holoenzyme. 
This, however, does not eliminate the possibility that the AF prediction is right, since the peptide 
looks like a functional motif. 
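
The "trend of increasing confidence the shorter the fragment" used as a criterion in several runs can be checked mechanically. The following is a minimal sketch, not the code used for these inspections: the function name, the tuple layout, and all numbers are hypothetical and only illustrate the idea of ordering overlapping fragments from longest to shortest and testing whether model confidence keeps rising.

```python
from typing import List, Tuple

# (fragment_start, fragment_end, model_confidence)
# All values used below are hypothetical and for illustration only.
Fragment = Tuple[int, int, float]


def confidence_rises_as_fragments_shorten(
    fragments: List[Fragment], tolerance: float = 0.0
) -> bool:
    """True if confidence never drops (beyond tolerance) from longest to shortest fragment."""
    # Sort by fragment length, longest first.
    ordered = sorted(fragments, key=lambda f: f[1] - f[0], reverse=True)
    confidences = [conf for _, _, conf in ordered]
    return all(
        later >= earlier - tolerance
        for earlier, later in zip(confidences, confidences[1:])
    )


# Hypothetical nested fragments around one candidate motif.
with_trend = [(100, 160, 0.41), (110, 150, 0.55), (120, 140, 0.68), (125, 135, 0.79)]
without_trend = [(100, 160, 0.62), (110, 150, 0.48), (120, 140, 0.66), (125, 135, 0.51)]

print(confidence_rises_as_fragments_shorten(with_trend))
print(confidence_rises_as_fragments_shorten(without_trend))
```

A small `tolerance` can be passed to ignore minor dips in confidence between neighbouring fragment lengths.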
 
run18: PAX6-SET 
Top prediction is ordered-ordered, between the PAX6 homeodomain and the SET NAP domain. The structure 
6PAX shows the PAX domain, consisting of two folds similar to the homeodomain, bound to 
DNA, but the three-helix bundles are not oriented in exactly the same way as in the 
homeodomain, so I have a hard time seeing where the homeodomain would bind DNA. 
AF models the interface of the homeodomain with the NAP domain of SET as a charged interface, 
with many positively charged residues on the homeodomain contacting a patch of negatively 
charged residues on the NAP domain. It could be that this patch of positively charged residues 
on the homeodomain would usually interact with the negatively charged backbone of DNA, 
but the predicted structure from AF looks interesting since the interface likely does not interfere 
with SET homodimerization (2E50). 
 
run19: PAX6-TLK2 
All predictions with >0.7 model confidence are paired with the Pkinase domain of TLK2, and 
they are all predicted to bind at the bottom of the beta barrel fold (N-lobe) of the kinase domain. 
However, almost all peptides come from very different regions in PAX6, so there are no recurrent 
predictions here. 
When ranking by the motif pLDDT metric, the top predictions also involve two distinct 
motifs predicted to bind to the long helices in TLK2. However, AF predicts the two helices to 
form intramolecular contacts. By splitting them into separate fragments, it could be that 
intramolecular contact sites are now used for interface prediction. 
A DMI is predicted for this protein pair, MOD_GSK3_1 (PAX6 395-402). The 
peptide PAX6 394-404 was paired with the Pkinase domain but, similar to the previous point, 
it is also placed at the beta barrel fold in the N-lobe and not at the substrate binding site. 
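
Whether high-confidence peptides are "recurrent", i.e. come from overlapping regions of the same protein, can be assessed by grouping their residue ranges. This is a minimal sketch under stated assumptions: the function names and the example ranges are hypothetical, and "recurrent" is taken here to mean at least two predicted peptides whose ranges overlap.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # inclusive residue range of a predicted peptide


def group_overlapping(ranges: List[Range]) -> List[List[Range]]:
    """Group residue ranges that overlap, directly or through a chain of overlaps."""
    groups: List[List[Range]] = []
    current_end: Optional[int] = None
    for start, end in sorted(ranges):
        if groups and current_end is not None and start <= current_end:
            groups[-1].append((start, end))
            current_end = max(current_end, end)
        else:
            groups.append([(start, end)])
            current_end = end
    return groups


def recurrent_regions(ranges: List[Range], min_hits: int = 2) -> List[List[Range]]:
    """Keep only groups supported by at least min_hits overlapping peptides."""
    return [group for group in group_overlapping(ranges) if len(group) >= min_hits]


# Hypothetical ranges of high-confidence peptides from one protein.
scattered = [(12, 22), (150, 170), (300, 310), (394, 404)]
recurring = [(12, 22), (150, 170), (394, 404), (395, 402)]

print(recurrent_regions(scattered))
print(recurrent_regions(recurring))
```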
 
run20: PAX6-NGLY1 
The PUB domain of NGLY1 (Q96IV0) gives good model confidence, >0.8, in binding 
overlapping disordered fragments of PAX6 (P26367). The structure of the PUB domain alone has 
been solved (2CCQ), and the catalytic domain has been solved bound to RAD23 (2F4M). In the 
paper that published the PUB domain structure (Allen et al JBC 2006, 
10.1074/jbc.M601173200), mutational analysis was also done to show that there is an 
interface on the PUB domain that binds the AAA ATPase domain of p97, but the experimental 
evidence does not look very convincing. AF modelled the peptide from PAX6 to bind to an 
interface adjacent to the one found by Allen et al. There is indeed a hydrophobic pocket, 
and the best 4 predictions place that peptide in this pocket; however, which 
hydrophobic residue of the peptide is docked into the pocket varies depending on the length 
of the peptide. I think that this region in PAX6 could indeed be a linear motif; it is adjacent to 
the homeobox domain but I don’t think that it is part of the homeobox domain. 
 
run21: PAX6-ESRRG 
Many short fragments with high model confidence are scattered over the disordered 
region. The binding pocket on ESRRG is in the hormone receptor domain and is a known 
binding pocket for L..LL motifs (ELMDB: LIG_NRBOX). 
According to ELMDB, the first and last L go into a hydrophobic pocket and all fragments 
of PAX6 with high model confidence have more or less two hydrophobic amino acids with 
three residues in between: PAX6 319-329: DTALTNTYSA, PAX6 203-213: RLQLKRKLQR, 
PAX6 374-384: PPHMQTHMNS, PAX6 198-208: DEAQMRLQLK, PAX6 128-148: 
GADGMYDKLR. 
Looking at structures of ESRRG with two different bound peptides (1KV6 and 1TFC: 
NCOA1 686-700: RHKILHRLLQEGSPS; 2GPO and 2GPP: NRIP1 378-387: SLLLHLLKSQ), 
it furthermore became apparent that the hydrophobic residues right before both leucines are 
also important for binding since they contact a hydrophobic patch on the other side of the 
pocket. However, none of the AlphaFold-predicted motifs really fit, so it is questionable 
whether they can actually bind the pocket. 
Structurally speaking, the peptide does not fit that nicely in the hydrophobic pocket. In 
2GPO and 2GPV, there is a triad of hydrophobic residues (V/L/I) making contact with the 
hydrophobic pocket on the domain but here only 2 residues are making contact. Therefore, it 
seems doubtful to me that this is a motif that can bind to the domain. 
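
The comparison against the L..LL pattern can be made explicit with a regular expression. This is a sketch only: it uses the simplified pattern "L..LL" as written in these notes, not the full regular expression of the LIG_NRBOX class in ELM, and the helper name `has_nrbox_core` is hypothetical. The peptide sequences are copied from the text above.

```python
import re

# Simplified core pattern as written in the notes; the ELM class definition
# for LIG_NRBOX may be more restrictive than this.
NRBOX_CORE = re.compile(r"L..LL")

# Peptides bound to ESRRG in solved structures, as listed in the text.
known_binders = {
    "NCOA1 686-700": "RHKILHRLLQEGSPS",
    "NRIP1 378-387": "SLLLHLLKSQ",
}

# PAX6 fragments with high model confidence, as listed in the text.
predicted_pax6 = {
    "PAX6 319-329": "DTALTNTYSA",
    "PAX6 203-213": "RLQLKRKLQR",
    "PAX6 374-384": "PPHMQTHMNS",
    "PAX6 198-208": "DEAQMRLQLK",
    "PAX6 128-148": "GADGMYDKLR",
}


def has_nrbox_core(sequence: str) -> bool:
    """True if the sequence contains the simplified L..LL core."""
    return NRBOX_CORE.search(sequence) is not None


for name, seq in {**known_binders, **predicted_pax6}.items():
    print(f"{name}\t{seq}\t{has_nrbox_core(seq)}")
```

With this simplified pattern, both structurally characterised peptides contain the core while none of the listed PAX6 fragments do, which is in line with the doubts expressed above.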
 
run22: PAX6-QRICH1 
Difficult to dig deeper because QRICH1 has only one domain (DUF), which binds to a C-terminal 
peptide from PAX6. The high-confidence peptide is 20 aa long and looks good, with 0.88 model 
confidence.  
The same DUF is also modelled with 0.76 confidence with a very long disordered 
region (85 aa) at the N terminus of PAX6. However, the predicted complex with this 
disordered region is quite odd, as it has many twists and turns that seem weird to me. 
Overall, these predictions look good but it’s hard to be certain because 
nothing is known about the domain in QRICH1 and PAX6 has a long disordered C-terminal 
region full of S and T, but also some Ps and hydrophobics. 
 
run23: PAX6-KCTD7 
The top prediction involves the disordered region of PAX6 (198-208) and the BTB_2 domain of 
KCTD7, with 0.74 model confidence. There is no trend of increasing confidence when fragments 
shorten. InterPro describes this domain as one that multimerises for its protein function, e.g. 
KCTD1 as a transcriptional repressor (3DRX solves KCTD5, which has a similar fold but is shorter 
in length). Since the BTB domain mediates the multimerisation of KCTD, it could be that it requires 
a certain stoichiometry for binding to its partner. In the HuRI database, KCTD7 was indeed 
detected to interact with itself. The two highest-ranked models put both peptides into the 
same pocket, and both peptides have some sequence similarity although they come from different 
regions in PAX6. These peptides were also predicted with high model confidence in other runs. Based 
on the structure 5FTA, BTB domains in their homodimerized form do expose the surface 
predicted in the top prediction. Therefore, the surface predicted to bind to the peptide would 
be available. Taken together, the prediction looks plausible. 
 
run24: TTC19-FH 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
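
The cutoff stated above can be written as a small filter. This is a minimal sketch, not the pipeline used for these runs: the `Prediction` class and its field names are hypothetical, and combining model confidence and interface pLDDT with a logical OR is an assumption based on the wording "model confidence or average interface pLDDT above cutoff".

```python
from dataclasses import dataclass
from typing import List

MODEL_CONFIDENCE_CUTOFF = 0.7
DISORDERED_INTF_PLDDT_CUTOFF = 70.0  # ordered-disordered predictions
ORDERED_INTF_PLDDT_CUTOFF = 75.0     # ordered-ordered predictions (intf_avg_plddt)


@dataclass
class Prediction:
    # Hypothetical container; field names are illustrative only.
    kind: str                # "ordered-disordered" or "ordered-ordered"
    model_confidence: float
    interface_plddt: float   # disordered-fragment interface pLDDT or intf_avg_plddt


def passes_cutoff(pred: Prediction) -> bool:
    """Assumed rule: inspect if either metric reaches its cutoff."""
    if pred.kind == "ordered-disordered":
        plddt_cutoff = DISORDERED_INTF_PLDDT_CUTOFF
    else:
        plddt_cutoff = ORDERED_INTF_PLDDT_CUTOFF
    return (
        pred.model_confidence >= MODEL_CONFIDENCE_CUTOFF
        or pred.interface_plddt >= plddt_cutoff
    )


def run_needs_inspection(predictions: List[Prediction]) -> bool:
    """A run is inspected if at least one of its predictions passes the cutoff."""
    return any(passes_cutoff(p) for p in predictions)


# Hypothetical run in which nothing reaches the cutoff.
example_run = [
    Prediction("ordered-disordered", 0.52, 61.0),
    Prediction("ordered-ordered", 0.66, 71.0),
]
print(run_needs_inspection(example_run))
```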
 
run25: PEX3-PEX16 
PEX3 and PEX16 are two proteins that seem to cooperate to help insert new peroxisome 
membrane proteins (PMPs) into the peroxisome membrane. They do so via the interaction 
between PEX3 and PEX19. PEX19 brings the PMPs to the peroxisome, where PEX3 and 
PEX16 sit and then mediate further insertion of the cargo (this is described in the review by Smith 
and Aitchison 2013 Nature Rev Mol Cell Biol, Fig 2). However, there is also a study that 
describes how PEX16 localizes to the ER and from there traffics to peroxisomes (Kim et al JCB 
2006). The structure 3AJB, which was solved for the interaction between PEX3 and PEX19, 
was published by Sato et al EMBO J 2010 and describes how an N-terminal SLiM in PEX19 
binds to the domain of PEX3. They tried to crystallize the whole PEX3 protein but only 
observed residues 52-368. The domain has the exact same fold as predicted by AF. The 
predicted cytosolic and peroxisomal localization of protein regions and the two TM helices 
shown in UniProt seem to be wrong for PEX3 according to work cited in Sato et al. They 
summarize that the N-terminal region of PEX3 contains a targeting signal or anchor for PEX3 
to the peroxisomal membrane, followed by the domain that is located in the cytosol. No 
structure has been solved yet for PEX16 but it seems likely that the prediction of two TM 
helices shown in UniProt for this protein is also wrong. AF predicts a globular domain 
that contains the two TM helices and has a nicely exposed loop carrying the putative SLiM 
that AF predicted to bind to PEX3. It binds to PEX3 on the side opposite to the PEX19 binding site, 
so PEX19 and PEX16 could bind simultaneously to PEX3. Further work on these interactions 
could be done by submitting the three protein sequences to AF to see what it does. Another 
study observed an interaction between PEX3 and PEX16 according to UniProt, but the interface 
does not seem to have been looked at before, nor has the interaction been studied in detail. All the 
fragments that contain the putative SLiM in PEX16 are predicted to bind to PEX3 in the exact same 
way, always anchored via a conserved region in PEX16 between residues 160 
and 190. Interestingly, the most conserved residues are also those that seem most important 
for binding. This smells really good. 
 
run26: PEX3-PEX19 
This is a positive control interaction since the structure has been solved for this PPI (3AJB) 
and it is a well-known and well-studied PPI with an entry in the ELM DB: LIG_Pex3_1 
(L..LL...L..F). This ELM instance is indeed predicted by AF with the highest model confidence. 
Another peptide from PEX19 121-141, FTSCLKETLSGL, scored an equally high model 
confidence. It could be that this other predicted binding site is also true but I believe that it is 
rather an artefact of AF’s insensitivity to mutations. 
 
run27: GABARAPL2-UBA5 
The structure 6H8C shows binding of the GABARAPL2 domain to the LIR motif in UBA5. This motif 
is not listed on the ELM website for LIG_LIR_Gen_1 because it does not quite fit the regular 
expression, which seems to be defined too narrowly. AlphaFold correctly predicts this interface 
but only as third highest based on model confidence, just hitting the cutoff of 0.7, while using 
chainB_intf_avg_plddt it scores as fourth best prediction, far below the cutoff (67). However, 
AF recurrently finds peptides including the motif, following each other when ranked by model 
confidence or pLDDT. The top three motifs predicted to bind to GABARAPL2 do not find 
the hydrophobic pocket that is filled by a key large hydrophobic residue in the motif, and these 
peptides are also not recurrently predicted. So, I think these are wrong predictions. 
 
run28: GABARAPL2-LZTR1 
GABARAPL2 (P60520) has an Atg8 domain that is known to bind motifs (LIG_LIR_Gen_1). The 
domain is modelled with high confidence to bind to different disordered fragments of 
interacting partner LZTR1 (Q8N653). The second most confident model (when ranked by model 
confidence) has an aromatic residue tucked into a deep pocket and a branched aliphatic 
residue tucked into another, shallow pocket. The most confident model shows a nice 
increasing trend in model confidence as fragments get shorter, with the shortest one getting 
the highest confidence, but it does not have an aromatic residue fitting into the deep pocket 
as is known for LIG_LIR motifs. 
Looking at the structure 2LUE, the second top model, LZTR1 46-52 GPFETVH, looks 
more similar in sidechain positioning to 2LUE. Residues highlighted in bold get 
tucked into the mentioned pockets. This model seems more likely to be true than the best 
model. However, it is also predicted to bind in reverse order compared to structure 3WIM. 
 
run29: CUL3-KCTD7 
There is an ordered-ordered prediction with quite high confidence (0.66) but the contact interface 
is a tetramerization domain from KCTD7. Therefore it seems unlikely that it is a functional 
interface. 
Two N-terminal disordered fragments from KCTD7 have > 0.7 model confidence when 
paired with the Cullin domain of CUL3. These two fragments are modelled to bind at the 
same site of the Cullin domain (the site where RING proteins bind, 1LDJ). In the case of 1LDJ, 
the RING protein has a long disordered region inserted into the Cullin domain of CUL1, burying 
a series of hydrophobic residues of the long disordered region. However, the same binding 
site of the Cullin domain of CUL3 is a bit different, with more surface exposed than in CUL1. In 
this case, the contacts modelled for KCTD7 16-26, with a triple serine making contact with the 
Cullin domain, look plausible. The other high-confidence peptide, KCTD7 1-11, with a triple 
valine making contact with the Cullin domain, also looks plausible to me. 
In the structure 1LDJ it is really amazing how the partner protein interacts with CUL1 
via beta-sheet augmentation and how this extra beta strand becomes part of the integral fold; 
it sits more or less in the middle of the domain. I think AlphaFold senses that something is missing 
and tries to put a peptide there, but the overall conformation of the domain is also different 
in places, so the predicted peptide does not sit at the same position as the one shown in 
1LDJ. AlphaFold predicts two different motifs of very different sequence from the N-terminus 
of KCTD7 to bind there. Given how different the sequences are, this is another 
point that calls the specificity of these predictions into question. 
 
run30: PNKP-SYP 
Top prediction is a disordered fragment from SYP (7-19) paired with the kinase domain of 
PNKP. The binding surface is different from the nucleotide binding surface (1RC8). This 
binding interface looks plausible. It was later found, based on published structures, that the 
kinase and phosphatase domains form a structural unit. The run was therefore modified to use the 
kinase and phosphatase domains together as one ordered region for prediction with disordered 
fragments of SYP. 
The rerun with a fragment comprising the phosphatase and kinase domains 
resulted in one prediction that makes the cutoff. This prediction puts a motif from SYP into the 
DNA binding pocket of the kinase domain (according to Bernstein et al Mol Cell 2005, 1RC8).  
There is another prediction docking a peptide from SYP into the FHA domain of PNKP. 
It puts it where FHA domains bind their phosphorylated peptides but the SYP peptide has no 
Ser or Thr. 
 
run31: PNKP-TRIM37 
The first prediction involving the combined kinase-phosphatase structure puts a peptide of 
TRIM37 into the binding pocket where the phosphatase domain would bind single stranded 
DNA. 
Following up is a prediction that involves a disordered region in PNKP binding to the 
surface of the MATH domain of TRIM37 where MATH domain-binding peptides generally bind. 
The PNKP peptide differs slightly in sequence from regular expression patterns described for 
MATH domains in the ELM database. This peptide in PNKP has a known phosphorylation site 
that stabilizes PNKP protein levels, making the peptide very interesting since this suggests a 
regulatory role of phosphorylation on the peptide. 
There is a second peptide of PNKP predicted to bind to the MATH domain also with 
high confidence but the sequence is quite different from the first one and very close to the 
phosphatase domain. There is also a prediction where the FHA domain of PNKP is predicted 
to bind to a peptide of TRIM37 but the peptide looks very different from known FHA-binding 
motifs (peptide with phosphorylated threonines), which is of course difficult to predict for AF. 
 
run32: PNKP-XRCC4 
For XRCC4 and PNKP, there is a peptide from XRCC4 that binds to the phosphatase 
domain with high confidence. However, I am not sure if this is right because it could be a false 
prediction of a small peptide easily fitting into the catalytic site of the phosphatase domain. 
There is a serine in the peptide, so it is possible that this is where the phosphate group gets 
cleaved off by the phosphatase domain. On further checking, it was found that XRCC4 is known 
to bind to PNKP via a phosphorylated motif that binds to the FHA domain in PNKP. 
In principle, it would be better to make a rerun where the kinase and phosphatase 
domains are taken as one fragment since they form one structural unit, but I think in this case it 
would not have changed anything. The best prediction puts a peptide from XRCC4 into the 
pocket of the phosphatase domain where it would bind the single-stranded DNA as seen in 
3U7G. Among the first 9 predictions, AF puts 7 different peptides from XRCC4 into the 
phosphatase domain; the others go to the kinase domain. The first prediction that involves the FHA 
domain of PNKP and contains the FHA-binding motif in the sequence fragment of XRCC4 has 
a confidence score of 60 and does not put the FHA-binding motif in the pocket but another, 
negatively charged peptide in the sequence (the FHA pocket is very positively charged). The 
correct prediction, where AF puts the FHA-binding motif in the right pocket, has a confidence 
score of 0.58. 
 
run33: TNPO3-GCH1 
Top prediction involves the disordered region of GCH1 (16-26) and the superhelical structure 
of TNPO3, with model confidence 0.71. TNPO3 (transportin) is known to transport cargo 
into the nucleus and to release the cargo via the competitive binding of GTP-bound Ran (2X19), 
and the peptides from GCH1 are modelled at a binding site near where Ran binds in 2X19. 
The modelled position of the peptides is therefore biologically sound. The binding site of the 
peptides from GCH1 is also lined with many arginines, making it very positively charged. The 
contact modelled by AF in the top prediction looks good, with many charge-charge interactions 
at the interface. The N terminus of GCH1 has many prolines that are conserved, with three 
repeats of PAEK or PEAK and two repeats of PPRP. 
 
run34: TNPO3-CAMK2G 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
 
run35: GNAI3-GPSM2 
This interaction has been structurally solved (4G5S) and AlphaFold predicted the interface 
100% accurately. GPSM2 has multiple GoLoco motifs that AlphaFold predicts individually with 
high confidence to bind in the pocket on GNAI3. 
 
run36: SYT1-MIP 
Both are transmembrane proteins. The top prediction involves the linker between the two C2 
domains of SYT1 and the MIP domain of MIP. The MIP domain is also known as the aquaporin domain 
(transmembrane). However, when the linker is fragmented, it receives lower confidence. I think 
this is unlikely to be the interaction interface. The linker could be a motif for some other 
interaction because of its moderately high pLDDT. There is a structure of a homodimer of SYT1, 
2R83, that shows that both C2 domains of one chain actually interact with each other 
and that the linker between both domains interacts with one domain. It is from this linker that AF 
predicts a peptide to bind to the porin domain of MIP. Interestingly, AF predicts the 
two C2 domains to be independent of each other in the monomeric structure of SYT1, so 
either AF is wrong or crystallization introduced the packing of both domains against each other. 
I would rather believe the X-ray structure, and in this case the peptide would not be 
accessible to bind to the MIP domain. 
 
run37: FTSJ1-CERT1 
The FTSJ domain of FTSJ1 is known to bind S-adenosylmethionine (see structure 1EJ0). 
The top predictions all look very different in that different or partially overlapping 
regions of CERT1 are docked into different sites on the FTSJ domain. Sometimes the 
peptide is docked into the catalytic pocket where the protein methylates adenosines on tRNAs 
but the peptide is also docked elsewhere. Because of these ambiguities, I believe that the 
predictions are questionable since they seem to lack specificity, but I don’t think we can call 
them definitely wrong.  
Another interface was found involving CERT1 368-388, with model confidence 0.70. 
However, the contacts modelled are mostly backbone-to-backbone. I have previously noticed 
that AF tends to give higher confidence to complexes modelled with secondary structure. So I 
think this is also likely a false interface. 
 
run38: CAMK2A-SOX5 
The kinase domain of CAMK2A paired with a disordered fragment was predicted with high confidence. 
The structure predicted by the highest confidence model is weird, with both beta sheet and 
helix structure. 
The kinase domain of CAMK2A is likely a serine/threonine kinase, and in kinase domain 
predictions one has to be careful with the two lobes that bind substrate and ATP. It might be 
interesting to check other high-scoring peptides to see if they have an S/T that can be 
phosphorylated and to check the crystal structure to find the substrate binding pocket. The two 
highest scoring peptides do not look convincing because the first one has no S/T in the peptide 
but is fitted into the catalytic cleft, while the second one has the sidechain of a T positioned out 
of the cleft. The third highest scoring peptide (P35711 131-141) looks nice because it positions 
the sidechain of an S into the catalytic cleft. 
The highest ranking peptides essentially come from all over SOX5 and I don’t 
think that AF can predict kinase-substrate interactions very well. Overall, none of the high-scoring 
predictions look very convincing. 
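
The check suggested above, whether a candidate substrate peptide contains an S/T at all, is easy to script. The following is a small sketch with a hypothetical helper and hypothetical peptide sequences; it only lists possible phosphoacceptor residues with their protein numbering and says nothing about whether a residue actually sits in the catalytic cleft of the model.

```python
from typing import List, Tuple


def phosphoacceptor_positions(peptide: str, start: int = 1) -> List[Tuple[int, str]]:
    """Return (residue_number, residue) for every Ser/Thr in the peptide.

    start is the residue number of the first peptide residue in the full protein.
    """
    return [(start + i, aa) for i, aa in enumerate(peptide) if aa in "ST"]


# Hypothetical peptides for illustration; not taken from the SOX5 runs.
candidates = {
    "peptide_without_ST": ("AKRELLPQGAV", 40),
    "peptide_with_ST": ("AKRSLTPQGAV", 131),
}

for name, (sequence, first_residue) in candidates.items():
    sites = phosphoacceptor_positions(sequence, first_residue)
    print(name, sites if sites else "no S/T, unlikely kinase substrate site")
```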
 
run39: CAMK2A-CAMK2G 
There are many high confidence predictions involving different regions in the protein pair. Among 
them, one ordered-ordered interface gives a really high confidence. The interface is a known DDI in 
3did with a high z-score. The structure 3SOA only shows one CAMK2A monomer but the 
publication talks about a dodecamer for which one can download a model from the PDB as 
well. Looking at this dodecamer and the paper, it becomes clear that downstream of the kinase 
domain there is another domain referred to as hub domain in the paper which mediates 
oligomerization, together with the linker between the kinase and hub domain. The best AF 
prediction for the interaction between CAMK2A and CAMK2G involves both hub domains and 
is an accurate prediction of the interface seen in the dodecamer. 
The second best prediction made by AF involves the hub domain and a bit of the linker 
sequence from the other partner. Looking at the dodecamer, one can see that where the 
peptide is predicted to bind on the hub domain is part of the linker sequence bound from the 
same monomer, so an intra-molecular interaction. So, there is indeed some binding site but 
not for inter-molecular interaction. Because the linker sequences are different in the structure 
and canonical uniprot sequence it is very difficult to know which part of the linker is binding on 
the hub domain and whether this corresponds to the bit of the linker sequence predicted by 
AF to bind there. In the paper accompanying the 3SOA structure they also investigate how 
different linker sequences from different isoforms influence Ca-binding site accessibility and 
thus activation of the complex. There is evidence from co-IP experiments in 3 other studies 
that CAMK2G and CAMK2A interact with each other, but these were large-scale studies. 
It is likely that no one has studied the interface between CAMK2G and CAMK2A, and thus 
this would be something new. 
 
run40: ACTB-ACTG1 
The two actin proteins are predicted to form a DDI with high confidence. The interface 
predicted by AlphaFold looks very interesting; it indeed looks like a polymerization interface 
because both domains interact with opposite sites. Interactome3D would model this interaction 
with the structure 4JHD as a template, but that one looks quite different: it is not the same 
interface and, according to the authors, needs a third protein for polymerization. Digging deeper 
in PDB for structures of ACTB, I found structure 6ANU, which shows the same interface that 
AF predicted between ACTB and ACTG1, so the interface is probably right. 
This is also a very interesting case. Based on the review by Vedula and Kashina (J Cell 
Sci 2018, 10.1242/jcs.215509), it is still an open question whether the different actin 
forms that exist in human can form heteropolymers or not. Some studies find this in vitro, others 
find intermingled homopolymers of beta and gamma actin. Both actins co-occur in many cell 
types, while alpha-actin is more specifically expressed in muscle. It seems really tricky to 
resolve this, since actins are highly studied and also very similar in sequence: it 
could be that in a somewhat artificial system beta and gamma actin can interact because the 
interface residues are identical, while in vivo they would rather form homopolymers. In the end, 
whether ACTB and ACTG1 indeed interact in vivo is the only open question. 
 
run41: RARB-PSMC5 
PSMC5 has been repeatedly modelled by AF to have a high confidence peptide that binds to 
partners with Hormone_recep domain. The peptide is 132-141 DPLVSLMMVE. Residues 
highlighted in bold are the ones tucked into the hydrophobic pocket. However, this peptide 
does not match the consensus of LIG_NRBox (^PL..LL^P); in particular, P precedes the first L 
in this peptide. I am not sure why P is disallowed at the first position, as ELM does not 
describe much about the sequence composition of the motif. I think it might be too early to 
reject this peptide because the highlighted residues are indeed hydrophobic and can serve 
similar functions as those in the regex. 
I also looked at the HuRI network of PSMC5 and found that its interactors seem to be 
enriched for the Hormone_recep domain, making this interface even more plausible. 
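
The consensus mismatch noted for DPLVSLMMVE can be checked mechanically. The following is a minimal sketch, assuming that the shorthand ^PL..LL^P stands for the regular expression [^P]L..LL[^P]; this is one reading of the notation, and the ELM entry should be consulted for the exact pattern. The helper name and the second peptide are hypothetical and only serve as a contrast.

```python
import re

# Assumption: "^PL..LL^P" is read as "[^P]L..LL[^P]" (no proline directly
# before the first L or directly after the LL pair). Not taken from ELM.
NRBOX_CONSENSUS = re.compile(r"[^P]L..LL[^P]")


def matches_nrbox(peptide: str) -> bool:
    """Return True if the peptide contains the assumed NRBox consensus."""
    return NRBOX_CONSENSUS.search(peptide) is not None


# PSMC5 132-141, sequence as given in the notes.
print("DPLVSLMMVE", matches_nrbox("DPLVSLMMVE"))
# Hypothetical consensus-conforming peptide, for contrast only.
print("AILHRLLQE", matches_nrbox("AILHRLLQE"))
```

A strict match of this kind only reports agreement with the consensus; it says nothing about whether other hydrophobic residues could take over the role of the leucines, which is the argument made above for not rejecting the peptide.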
 
run42: DCX-BICD2 
DCX has two DCX domains and all good predictions involve the N-terminal DCX domain. The 
N-terminal DCX domain is known to bind tubulin. AF modelled a different interface on the 
N-terminal DCX domain to bind to disordered fragments from BICD2.  
The DCX domains have a C-terminal part that is not confidently predicted by AF to be 
part of the fold. When excluding this part from the first DCX domain, AF models peptides to 
bind to the area where this last part is predicted to be located in the monomeric structure from 
AF. When we use a DCX domain that contains this last bit, then AF predicts other peptides 
from BICD2 to bind on the opposite side of DCX. There is no consistency in these predictions.  
No other predictions, either ordered-ordered or with disordered fragments binding to ordered 
domains in BICD2, make the cutoffs. However, BICD2 consists only of large helices. 
Nonetheless, it could be that both DCX domains together bind to one of these 
coiled-coil helices in BICD2. 
 
run43: DCX-ZBTB10 
A possible prediction involves the first DCX domain of DCX and a peptide of ZBTB10 261-271. 
This prediction is not influenced by the actual domain boundaries because the peptide is not 
docked into the pocket where a region a little C-terminal of the domain might bind to. This is 
the case for the second best prediction involving the first DCX domain and peptide 604-614. 
According to chainB_intf_avg_plddt, these are the only two predictions that make the cutoff 
when looking at chainB as a disordered region. ZBTB10 has a lot of disorder and probably many 
motifs. DCX has two DCX domains and a bit of disorder. According to the available PDB 
structures, the DCX domains are known to bind to microtubules. There is one structure with the first 
DCX domain bound to microtubules (6RFD). It seems though that the pocket where ZBTB10 
261-271 is predicted to bind is not occupied in this complex. AF does not predict slightly 
extended versions of this peptide with reasonable confidence to bind to this pocket. 
A peptide was also predicted to bind in beta-sheet augmentation to the last beta strand 
of the BTB domain with reasonable model confidence and chainA_intf_avg_plddt scores but 
the ZBTB10 model might have its own beta strand C-terminal of the current domain 
boundaries that AF predicted to complement the last beta strand of the domain as predicted 
in the full length model of ZBTB10. 
 AF also predicts a contact between the ZnF domain of ZBTB10 and the first DCX 
domain but it does not look very likely and I think the ZnF fold is perturbed. 
 
run44: PSMC5-ESRRG 
The interaction has quite some high confidence predictions. The highest scoring peptide is 
P62195, PSMC5, 132-141, DPLVSLMMVE. The three hydrophobic residues make nice 
contacts with the hydrophobic pocket and surface of the domain. Another disordered fragment 
from PSMC5 binding to the same domain, IKKLWK, also looks promising. However, there is 
some possibility that these are artefacts because AF is not very specific when it comes to 
detecting single mutations in known motifs. Unfortunately, the sequence alignments are not 
helpful because the whole of PSMC5 is highly conserved. 
 Nonetheless, the interaction between PSMC5 and ESRRG looks promising because an 
alternative name of PSMC5 is thyroid hormone receptor-interacting protein 1, TRIP1. 
 
run45: PSMC5-RORB 
The highest confidence prediction involves a disordered fragment from PSMC5 and it is the 
same as run44. The ordered region from RORB is the same domain, hormone receptor 
domain, as run44.  
It is interesting to see AF predicting similar DMI with high confidence from two different 
proteins. Same observation as run44. 
 
run46: WAC-NFE2L2 
WAC and NFE2L2 are largely disordered. WAC has a WW domain. AF recurrently predicts a 
sequence close to the N-terminus of NFE2L2 to bind to the WW domain. WW domains are known to 
bind proline-rich motifs, but the putative motif in NFE2L2 does not contain prolines and is not 
docked onto the WW domain in the way seen for other WW domains, e.g. 1EG4. These are likely 
wrong predictions. While the motif interface pLDDT is reasonably high for these predictions, 
the model confidence does not reach 0.6. There are no other predictions that make the 
cutoff. 
 
run47: WAC-MOBP 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
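
The cutoffs quoted above can be written down as a small filter. The following is a minimal sketch, assuming that a prediction qualifies for inspection when either metric reaches its cutoff (one reading of "model confidence or average interface pLDDT above cutoff"). The Prediction class, its field names, and the example values are hypothetical and are not taken from the actual pipeline.

```python
from dataclasses import dataclass
from typing import Iterable

MODEL_CONFIDENCE_CUTOFF = 0.7   # model confidence >= 0.7
MOTIF_PLDDT_CUTOFF = 70.0       # ordered-disordered: fragment interface pLDDT >= 70
DOMAIN_PLDDT_CUTOFF = 75.0      # ordered-ordered: intf_avg_plddt >= 75


@dataclass
class Prediction:
    """Hypothetical record for one fragment-pair prediction."""
    kind: str                # "ordered-disordered" or "ordered-ordered"
    model_confidence: float
    interface_plddt: float


def passes_cutoff(pred: Prediction) -> bool:
    """Assumed rule: either metric reaching its cutoff is sufficient."""
    if pred.kind == "ordered-disordered":
        plddt_cutoff = MOTIF_PLDDT_CUTOFF
    elif pred.kind == "ordered-ordered":
        plddt_cutoff = DOMAIN_PLDDT_CUTOFF
    else:
        raise ValueError(f"unknown prediction kind: {pred.kind}")
    return (pred.model_confidence >= MODEL_CONFIDENCE_CUTOFF
            or pred.interface_plddt >= plddt_cutoff)


def run_needs_inspection(predictions: Iterable[Prediction]) -> bool:
    """A run is inspected if at least one prediction passes."""
    return any(passes_cutoff(p) for p in predictions)


# Made-up values: neither prediction reaches a cutoff, so the run is skipped.
example_run = [
    Prediction("ordered-disordered", 0.55, 62.0),
    Prediction("ordered-ordered", 0.41, 70.0),
]
print(run_needs_inspection(example_run))
```

If both metrics were required instead, the `or` in `passes_cutoff` would become `and`; the notes for individual runs suggest that predictions reaching only one of the two cutoffs were still looked at.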
 
run48: STX1B-FBXO28 
The top model has 0.76 model confidence and uses the disordered region 1-22 of STX1B 
and the Fbox + helix bundle domain (63-221) of FBXO28. The interface involves the disordered 
region of STX1B forming a 3₁₀ helix structure with the helices from the Fbox domain. Note 
that the Fbox domain annotated by InterPro is from 61-109, while the ordered region that I 
used for prediction is 63-221. The Fbox domain is known to mediate PPI but it is not used by 
AF to model the interaction in this prediction. Region 1-22 of STX1B is conserved only in 
recent homologs. The plddt of the disordered region is low, <60 for all residues. 
The second top model has 0.75 model confidence and involves the syntaxin domain 
(23-237) of STX1B and the disordered region 354-368 of FBXO28. The disordered region of 
FBXO28 is at the C terminus and conserved. However, the plddt of the peptide is low and it 
adopts a 3₁₀-helix-like structure. A slightly different prediction involving fragments of the 
proteins (27-219 STX1B and 345-363 FBXO28) returned 0.73 model confidence. The peptide 
adopts a helical structure but is placed on a different surface of the syntaxin domain. Although 
the peptide 345-363 has good plddt (mostly >60), I am not sure if this is the right interface. One 
prediction pairs the full length of STX1B with the disordered region 354-368 of FBXO28 and 
returned 0.71 model confidence. The interface is similar to that of the syntaxin domain 
(23-237) of STX1B and disordered region 354-368 of FBXO28, with low plddt. The region 
354-368 in FBXO28 could be a nuclear localization signal (NLS), where ELMDB also predicts 
quite a few NLS, and is therefore unlikely to be the interface for the interaction. 
Next top prediction has 0.749 model confidence that involves the C terminus of the 
syntaxin domain (220-232) of STX1B as disordered region and the Fbox + helices domain of 
FBXO28 (63-221). The interface is formed by the peptide adopting a helical structure with the 
Fbox + helices domain. The plddt of the peptide is good, with all residues above 60 plddt. 
Nonetheless, another prediction involving a slightly longer peptide from the same region of 
STX1B has a much lower model confidence (0.55). The interface modelled is not exactly the 
same, as it is a little bit shifted. I am unsure whether this is a good interface. 
I tried to find more molecular studies on the two proteins but could not find much. STX1B 
is known to function in the docking of synaptic vesicles at presynaptic active zones, while FBXO28 
probably recognizes and binds to some phosphorylated proteins and promotes their 
ubiquitination and degradation. Oddly, STX1B is known to localize to the membrane, while 
there is not much information on the subcellular localization of FBXO28; studies have shown, 
however, that it interacts with topoisomerase using its Fbox domain (the bundled helices are 
not needed for the interaction). Out of all the predictions, I think STX1B 27-219 + FBXO28 
345-363 and STX1B 220-232 + FBXO28 63-221 are most likely to be the interface, as their 
peptides are modelled with good plddt and both achieved model confidence higher than 0.7. 
 
run49: STX1B-MMGT1 
Top prediction involves the Syntaxin domain of STX1B and the disordered region of MMGT1 
(23-31) with confidence 0.73. A slightly longer fragment has a slightly lower confidence but 
looking at the structure, the two peptides have different angles to the Syntaxin helical bundle. 
Since the interfaces modelled by AF differ a lot despite using the same peptide and its 
extended counterpart, the modelled interfaces do not look genuine. 
 
run50: STX1B-VAMP2 
Interactome3D models an interface between both proteins based on the structure 3HD7/3IDP 
where STX1A interacts with VAMP2. STX1A and STX1B are very similar in structure. 
STX1B is predicted in closed conformation, which we know because structures exist 
of STX1A bound to Munc18 where it is in this closed conformation with the long C-terminal 
helix comprising the SNARE domain folding back onto the syntaxin domain. However, when 
bound to VAMP2 we can see the open conformation where the long helix is made available 
to bind in coiled-coil like manner to VAMP2 and SNAP25 helices.  
Based on this available structural information we designed different fragments of the 
extended SNARE domain of variable length. VAMP2 is a short protein of 116 residues 
consisting of a long helix and about 30 disordered residues at the N-terminus. The most 
confident prediction obtained for these fragments is the one modelling a coiled-coil interaction 
between the extended SNARE domain and the helix of VAMP2, but the model confidence is 
slightly below the cutoff. Predictions with the disordered N-terminal region of VAMP2 remain 
far below the cutoffs. 
 
run51: CSNK2A1-CSNK2B 
Nice prediction with overlapping fragments showing increasing model confidence. This 
interface has been solved before in two structures: 4DGL and 6Q38; prediction is highly 
accurate, and is probably a DMI that is not in ELM yet. 
 
run52: EBF3-EBF2 
Dimerization of the EBF family is already known and solved (3MUJ). As its top prediction, AF 
predicts the middle domain of both proteins, called TIG, as the dimerization interface, but in 
head-to-tail orientation, while the structure 3MUJ shows head-to-head orientation. Following 
closely in terms of score (avg_intf_plddt) is the fragment comprising the TIG domain and the 
helix-loop-helix domain, which are predicted accurately as seen in the structure. 
 The third best prediction involves the N-terminal DNA binding domain as the 
dimerization interface. Does not look so convincing to me but still got a very high score. The 
fourth best prediction is the helix-loop-helix domain alone as dimerization interface, still with a 
score of 90. There are more predictions that make the cutoff that involve various disordered 
regions of either protein and ordered fragments from the other involving interfaces used for 
dimerization but I guess that these predictions are likely wrong. 
 
run53: PEX12-TREX1 
The disordered region of PEX12 215-312 (98 residues long) is predicted with high confidence. 
One fragment of it achieved even higher confidence, but when this fragment is fragmented 
further, the confidence is not as high anymore. According to InterPro, the domain involved 
is the exonuclease domain of TREX1, which binds to ssDNA (2OA8). This crystal 
structure shows that the pocket modelled by AF to bind PEX12 215-312 is bound to ssDNA, 
with the phosphodiester bond of the ssDNA interacting with the backbone of the domain 
chain and a hydrophobic side chain (leucine) making hydrophobic interactions with the 
base of the nucleotide. Interestingly, AF seems to have memorized this crystal structure 
because the bound ssDNA has a curved structure and AF also models the long disordered 
region with an odd curve. I think this interface is unlikely to be true because the bound 
magnesium ions coordinate the oxygen in the phosphodiester bond of the ssDNA, whereas the 
modelled helix places hydrophobic sidechains into the cavity where the magnesium ions bind. 
A very short fragment of PEX12 12-16 at the N terminus is modelled with high 
confidence with a very negatively charged pocket in the domain of TREX1. It is unusual to 
have a peptide binding pocket with such a high negative charge. Further checking revealed 
that this domain binds magnesium ion and nucleotides. The short fragment fits into the 
magnesium binding pocket and thus this is unlikely to be true. 
 
run54: PRKAR1A-PRKAR1B 
Best model is an ordered-ordered prediction with 0.83 confidence. It is a homo-DDI (RIIa 
domain) dimerization and has been solved in 2EZW. 
An additional disordered fragment (PRKAR1A 360-372) was predicted with high model 
confidence but low pLDDT to bind the cyclic nucleotide binding domain of PRKAR1B. 
According to an available structure of the cNMP binding domain (1NE4), there are two beta barrel 
folds in the domain that bind to cyclic nucleotides. AF fits the disordered fragment on a 
hydrophobic surface near the beta barrel but not in the cNMP binding pocket. Although this 
could be another binding site, the binding makes little sense to me because the disordered 
fragment is at the C terminus of the cNMP binding domain of PRKAR1A, meaning that the 
sequence would have to loop back to make this contact. As noted above, it seems 
very likely that the dimerization of the two proteins is mediated by the RIIa domain (N 
terminus), so it seems implausible to me that they make contact again at the C terminus. 
This is likely a false positive interface. 
 
run55: ASF1A-H4C8 
The interaction between both proteins has been solved (5C3I). However, this structure shows 
that the motif in H4 sits at the very C-terminus and binds in beta sheet augmentation to ASF1A, 
in the same pocket that AF predicted but where AF used an N-terminal peptide of H4. I think the 
problem is that the C-terminal region of H4 was made part of the domain of H4, which I agree 
was hard to see from the monomeric AF structure of full length H4. I checked further down 
in the predicted structures, but the first ordered-ordered prediction has a model confidence of 
0.25 and does not find this mode of binding either. One could rerun this by taking the 
C-terminal peptide of H4 as a disordered region just to see whether AF would then get it right, 
but in principle this is a false positive prediction; the N-terminal peptide also shares no 
sequence similarity with the C-terminal motif. 
 
run56: RARS1-CCDC115 
There is only one prediction that makes the cutoff for model confidence and/or motif pLDDT. 
This prediction involves RARS1 1-21 as a disordered fragment that is modelled to bind as a 
helix to the two helix coiled-coil domain of CCDC115. A shorter fragment of the motif is placed 
elsewhere. The helix of CCDC115 to which the peptide is predicted to bind has more 
hydrophobic residues along the helix on that side so I would think that a longer partner chain 
would be able to bind there. Thus, this interface does not seem likely to be true. 
 
run57: UBE3A-TAT 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
 
run58: VAMP4-MFF 
The top prediction involves two ordered regions that are both helical. Both proteins have only 
helical regions; the rest is disordered. Interestingly, although the top predicted interface has 
only 0.71 model confidence, both chains have very high plddt for their residues at the interface 
(95 for VAMP4 and 90 for MFF). Because of this high plddt, it could be a genuine interface. 
The helix in VAMP4 definitely has an interface there because one side is rather hydrophobic 
while the other side is rather hydrophilic. MFF could bind there with its helix or via another 
helix that it has. The binding does not show that many nice contacts, i.e. some hydrophobic 
residues on the VAMP4 helix still remain exposed. 
 
run59: PEX16-MMGT1 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
 
run60: PLP1-SLC16A2 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
 
run62: SNRPB-GIGYF1 
GIGYF1 is a very long protein with many disordered regions. It has a GYF domain that is 
known to bind proline-rich sequences. SNRPB has many proline-rich sequences in its C 
terminus. Some proline-rich motifs are predicted with high pLDDT to bind the GYF domain 
(these are the top predictions). 
Another highly ranked prediction involves the LSM domain of SNRPB with various 
disordered fragments from GIGYF1. However, checking InterPro entry as well as structures 
showing LSM domain, it seems like LSM domain is predominantly involved in multimerization 
with other SNRP proteins to form the SMN complex involved in splicing (1H64). Therefore, the 
models involving this domain with disordered fragments look unlikely to be true to me. 
Digging deeper into the top predictions, comparing the binding modelled by AF 
between SNRPB 231-240 and GYF domain of GIGYF1 with 1L2Z, the peptide is oriented 
differently. However, from 3FMA, one can see different ways a peptide binds to the same 
surface of GYF. In 3FMA chain E and P show a similar way of binding to that modelled by AF. 
The peptide sequence in 3FMA is also different from 1L2Z, but importantly, there are three 
prolines in the peptide that always orient the same to the hydrophobic surface formed by the 
GYF motif on the GYF domain. This orientation of the 3 prolines is captured by AF.  
AlphaFold repeatedly predicts the PPGM motif in the same pocket. This motif occurs 
multiple times in the C-ter tail of SNRPB. On the ELM website, the LIG_GYF motif is described 
to bind proline-rich sequences and they also cite the structure 1L2Z but they say that flanking 
positively charged residues seem to be important for binding to the GYF domain. Indeed, in 
the crystal structure there are some negatively charged residues on the GYF domain. 
Interestingly, the GYF domain from GIGYF1 lacks those or has them only partially. It also differs 
in that it has a deeper hydrophobic pocket which is filled with a Trp in the crystal structure. So, 
it could well be that the GYF domain from GIGYF1 binds somewhat different proline-rich 
peptides. The interaction between GIGYF1 and SNRPB has not been described before other 
than in HuRI. Functionally, it would be probably a new connection because GIGYF1 is not 
known to function in splicing as far as I can see and thought to be localized to the cytoplasm. 
GIGYF1 however, has also interacted with SNRPA and SNRPC in HuRI. They also have 1 or 
some more occurrences of the PPGM motif. If this mode of binding is true then it would be 
somewhat of a new mode of binding or in the most conservative case an extension of the 
known binding mode of LIG_GYF.  
Alignment of 1L2Z chain A (GYF domain) with the GYF domain from GIGYF1 (476-535) 
shows that the sequences are not very conserved. Structural superimposition of the two 
GYF domains reveals that the overall fold is conserved, including the majority of the binding 
pocket except for the hydrophobic pocket filled with a W. The peptides of the two structures 
have their PPPG in a similar orientation. Following this sequence is an M from SNRPB that is 
tucked into the hydrophobic pocket, and an H in 1L2Z that is exposed to the environment. The 
residue that follows is R in both, with the one in SNRPB exposed to the environment and 
possibly forming a hydrogen bond with the Q on the domain, and the one in CD2 (1L2Z) forming 
a salt bridge with an E from the domain.  
Later, a structure of the GYF domain of GIGYF1 bound to a similar motif found in TNRC6 
was published, further supporting the correctness of these predictions. 
 
run63: ARHGEF9-VEZF1 
Top prediction has 0.74 model confidence with the fragment from VEZF1 (375-385) making 
contact with the RhoGEF domain of ARHGEF9. The top predictions all put the peptide at the 
same binding site of the RhoGEF domain. In terms of conservation, all the peptides from 
VEZF1 are well conserved. Nonetheless, the prediction looks very questionable. At 
least it seems that the predictions do not make use of the GTP/GDP binding pocket. I did 
not find a structure that shows precisely where this pocket is located, but based on an abstract 
of an article and InterPro entries it seems to be between the two structural entities that form 
one larger domain, the GEF domain and the PH domain (IPR000219). There is absolutely no 
consistency in the two peptides from VEZF1 selected to bind to the same surface on the GEF 
domain of ARHGEF9. VEZF1 also seems to be a very unusual protein; AF has a hard time making 
sense of it. 
 
run64: MIP-MFF 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
 
run65: VEZF1-PRKAR1B 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
 
run66: VEZF1-KCTD7 
Top prediction involves the disordered region of VEZF1 (360-380) and the BTB domain of 
KCTD7. The disordered region overlaps with the top prediction in run63 that models the 
interface between VEZF1 (375-385) and the RhoGEF domain of ARHGEF9. Although AF 
models a 3₁₀ helix structure in the disordered region of VEZF1 (360-380), the contacts 
modelled at the interface do not look very convincing. It could be that the disordered region 
(360-385) is a functional motif for other interactions and AF detects that and tries to fit it into 
the domain. It could also be that forming the binding interface requires multiple copies of the 
BTB domain, which were not used in this prediction. The VEZF1 peptide is put in the same pocket as 
the PAX6 peptides from run23, but the sequences look different. It is, however, the same 
peptide in VEZF1 as in the prediction with ARHGEF9. 
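
The overlap between the two VEZF1 fragments (360-380 in this run, 375-385 in run63) is simple interval arithmetic on inclusive residue numbers. The following is a minimal sketch; the function names and the Fragment alias are hypothetical helpers introduced for illustration.

```python
from typing import Optional, Tuple

# (first residue, last residue), 1-based and inclusive, as written in the notes.
Fragment = Tuple[int, int]


def fragment_length(fragment: Fragment) -> int:
    """Number of residues in an inclusive residue range."""
    start, end = fragment
    return end - start + 1


def fragment_overlap(a: Fragment, b: Fragment) -> Optional[Fragment]:
    """Shared residue range of two fragments, or None if they do not overlap."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start <= end else None


shared = fragment_overlap((360, 380), (375, 385))
print(shared, fragment_length(shared))  # (375, 380) 6
```

The same length helper reproduces the fragment sizes quoted elsewhere in these notes, e.g. 98 residues for PEX12 215-312.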
 
run67: APTX-FLAD1 
Has overlapping fragments with increasing confidence: APTX, N terminus disordered region 
5-12 and 6-13, paired with MoCF_biosynth or a domain of unknown type (not matched to a 
Pfam or SMART domain) that is between the MoCF and PAPS_reduct domain of FLAD1. It 
also predicts the same N-terminal region of APTX into the PAPS_reduct domain. The 
disordered fragments from the region 8-15 of APTX showed high model confidence 
but a pLDDT score below the cutoff when modelled with the PAPS_reduct domain of FLAD1. 
Checking the structure of PAPS_reduct domain in complex with adenosine phosphosulfate 
shows that the peptide is modelled by AF to be in the binding pocket of adenosine 
phosphosulfate. This is likely a false prediction. 
For the N-terminal part of APTX AF is quite confident when it models it into the MoCF 
domain or the other unknown domain of FLAD1. There are multiple predictions with different 
overlapping fragments that make the cutoff. However, AF is more confident with both metrics 
when the peptide is modelled into the MoCF domain. This domain has a pretty substantial 
pocket that is actually in the monomeric structure of FLAD1 occupied by another region of 
FLAD1 with low pLDDT.  However, when APTX 10-15 is used for modelling, the orientation of 
the peptide is reversed. The MoCF_biosynth domain is known to trimerize for its activity and 
to bind molybdopterin, which it binds at a site close to where AF models the peptide (refer to 
1DI6, https://doi.org/10.1074/jbc.275.3.1814, which solves the structure of a bacterial protein 
with the same domain; the authors mention 49D and 82D as important for catalytic activity). 
APTX with the unknown domain of FLAD1 does not reach the model confidence cutoff, 
only the motif pLDDT cutoff. It puts the same peptide as beta-sheet augmentation to the 
domain while in the predictions for the MoCF domain, the peptide is put in helical conformation. 
The only predictions where disordered regions in FLAD1 are predicted to bind to folded 
regions in APTX involve the FHA domain of APTX and correspond to two completely different 
disordered regions in FLAD1. 
 
run68: FBXO28-PSMC3 
The top prediction is a coiled-coil interaction between regions from the two proteins that are 
modelled by AF monomer as long helices. The plddt of all residues is very high. This 
interaction looks convincing. The only problem is that one helix is shorter than the other, while 
in a common coiled-coil interaction both helices are usually equally long. 
The second best prediction based on model confidence involves a disordered region 
from FBXO28 (51-61). The modelled complex does not look convincing because the peptide 
is quite hydrophobic and the residues do not make much contact with the domain. The peptide 
is predicted to bind to the first domain of PSMC3 which as far as I was able to find, does not 
have catalytic activity. 
There are only these two predictions that make the cutoff for model confidence, none 
make the cutoff when looking for disordered regions in PSMC3 predicted to bind to FBXO28. 
The other way round there is the peptide mentioned above and a C-terminal disordered region 
of FBXO28 predicted to bind to the same first domain in PSMC3 but predicted to bind to a 
different side. The C-terminus of FBXO28 is very charged, maybe a localization signal. Both 
motifs in FBXO28 are somewhat recurrently predicted to bind to the domain in PSMC3. 
 
run69: CAMK2G-ESRRG 
Many high confidence predictions involve a disordered region of CAMK2G. The whole disordered 
region used as a fragment for prediction also returned high confidence (0.78). In this long 
disordered region, AF puts the peptide with the third highest model confidence in the domain 
pocket. The top three predictions have very similar confidences. The motif detected 
by AF resembles LIG_NRBOX with the motif L..LL. CAMK2G 300-310: LKGAILTTMLV -> 
looks plausible to me because the M is hydrophobic and could substitute for the role 
of L in the regex. CAMK2G 315-325: SAAKSLLNKKS -> also possible, but the A is fitted into 
a quite deep hydrophobic pocket where the known structure (refer to run21) shows an L 
fitted into the pocket. A might have too short a hydrophobic side chain to make good 
contact with the deep pocket. CAMK2G 355-365: QEPAPLQTAME -> not so good IMO 
because the hydrophobic contact is less extensive than for the peptide found above. Another 
interesting observation: the CAMK2G 285-423 (139 aa) prediction resulted in 0.78 model 
confidence, which is very high for a disordered region that long. In this case, CAMK2G 
300-310 is fitted into the hydrophobic pocket, adding weight to the idea that this could be the 
correct peptide. This reminds me of the extension analysis with DMI, where extension of the 
motif can improve prediction results. 
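
The comparison of the three CAMK2G peptides against L..LL can be sketched as a motif scan. This is a minimal illustration, not the analysis used here: the strict pattern is the L..LL core quoted above, while the relaxed pattern, which allows I, V, M or A in place of each L, is an assumption made purely for illustration. The function name is hypothetical; residue numbers follow the fragment boundaries given in the notes.

```python
import re

# Lookaheads are used so that overlapping candidate sites are all reported.
STRICT_CORE = re.compile(r"(?=(L..LL))")
# Assumption for illustration only: any of L, I, V, M, A may replace an L.
RELAXED_CORE = re.compile(r"(?=([LIVMA]..[LIVMA][LIVMA]))")


def scan(peptide: str, first_residue: int, pattern: "re.Pattern[str]"):
    """Return (residue number of first motif position, matched 5-mer) pairs."""
    return [(first_residue + m.start(), m.group(1))
            for m in pattern.finditer(peptide)]


peptides = {
    300: "LKGAILTTMLV",  # CAMK2G 300-310
    315: "SAAKSLLNKKS",  # CAMK2G 315-325
    355: "QEPAPLQTAME",  # CAMK2G 355-365
}

for first, seq in peptides.items():
    print(first, seq,
          "strict:", scan(seq, first, STRICT_CORE),
          "relaxed:", scan(seq, first, RELAXED_CORE))
```

None of the three peptides contains a strict L..LL core, which is why the argument above rests on hydrophobic substitutions. The relaxed scan only lists sequence-compatible sites; it cannot judge whether a short side chain such as A fills the deep pocket, so it does not replace inspection of the modelled structure.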
A pairing of ordered-ordered region prediction returned high confidence (0.83). This 
involves Zn finger from ESRRG and CaMKII association domain at the C terminus of CAMK2G. 
The binding is close to but not in the Zn binding pocket, which is good. CaMKII association 
domain of CAMK2 has been shown to oligomerize with other CAMK2 in 1HKX. 
Looking at the monomeric structures of ESRRG and CAMK2G, it looks possible that the 
C-terminal association domain of CAMK2G binds to the Zn finger domain of ESRRG 
while the hormone receptor domain of ESRRG binds to the long disordered region 
separating the two domains of CAMK2G. This would make a multi-site binding between the two 
proteins and a very interesting case. 
 
run70: XRCC4-LIG4 
The structure of this interaction has been solved: 3II6 and 1IK9. In the structure 
3II6, XRCC4 first forms a homodimer via its 
coiled-coil domain, and the tandem BRCT domains of LIG4 then bind around the homodimer. The 
BRCT domains are separated by a structurally less defined region that most likely forms two 
helices upon binding to XRCC4. Not sure if this should be seen as a domain-motif or a domain-domain 
interaction, probably something in between. It is not so clear from the monomeric AF 
model of full-length LIG4 that both BRCT domains form a functional unit, but I guess one could 
also have made a fragment comprising both domains and the linker sequence. Runs so far 
were made with both BRCT domains and the linker sequence as individual fragments; a 
rerun using the BRCT domain tandem as one structural unit still has to be done. 
The top prediction involves a motif at the C-terminus of XRCC4 that is predicted to 
bind to the last BRCT domain of LIG4. I think the prediction is wrong because it disagrees with the solved 
structure. The prediction also does not look like how other motifs bind to BRCT domains, e.g. the motif of 
FANCJ (LIG_BRCT_BRCA_1). However, the C-terminus of XRCC4 certainly carries one or 
two motifs. One is annotated in Proviz as WD40 domain binding. The very C-terminus is a 
class 3 PDZ-binding motif. The whole region is very conserved. Maybe this is why AF tries to 
put peptides from this C-terminus in various domains, including the DNA ligase domain of 
LIG4 (fourth top prediction). So, the top two predictions involve this C-terminus and reach high 
confidences in both metrics (model confidence and intf_avg_plddt). 
The third highest prediction involves the XRCC4 N-terminal domain plus one long helix 
(taken as one ordered region) and the 2nd BRCT domain. This interface is exactly the same 
interface that is seen in the structure 3II6 where part of the BRCT domain also contacts the 
XRCC4 helix. 
The 6th best prediction involves the linker between both BRCT domains and the 
XRCC4 helix. Despite the fact that XRCC4 is in monomeric form in our prediction and that the 
BRCT domains are missing, AF correctly models the contacts between the linker and the 
single XRCC4 domain as they can be seen in the structure 3II6. This model meets both cutoffs, 
for model confidence and pLDDT.  
The rerun using the BRCT domain tandem as one structural unit has been completed. Paired with 
the coiled-coil XRCC4 fragment, the tandem BRCT fragment ranks 7th by model confidence 
and second among ordered-ordered fragment pairs by average interface pLDDT. The 
top-ranked prediction is still the single BRCT domain binding to the coiled-coil fragment 
(average interface pLDDT of 92 vs 89). 
 
run71: TMEM237-MFF 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
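The inspection criteria quoted in parentheses can be expressed as a small filter. This is a minimal sketch: the record fields, the function names and the example values are hypothetical, and the rule that a run is inspected when any prediction meets either cutoff is one literal reading of the sentence above, not a documented pipeline step.

```python
from dataclasses import dataclass
from typing import Optional

MODEL_CONFIDENCE_CUTOFF = 0.7    # model confidence >= 0.7
DISORDERED_PLDDT_CUTOFF = 70.0   # ordered-disordered: disordered fragment interface pLDDT >= 70
ORDERED_PLDDT_CUTOFF = 75.0      # ordered-ordered: intf_avg_plddt >= 75


@dataclass
class Prediction:
    pair_type: str  # "ordered-disordered" or "ordered-ordered"
    model_confidence: float
    disordered_intf_plddt: Optional[float] = None  # used for ordered-disordered pairs
    intf_avg_plddt: Optional[float] = None         # used for ordered-ordered pairs


def passes_confidence(p: Prediction) -> bool:
    return p.model_confidence >= MODEL_CONFIDENCE_CUTOFF


def passes_plddt(p: Prediction) -> bool:
    if p.pair_type == "ordered-disordered":
        value, cutoff = p.disordered_intf_plddt, DISORDERED_PLDDT_CUTOFF
    elif p.pair_type == "ordered-ordered":
        value, cutoff = p.intf_avg_plddt, ORDERED_PLDDT_CUTOFF
    else:
        raise ValueError(f"unknown pair type: {p.pair_type}")
    return value is not None and value >= cutoff


def run_is_inspected(predictions) -> bool:
    """Assumed rule: inspect a run if any prediction meets either cutoff."""
    return any(passes_confidence(p) or passes_plddt(p) for p in predictions)


# Hypothetical values for illustration only.
low_run = [
    Prediction("ordered-disordered", 0.65, disordered_intf_plddt=60.0),
    Prediction("ordered-ordered", 0.55, intf_avg_plddt=71.0),
]
high_run = low_run + [Prediction("ordered-ordered", 0.83, intf_avg_plddt=80.0)]
print(run_is_inspected(low_run), run_is_inspected(high_run))
```

Keeping the two checks as separate functions makes it easy to report cases such as a model that meets the model confidence cutoff but not the pLDDT cutoff.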
 
run72: HNRNPK-TH 
In the full length structural model of HNRNPK the first 2 KH domains are predicted to pack 
against each other using an interface that is also predicted to bind to the TH peptide 61-71. 
This region indeed overlaps with a Pfam HMM that seems to find some pattern in this 
disordered region but nothing is known about this “structural”(?) motif. It predicts 3 
occurrences of it in the N-terminal region of TH but the third one is the most conserved and 
this is the one predicted to bind to the second KH domain. Two other motifs overlapping with 
61-71 are also predicted to bind to this KH domain. The residues that are part of all three 
motifs are predicted to bind to the KH domain in the same way. One prediction below the 
model confidence cutoff predicts the motif to bind to the third KH domain but in a different way. 
 
run73: OTX2-RPS26 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
 
run74: MFF-MMGT1 
Not inspected because none of the predictions returns model confidence or average interface 
pLDDT above cutoff (model confidence ≥ 0.7, ordered-disordered prediction with disordered 
fragment interface plddt ≥ 70; ordered-ordered prediction with intf_avg_plddt ≥ 75). 
 
run75: PUF60-TH 
The top prediction uses both RRM domains of PUF60 as one ordered region and a 
disordered polyA peptide from TH. The peptide is placed at the same position where the Nbox 
motif binds in the NMR structure 2KXH. However, the predicted peptide has a somewhat 
different sequence (solved structure: LxxAxxI, model: VxxAxxV), and there are no recurrent 
predictions. Another prediction involves the third RRM domain of PUF60 and another peptide 
in TH which tucks a Trp into a pocket, but it does not look very convincing. 
The prediction involving disordered fragments from PUF60 and the ordered region 
(Biopterin_H domain) of TH returned a maximum model confidence of 0.78. This is likely 
a false interface because the short peptide is fitted into the biopterin and iron binding pocket of the 
enzymatic domain (refer to run72 for example). The second best prediction is fitted at the 
same site and is therefore also likely a false interface. 
Interestingly, the disordered region of PUF60 302-461 is modelled with 0.69 model 
confidence with the Biopterin_H domain of TH. The long disordered region makes contacts 
with two regions of the domain: one at the iron binding site (likely false) and a coiled-coil 
interaction at the C-terminal helix of the Biopterin_H domain. This coiled-coil interaction is 
reproduced by a shorter disordered fragment of PUF60 (317-347, third best prediction (0.77), binding the 
same C-terminal helix as the long disordered region). This coiled-coil interaction looks like a 
plausible interface. 
I tried finding more information about this ACT-like domain but to no avail. InterPro 
says that it homo-dimerizes using the beta strands like in 1Q5V, but the fold is not exactly the 
same. The ACT-like domain in TH is special in the way that the last beta strand is formed by 
its N and C termini by looping back to meet each other. I cannot find much information about 
this domain. 
 
run76: PUF60-QRICH1 
One long disordered region of PUF60 (1-128) is modelled with high model confidence with the 
DUF of QRICH1. Within this region, residues 111-121 are modelled at the interface. When 
predicted as a separate fragment, this region also showed high confidence (0.86). The 
fragment tucks an R into a very deep negatively charged pocket, but the rest of the peptide 
seems to make questionable contacts with the DUF domain. 
The top predictions with the ordered region of QRICH1 and peptides from PUF60 place either the 
linker helix between the first two RRM domains, the N-terminal long helix of PUF60, or 
another helical peptide at 442-461 at two different places on the DUF domain. I think that the 
helical linker between both RRM domains is not accessible for this mode of binding because 
the key residues make intramolecular contacts with the RRM domains in the AF monomer 
model of PUF60. 
Three different peptides are predicted to bind to the tandem PUF60 domain. In principle, 
the long disordered N-terminal region of QRICH1 is full of potential helical peptides with the pattern 
hydrophobic-x-x-Ala-x-x-hydrophobic, which resembles the Nbox motif 
that can bind to PUF60, and the three peptides are all predicted to bind to the same 
pocket. 
There are also four different peptides in QRICH1 predicted to bind to the third RRM 
domain. 
 
run77: MAB21L2-AP1S2 
The top prediction involves the Clat_adaptor_s domain of AP1S2 and the disordered fragment 
(215-220) of MAB21L2 (78 motif pLDDT, 0.77 model confidence). The motif is predicted 
recurrently with variable length, but the disordered region is generally very short because it is 
a loop within the domain of MAB21L2. AF also made a disulfide bridge between motif and 
domain. Not sure this is correct. Looking at the structure 1W63, which shows the large AP1 
clathrin adaptor core complex containing a fold similar to the one in AP1S2, one can see 
that the region where the peptide is predicted to bind would in principle be accessible for 
binding. The Clat_adaptor_s domain is known to bind motifs annotated in the ELM DB, but no structure 
of this domain with a bound peptide has been solved. The disordered fragments mentioned 
above also do not match any ELM class that binds to Clat_adaptor_s. Other 
good predictions use the Mab-21 domain of MAB21L2. Two overlapping disordered fragments 
(146-154, 0.68 and 153-157, 0.75) had good confidence with the domain, but they are modelled 
at different binding sites, so it does not look likely to me that this is the binding region. 
 
run78: PRKAR1B-QRICH1 
The motif in PRKAR1B is at the very C-terminus of the protein and also matches a PDZ-binding 
motif. Only one prediction meets the model confidence cutoff, but it does 
not meet the pLDDT cutoff. The C-terminal peptide of PRKAR1B binds to the only domain of 
QRICH1, but extended or shorter versions of the motif are predicted to bind to the domain only 
with very low scores, so there is no recurrence here. The prediction therefore looks unlikely to 
be functional. No other predictions meet the pLDDT cutoff. 
 
 
Figure S1
[Figure: panels A-M. A: frequency vs pairwise domain sequence identity (global alignment, %). 
B: proportion of high, medium, acceptable and incorrect models. C: DockQ vs motif all atom 
RMSD (Å). D-E: motif all atom RMSD (Å) and DockQ, with template vs without template. 
F: grouping by ELM class (DEG, DOC, LIG, TRG, MOD). G: grouping by solved motif secondary 
structure (helix, strand, loop). H-J: average motif hydropathy, motif symmetry score and motif 
probability vs motif all atom RMSD (Å). K: structures solved by X-ray vs others. L: model 
confidence vs motif all atom RMSD (Å). M: Pearson correlation coefficients between prediction 
variables and prediction outcomes.]
 
Appendix Figure S1. Benchmarking of AF on DMI interfaces using minimal interacting 
regions. 
A Pairwise sequence identity of domains in the DMI positive reference dataset. B Proportion 
of high, medium, acceptable and incorrect models predicted by AF from the positive 
reference dataset as classified by the DockQ score. C Scatterplot of DockQ vs motif RMSD 
for DMIs from positive benchmark dataset. Pearson r = -0.85, p-value < 0.0001. D-E Motif 
RMSD and DockQ scores of structures for DMIs from positive benchmark dataset predicted 
by AF with and without the use of templates. Motif RMSD: Pearson r = 0.81, p-value < 
0.0001. DockQ: Pearson r = 0.88, p-value < 0.0001. F Accuracy of AF DMI predictions 
stratified according to the annotated functional categories of DMIs in the ELM DB. 
DEG=degron, DOC=docking, LIG=ligand, TRG=targeting, MOD=modification. G Accuracy of 
AF DMI predictions stratified according to the secondary structure element formed by the 
motif in the solved structure. H-J Scatterplot of various motif features vs motif RMSD 
determined for models and structures of DMIs from positive benchmark dataset: H motif 
hydropathy, Pearson r = -0.03, p-value = 0.72, I motif symmetry, Pearson r = -0.08, p-value 
= 0.38, J motif regular expression degeneracy, Pearson r = -0.04, p-value = 0.66. K 
Accuracy of AF DMI predictions stratified according to the method used to solve the 
structures in the benchmark dataset, two-sided Mann-Whitney-Wilcoxon test, p-value = 0.017, 
test statistic = 811. L Scatterplot of model confidence of predicted models vs motif RMSD 
determined from superimposing the predicted models with structures of DMIs from the 
positive benchmark dataset. Pearson r = -0.55, p-value < 0.0001. M Correlation matrix of 
different prediction variables and prediction outcomes. 
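
Panel B sorts models into quality classes by DockQ score. A minimal sketch of such a classification, assuming the commonly used DockQ boundaries of 0.23, 0.49 and 0.80; whether exactly these boundaries were applied in this benchmark is an assumption, and the function name is hypothetical.

```python
def dockq_class(score: float) -> str:
    """Map a DockQ score in [0, 1] to a quality class.

    Boundaries follow the commonly used DockQ convention (assumed here):
    incorrect < 0.23 <= acceptable < 0.49 <= medium < 0.80 <= high.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("DockQ scores lie between 0 and 1")
    if score >= 0.80:
        return "high"
    if score >= 0.49:
        return "medium"
    if score >= 0.23:
        return "acceptable"
    return "incorrect"


# Hypothetical scores for illustration only.
for example in (0.10, 0.30, 0.60, 0.90):
    print(example, dockq_class(example))
```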
 
 
Figure S2
[Figure: panels A-J. A: ROC curves (true positive rate vs false positive rate) for the reference 
sets "1 mutation in motif", "2 mutations in motif" and "Randomly paired DMI". B: precision-recall 
curves (top) and average precision (bottom) for the same sets. Metrics in A-B: model confidence, 
domain chain interface pLDDT, motif chain interface pLDDT, average interface pLDDT, pDockQ, 
iPAE, residue-residue contact, atom-atom contact; a random predictor is shown in A. C: ROC 
curve, legend "Mean of DockQ between predicted models: 0.77". D: LIG_MYND_1: ZMYND11 & MGA 
(2ODD). E: LIG_MYND_3: EGLN1 & FKBP8 (2ODD). Motif sequences shown below D-E, in panel order: 
RPMPPKLAPGLKV, LSELPPLEDMGQP. F-H: CREB3 (78-81) - HCFC1, THAP1 (134-137) - HCFC1, 
E2F1 (97-100) - HCFC1. I-J: Replicate 1, Replicate 2.]
 
Appendix Figure S2. Benchmarking and application of AF for DMI interface prediction 
using minimal interacting fragments. 
A Receiver operating characteristic (ROC) curve of various metrics extracted from AF 
models when using the DMI benchmark dataset as the positive reference and the following 
sets as random reference: Left, 1 mutation introduced in conserved motif position; middle, 2 
mutations introduced in conserved motif positions, right, randomly shuffled domain-motif 
pairs. B Precision recall curve of various metrics determined for benchmark datasets as in A. 
C ROC curve of mean DockQ between the top five AF structural models returned for a given 
input, assessed using the DMI positive reference set and random pairings of domains and 
motifs as in A. The AUROC of the metric is indicated in the legend of the ROC curve.  D-E 
Superimposition of AF structural model for motif class LIG_MYND_1 (D) and LIG_MYND_3 
(E) (orange) with homologous solved structures (PDB:2ODD) from motif class LIG_MYND_2 
(blue). The motif sequence used for prediction is indicated at the bottom, colored by pLDDT 
(dark blue=highest pLDDT). F-H AF models for three motif instances (orange) of LIG_HCF-
1_HBM_1 predicted to bind into a pocket on the Kelch domain of HCFC1 (gray). Motif 
positions are indicated below the figures. The key tyrosines of the motif sequences are 
drawn as sticks. I BRET50 estimates from fitting titration curves shown in Fig 1G are plotted 
vs. BRET values that were corrected for bleedthrough and measured at a 2:50 ng DNA 
transfection ratio for wildtype and mutant CREBZF-HCFC1 pairs. Error bars indicate the 
standard error. Data is shown for two technical replicates for the first biological replicate and 
three technical replicates for the second biological replicate. J Fluorescence and total 
luminescence are shown for wildtype and mutant CREBZF-HCFC1 pairs measured at a 2:50 
ng DNA transfection ratio. Error bars indicate STD of two technical replicates for the first 
biological replicate and three technical replicates for the second biological replicate. Coloring 
as in I. 
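
The AUROC values behind panels A and C can be illustrated with a rank-based calculation. This is a minimal sketch, not the analysis code used for the figure: the function name and the toy scores are hypothetical, and it assumes that higher scores indicate more confident models (a metric such as iPAE, where lower is better, would be negated first).

```python
def auroc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as one half.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical toy scores: positive reference vs random reference.
positives = [0.9, 0.8, 0.4]
negatives = [0.5, 0.3, 0.2]
print(round(auroc(positives, negatives), 3))
```

A random predictor gives 0.5 under this definition.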
 
 
Figure S3
[Figure: panels A-E. A-C: heatmaps over motif extension step (0-6) and domain extension step 
(0-3), colored by average motif chain interface pLDDT (A), average pDockQ (B) and average 
iPAE (C). D: ROC curves (true positive rate vs false positive rate, top) and area under the 
curve (bottom). E: precision-recall curves (top) and average precision (bottom). Panel titles 
in D-E: Minimal motif + Minimal domain, Minimal motif + Extended domain, Extended motif + 
Minimal domain, Extended motif + Extended domain. Metrics in D-E: model confidence, domain 
chain interface pLDDT, motif chain interface pLDDT, average interface pLDDT, pDockQ, iPAE, 
residue-residue contact, atom-atom contact; a random predictor is shown in D.]
 
Appendix Figure S3. Effect of protein fragment extensions on the accuracy of AF 
predictions. 
A-C Heatmap of the average motif interface pLDDT (A), pDockQ (B), and iPAE (C) for 
combinations of different motif and domain sequence extensions using a positive reference 
set consisting of 31 DMI structures. Extensions like in Fig 2A. D ROC curves (top) and 
corresponding AUROC values (bottom) of various metrics extracted from AF models when 
using the DMI extension dataset split by different combinations of motif and domain 
extensions as indicated on the top of each graph. Gray horizontal line indicates the AUROC 
of a random predictor. E Precision recall curves (top) and area under the precision recall 
curve as quantified by average precision (bottom) for various metrics extracted from AF 
models determined for benchmark datasets as in D. 
 
 
Figure S4
[Figure: panels A-F. A: true positive rate (left) and false positive rate (right) per metric 
(model confidence, domain chain interface pLDDT, motif chain interface pLDDT, average 
interface pLDDT, pDockQ, iPAE) for the sets Minimal motif + Minimal domain (All), Minimal 
motif + Minimal domain (Extended), Minimal motif + Extended domain, Extended motif + Minimal 
domain, Extended motif + Extended domain. B: table of motif RMSD per ELM class, reproduced 
below; color legend: correct side-chain, correct backbone, correct pocket, wrong pocket. 
C: Extension 0 vs Extension 1 models for the cases listed below. D-F: ROC curve, precision-recall 
curve and average precision for randomly paired DDI; metrics: model confidence, average 
interface pLDDT, pDockQ, iPAE, residue-residue contact, atom-atom contact; a random predictor 
is shown in D-E.]

Panel B:
ELM class          RMSD Ext 0 (Å)   RMSD Ext 1 (Å)
LIG_RPA_C_Vert     37.52            3.35
LIG_HOMEOBOX       24.84            0.49
LIG_Pex14_3        10.84            2.77
LIG_GYF            12.47            7.42
LIG_CAP-Gly_2      5.64             0.89
LIG_NBox_RRM_1     6.46             2.09
DOC_MAPK_JIP1_4    2.11             1.07

Panel C (motif class: proteins (PDB), sequence shown at the bottom of the panel):
LIG_HOMEOBOX: PBX1 & HOXB-1 (1B72), TARTFDWMKVKR
DOC_MAPK_JIP1_4: MK10 & 3BP5 (4H3B), DQFPAVVRPGSLDLPSPVSLS
LIG_Pex14_3: PEX14 & PEX5 (4BXU), VASEDELVAEFLQDQNAP
LIG_GYF: CD2BP2 & CD2 (1L2Z), PGHRSQAPSHRPPPPGHRVQHQPQKRP
LIG_RPA_C_Vert: RPA2C & UNG (1DPU), EPGTPPSSPLSAEQLDRIQRNKAAALLRLAARNVPVGFGESWKKHLSG
 
Appendix Figure S4. Effect of protein fragment extensions on the accuracy of AF 
predictions. 
A True and false positive rate (left and right, respectively) based on optimal cutoffs from Fig 
2D derived for different metrics from ROC analysis for benchmarking AF with different motif 
and domain extensions from the reference dataset illustrated in Fig 2A and random pairings 
of domain and motif sequences. B Table indicating the motif RMSD achieved when using 
minimal (extension 0) or extended motif sequences for structure prediction for all inspected 
motif extension cases. Extension 1 refers to extension of the minimal motif sequence by the 
length of the motif to the left and right. Color coding indicates the accuracy classes of the 
respective structural models as shown in Fig 1A. C Superimposition of the structural model 
of the minimal (left, orange) or extended (right, yellow) motif sequence with the solved 
structure (motif in blue) for five different motif classes as indicated on the top of each panel. 
The motif sequence from the solved structure is indicated at the bottom of each panel. Motif 
residues are underlined, motif residues not resolved in the structure have a gray 
background. Sticks indicate the motif residues, domain surfaces are shown in gray based on 
experimental structures. D ROC curves of different metrics using the DDI benchmark dataset 
as positive reference and random shuffling of domain-domain pairs as negative reference. E 
Precision recall curves of different metrics extracted from AF models determined for 
benchmark datasets as in D. F Area under the precision recall curve as quantified by 
average precision for metrics extracted from AF models determined for benchmark datasets 
as in D. Gray horizontal line indicates the average precision of a random predictor. 
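
The extension rule described for panel B (extension 1 adds one motif length on each side) can be sketched as a coordinate calculation. This is a minimal sketch under stated assumptions: positions are 1-based and inclusive, the extension is clipped at the protein termini, and higher extension steps scale linearly with the motif length. The function name and the example coordinates are hypothetical.

```python
def extend_motif(start: int, end: int, protein_length: int, step: int = 1):
    """Extend a motif by `step` motif lengths to the left and to the right.

    Coordinates are 1-based and inclusive. Clipping at the protein termini
    and linear scaling with `step` are assumptions of this sketch.
    """
    if not 1 <= start <= end <= protein_length:
        raise ValueError("motif must lie within the protein")
    if step < 0:
        raise ValueError("step must be non-negative")
    flank = step * (end - start + 1)
    return max(1, start - flank), min(protein_length, end + flank)


# Hypothetical 6-residue motif at positions 50-55 of a 200-residue protein.
print(extend_motif(50, 55, 200))          # extension 1
print(extend_motif(50, 55, 200, step=0))  # minimal motif, unchanged
```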
 
 
Figure S5
[Figure: panels A-H. A-D: scatterplots of AlphaFold MMv2.2 vs AlphaFold MMv2.3 for motif all 
atom RMSD (Å) (A), DockQ (B), model confidence (C) and motif chain interface pLDDT (D). 
E-F: heatmaps for AF v2.2 (E) and AF v2.3 (F) over motif extension step (0-6) and domain 
extension step (0-3), colored by log2(RMSDmin/RMSDext). G: AUROC AF-MMv2.2 vs AUROC AF-MMv2.3. 
H: AP AF-MMv2.2 vs AP AF-MMv2.3. Panel titles in G-H: 1 mutation in motif, 2 mutations in 
motif, Randomly paired DMI, Randomly paired DDI. Metrics in G-H: model confidence, domain 
chain interface pLDDT, motif chain interface pLDDT, average interface pLDDT, pDockQ, 
residue-residue contact, atom-atom contact.]
 
Appendix Figure S5. Comparison of AF v2.2 and v2.3 prediction performance. 
A Scatterplot showing the motif RMSD obtained from structural models computed either with 
AF v2.2 or AF v2.3 using the minimal interacting regions of all annotated DMIs. B-D 
Scatterplots computed as in A showing the DockQ (B), model confidence (C), and motif 
chain interface pLDDT (D) for both AF versions. E-F Heatmaps showing the log2 fold change in 
motif RMSD, log2(RMSDmin/RMSDext), obtained for structural models from AF v2.2 (E) and AF v2.3 (F) 
upon domain and/or motif sequence extension compared to when using minimal interacting regions. 
Positive values indicate improved predictions from extension and negative values indicate 
worse prediction outcomes. G Scatterplots showing the AUROC obtained for different 
metrics derived from structural models from benchmarking AF v2.2 and AF v2.3 using the 
minimal interacting regions of all annotated DMIs or DDIs as the positive reference dataset 
and different random reference datasets: Left (DMI), 1 mutation introduced in conserved 
motif position; middle-left (DMI), 2 mutations introduced in conserved motif positions, middle-
right (DMI), randomly shuffled domain-motif pairs; right (DDI), randomly shuffled domain-
domain pairs. Corresponding ROC curves for AF v2.2 and AF v2.3 are shown in Fig. S2A, 
S4D, and S6A. H Scatterplots as in G plotting the average precision (AP) obtained from PR 
curves from the same analysis as in G. Corresponding PR curves for AF v2.2 and AF v2.3 
are shown in Fig S2B, S4E and S6B. 
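
The quantity plotted in panels E-F, log2(RMSDmin/RMSDext), can be illustrated with a short calculation. This is a minimal sketch: it reuses the per-class motif RMSD values listed in the table of Fig S4B purely as example input (the heatmaps themselves summarize the extension dataset and are not reproduced here), and the function and variable names are hypothetical.

```python
import math

# Motif RMSD (Å) for minimal (Ext 0) and extended (Ext 1) motifs, from Fig S4B.
RMSD_BY_CLASS = {
    "LIG_RPA_C_Vert": (37.52, 3.35),
    "LIG_HOMEOBOX": (24.84, 0.49),
    "LIG_Pex14_3": (10.84, 2.77),
    "LIG_GYF": (12.47, 7.42),
    "LIG_CAP-Gly_2": (5.64, 0.89),
    "LIG_NBox_RRM_1": (6.46, 2.09),
    "DOC_MAPK_JIP1_4": (2.11, 1.07),
}


def log2_fold_change(rmsd_min: float, rmsd_ext: float) -> float:
    """log2(RMSDmin / RMSDext); positive means the extension lowered the RMSD."""
    if rmsd_min <= 0 or rmsd_ext <= 0:
        raise ValueError("RMSD values must be positive")
    return math.log2(rmsd_min / rmsd_ext)


fold_changes = {
    name: log2_fold_change(rmsd_min, rmsd_ext)
    for name, (rmsd_min, rmsd_ext) in RMSD_BY_CLASS.items()
}
for name, value in fold_changes.items():
    print(f"{name}: {value:.2f}")
```

All seven listed cases give a positive value, since the extended motif has the lower RMSD in each of them.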
 
 
Figure S6
[Figure: panels A-D. A: ROC curves (true positive rate vs false positive rate). B: precision-recall 
curves. Panel titles in A-B: 1 mutation in motif, 2 mutations in motif, Randomly paired DMI, 
Randomly paired DDI. Metrics in A-B: model confidence, domain chain interface pLDDT, motif 
chain interface pLDDT, average interface pLDDT, pDockQ, residue-residue contact, atom-atom 
contact, random predictor. C: optimal cutoff, true positive rate and false positive rate per 
metric (model confidence, domain chain interface pLDDT, motif chain interface pLDDT, average 
interface pLDDT, pDockQ) for the sets Minimal motif + Minimal domain (All), Minimal motif + 
Minimal domain (Extended), Minimal motif + Extended domain, Extended motif + Minimal domain, 
Extended motif + Extended domain. D: fraction of fragments above threshold for randomly paired 
proteins, with the number of submitted fragment pairs on top of each bar.]
 
Appendix Figure S6. Performance of different metrics derived from structural models 
when benchmarking AF v2.3 for DMI predictions. 
A ROC curves obtained for different metrics derived from structural models from 
benchmarking AF v2.3 using the minimal interacting regions of all annotated DMIs or DDIs 
as the positive reference dataset and different random reference datasets: Left (DMI), 1 
mutation introduced in conserved motif position; middle-left (DMI), 2 mutations introduced in 
conserved motif positions, middle-right (DMI), randomly shuffled domain-motif pairs; right 
(DDI), randomly shuffled domain-domain pairs. B PR curves computed for the same 
datasets and AF version as in A. C Optimal cutoff, true, and false positive rate derived for 
different metrics from ROC analysis for benchmarking AF v2.3 with different motif and 
domain extensions from the reference dataset used in Fig 2A and randomly shuffled domain-motif 
pairs. D Fraction of fragment pairs with structural models scoring above thresholds for 
20 randomly shuffled domain-motif pairs. Numbers on top of the bars indicate the total 
number of fragment pairs submitted for interface prediction to AF for each random protein 
pair. 
 
 
Figure S7
[Figure: panels A-C. A: Replicate 1 and Replicate 2 for each of the two measurements. B: groups 
labeled Motif iii mutated, Motif iv mutated and Domain mutated. C: Replicate 1, Replicate 2.]
 
Appendix Figure S7. Expression and BRET50 plots for TRIM37-PNKP and ESRRG-
PSMC5. 
A Fluorescence and total luminescence are shown for wildtype and mutant TRIM37-PNKP 
pairs measured at a 2:50 ng DNA transfection ratio. Error bars indicate STD of three 
technical replicates. Data is shown for two biological replicates. B BRET50 estimates from 
fitting titration curves shown in Fig 4H are plotted vs. BRET values that were corrected for 
bleedthrough and measured at a 2:50 ng DNA transfection ratio for wildtype and mutant 
ESRRG-PSMC5 pairs. Error bars indicate the standard error. Data is shown for three 
technical replicates for two biological replicates each. BRET50 estimates for the second 
biological replicate for the ESRRG_M437F-PSMC5 pair were omitted from the graph 
because they exceeded the upper y-axis limit. Roman labels refer to interfaces shown in Fig 
4E. C Fluorescence and total luminescence are shown for wildtype and mutant ESRRG-
PSMC5 pairs measured at a 2:50 ng DNA transfection ratio. Error bars indicate STD of three 
technical replicates. Data is shown for two biological replicates. 
 
 
Figure S8
[Figure: panels A-L, with Replicate 1 and Replicate 2 shown for the expression and BRET plots. 
E: interface i, labeled residues L8 and L152. I: interface ii, labeled residues I355, N134 
and R348.]
 
Appendix Figure S8. Structural models, expression, and BRET50 plots for STX1B-
FBXO28 and STX1B-VAMP2. 
A BRET50 estimates from fitting titration curves shown in Fig 5C are plotted vs. BRET 
values that were corrected for bleedthrough and measured at a 2:50 ng DNA transfection 
ratio for wildtype and mutant STX1B-VAMP2 pairs. Error bars indicate the standard error. 
Data is shown for three technical replicates for two biological replicates each. B 
Fluorescence and total luminescence are shown for wildtype and mutant STX1B-VAMP2 
pairs measured at a 2:50 ng DNA transfection ratio. Error bars indicate STD of three 
technical replicates. Data is shown for two biological replicates. C Data shown as in A for 
wildtype and mutant FBXO28-STX1B pairs relating to interface iii (Fig 5A,D). D Data shown 
as in B for wildtype and mutant FBXO28-STX1B pairs shown in C. E Structural model 
corresponding to interface i shown in Fig 5A. Mutated residues on the domain (green) and 
motif side are labeled. F BRET titration curves are shown for wildtype and mutant FBXO28-
STX1B pairs relating to interface i shown in E with two biological replicates, each with three 
technical replicates. Protein acceptor over protein donor expression levels are plotted on the 
x-axis determined from fluorescence and luminescence measurements, respectively. G Data 
shown as in A for wildtype and mutant FBXO28-STX1B pairs relating to interface i. H Data 
shown as in B for wildtype and mutant FBXO28-STX1B pairs relating to interface i. I 
Structural model corresponding to interface ii shown in Fig 5A. Mutated residues on the 
domain (green) and motif side are labeled. J Data shown as in F for wildtype and mutant 
FBXO28-STX1B pairs relating to interface ii. K Data shown as in A for wildtype and mutant 
FBXO28-STX1B pairs relating to interface ii. L Data shown as in B for wildtype and mutant 
FBXO28-STX1B pairs relating to interface ii. 
 
 
Figure S9
 
Appendix Figure S9. Structural models, expression, and BRET50 plots for PEX3-
PEX19 and PEX3-PEX16. 
A Structural model of PEX3-PEX19 corresponding to interface v as shown in Fig 5G. 
Mutated residues on the domain (green) and motif side are labeled. B Structure from 
PDB:3MK4 showing the PEX19 N-terminal motif bound to the PEX3 domain. C BRET50 
estimates from fitting titration curves shown in Fig 5H are plotted vs. BRET values that were 
corrected for bleedthrough and measured at a 2:50 ng (for PEX3 and PEX3_T90Q) or 8:50 
ng (for PEX3, PEX3_R54S, PEX3_E272R) DNA transfection ratio for wildtype and mutant 
PEX3-PEX19 pairs. Error bars indicate the standard error. Data is shown for three technical 
replicates. The left panel corresponds to mutant constructs that should disrupt binding while 
mutants shown in the right panel were aimed to disrupt binding to PEX16 and thus should 
not disrupt binding to PEX19. D Fluorescence and total luminescence are shown for wildtype 
and mutant PEX3-PEX19 pairs measured at a 2:50 or 8:50 ng DNA transfection ratio (see 
panel C). Error bars indicate STD of three technical replicates. E Structural model obtained 
with AF for the trimeric complex of PEX3 (gray), PEX19 (yellow), and PEX16 (orange) using 
full length sequences as input. F PEX3 expression levels measured in luminescence units 
plotted for co-transfections with increasing PEX16 protein amounts measured in 
fluorescence units. Error bars indicate STD of three technical replicates. G PEX3 expression 
levels measured in luminescence units plotted for co-transfections with increasing PEX19 
protein amounts measured in fluorescence units. Error bars indicate STD of three technical 
replicates. H Data shown as in D for wildtype and mutant constructs of PEX3-PEX16 pairs. 
Measurements were taken at a 2:25 ng DNA transfection ratio. 
 
 
Figure S10
 
Appendix Figure S10. Expression and BRET50 plots for SNRPB-GIGYF1. 
A BRET50 estimates from fitting titration curves shown in Fig 6D are plotted vs. BRET 
values that were corrected for bleedthrough and measured at a 2:50 ng DNA transfection 
ratio for wildtype and mutant SNRPB-GIGYF1 pairs. Error bars indicate the standard error. 
Data is shown for three technical replicates for two biological replicates each. B 
Fluorescence and total luminescence are shown for wildtype and mutant SNRPB-GIGYF1 
pairs measured at a 2:50 ng DNA transfection ratio. Error bars indicate STD of three 
technical replicates. Data is shown for two biological replicates. Coloring as in A. C Data 
shown as in A for wildtype and mutant SNRPB-GIGYF1 pairs fitted from titration curves 
shown in Fig 6E. D Data shown as in B for wildtype and mutant SNRPB-GIGYF1 pairs 
shown in C. 
 
 
Chapter 3
Systematic domain-motif interaction
interface and variant characterization
using protein interaction profiling
3.1 Development of domain-motif interface predictor tool
To address the lack of mechanistic information on PPIs and the
limitations of current bioinformatic tools in predicting PPI interfaces,
a former PhD student of our group designed the DMI predictor tool. Here
I will discuss the workflow of the tool, its performance, and its
application to the HuRI interactome.
3.1.1 The workflow of the DMI predictor
The pipeline takes the UniProt identifiers of a pair of interacting
proteins (e.g. A and B) as input. Within these protein sequences, it
uses Hidden Markov Models (HMMs) to identify known motif-binding
domains. At the same time, regular expressions are applied to detect
occurrences of known motifs. Using a list of DMI types from the ELM
database, the pipeline pairs the identified domains and motifs to
generate putative DMI matches (Figure 3.1 A). These DMI matches are
then annotated with features such as IUPred and ANCHOR scores (the
propensity of the motif to be disordered and its tendency to adopt a
secondary structure upon binding to a partner, respectively), the RLC
score (motif conservation across orthologs), the degeneracy of the
motif type based on its regular expression, the enrichment of the
binding domain among the interaction partners, and the frequency of
motif-binding domains (Figure 3.1 B). The matches are then scored
using a Random Forest (RF) model.
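The motif-scanning and pairing step described above can be illustrated with a minimal sketch. It is not the DMI predictor itself: the pattern `P.DL.` is a deliberately simplified stand-in for a PxDLS-like motif and is not the ELM regular expression, the names `PxDLx_like` and `CtBP_cleft` are hypothetical, and domain detection by HMMs is not reproduced (the partner's domains are simply passed in).

```python
import re

# Hypothetical, simplified motif classes. These are NOT the ELM
# regular expressions; they only illustrate the scanning step.
MOTIF_PATTERNS = {
    "PxDLx_like": re.compile(r"P.DL."),
}

# Hypothetical mapping from motif class to the domain type binding it.
MOTIF_TO_DOMAIN = {"PxDLx_like": "CtBP_cleft"}


def find_motifs(sequence):
    """Return (class, start, end, match) for each non-overlapping hit.

    Positions are 1-based and inclusive, as is usual for proteins.
    """
    hits = []
    for name, pattern in MOTIF_PATTERNS.items():
        for m in pattern.finditer(sequence):
            hits.append((name, m.start() + 1, m.end(), m.group()))
    return hits


def putative_dmis(motif_protein_seq, partner_domains):
    """Keep motif hits whose cognate domain occurs in the partner.

    partner_domains is a set of domain names assumed to come from a
    separate domain-detection step (HMM-based in the real pipeline).
    """
    return [
        hit
        for hit in find_motifs(motif_protein_seq)
        if MOTIF_TO_DOMAIN[hit[0]] in partner_domains
    ]


# Toy sequence (invented for illustration) containing PLDLS.
example_hits = putative_dmis("MKAPLDLSAAGG", {"CtBP_cleft"})
```

Because short patterns of this kind match frequently by chance, each hit is only a candidate, which is why the matches are subsequently annotated with features and scored.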
To train and evaluate this model for predicting DMIs, a positive
reference set (PRS) and several versions of a random reference set
(RRS) were generated. The PRS is based on the 830 known DMI instances
from the ELM database, while each RRS was created by randomly pairing
proteins and scanning for DMI occurrences (Figure 3.1 B). Each RRS
version was paired with the PRS to train a separate RF model, and the
performance was evaluated on test sets. Among these models, the one
trained on RRS version 4, generated by randomly sampling DMI instances
from the entire human interactome, performed best, with an Area Under
the Curve (AUC) of 0.93 for both the ROC and precision-recall curves.
A score cutoff of 0.7 was established for high-confidence DMI
predictions, resulting in a sensitivity of 66.3% and a specificity of
97.2% (Figure 3.1 B).
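How sensitivity and specificity follow from a score cutoff can be sketched as below. This is a generic illustration and not the evaluation code used for the DMI predictor. The function name, the toy scores, and the choice to treat scores equal to the cutoff as positive calls are assumptions.

```python
def sensitivity_specificity(scores, labels, cutoff=0.7):
    """Compute sensitivity and specificity at a given score cutoff.

    scores: predicted DMI match scores between 0 and 1.
    labels: True for PRS (real) instances, False for RRS (random) ones.
    A score >= cutoff counts as a positive call (assumed convention).
    """
    tp = fn = tn = fp = 0
    for score, is_real in zip(scores, labels):
        called = score >= cutoff
        if is_real and called:
            tp += 1
        elif is_real:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (invented): three PRS-like and three RRS-like instances.
toy_scores = [0.9, 0.8, 0.6, 0.2, 0.75, 0.1]
toy_labels = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(toy_scores, toy_labels)
```

Raising the cutoff trades sensitivity for specificity, which is consistent with the reported combination of moderate sensitivity and high specificity at 0.7.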
The pipeline outputs the DMI matches along with their scores, with
higher scores indicating a greater likelihood of being correct (Figure
3.1 A).
3.1.2 The application of the tool on HuRI PPI dataset
The developed DMI tool was applied to the HuRI dataset to detect
PPIs potentially mediated by DMIs. Due to the inherent degeneracy of
motifs, a large number of DMI matches were found within HuRI PPIs.
After applying the high-confidence cutoff on the DMI match score (0.7),
13,406 high-confidence putative DMIs were identified across 3,195 PPIs.
Among these interactions, 54% had their top-ranked matches from the
ligand (LIG) classes, and almost 20% from the modification (MOD)
classes (Figure 3.1 C).
Figure 3.1: The development of DMI predictor and its application on
HuRI. (A) Schematic illustrating the workflow of the developed DMI predictor.
An improved list of DMI types and the trained Random Forest (RF) model are
incorporated into the DMI detection pipeline. (B) The top panel represents the
assembly of the PRS and the different versions of the RRS. The middle panel
illustrates the annotation of features on the PRS and RRSs. The bottom panel shows
the ROC and precision-recall (PR) curves of RF models trained using different
versions of the RRS. For each RRS version, the ROC and PR curves averaged across
the triplicates of the RRS version were plotted by interpolation. Also shown is the
importance of the different features to the RF model trained using the PRS combined
with RRSv4, as quantified by the mean decrease in impurity. (C) The developed DMI
predictor was applied to PPIs detected in HuRI, and the scores of the predicted
DMIs are titrated over increasing cutoffs. The dashed lines refer to the right
y-axis, while the solid line refers to the left y-axis. The red vertical line marks
the cutoff of 0.7 applied to the DMI scores to call a predicted DMI high-confidence.
3.2 Integrating ClinVar mutation data with putative
DMIs mapped on HuRI
ClinVar, the largest mutation database (see Chapter 1, section 1.1),
contains a comprehensive set of patient mutation data. My colleague
processed the most recent version of this dataset by mapping mutations
to proteins and applying several filtering steps. The filtering
retained only germline, non-synonymous single nucleotide variants
(SNVs) with definitive clinical significance and excluded other variant
types such as termination mutations. This resulted in a total of
996,697 variants, of which 45,035 are pathogenic, 73,806 are benign,
and 824,374 are variants of unknown significance (VUS).
The filtered variants were then overlapped with high-confidence
domain-motif interfaces (DMIs) mapped on PPIs, focusing on those
where at least one pathogenic or VUS variant falls within a predicted
DMI. The PPI subset was visualized using the network tool Cytoscape.
We identified a total of 6,057 high-scoring putative DMIs with at least
one pathogenic or VUS mutation falling in the interface (Figure 3.2
A). Because the subset is too large to show in detail, I zoomed in on
the PPIs of HDAC4 and SPOP as examples. HDAC4 has 6 partners with 6
high-confidence DMI predictions, where 5 interactions might be mediated
through a LIG motif type interface and one potentially occurs through a
DOC type motif interface. SPOP has 6 interactions with 3 DEG and 3 DOC
motif type interfaces (Figure 3.2 A). Among the DMIs in this subset,
the most common SLiM type is LIG, with 2,867 instances, followed by
MOD with 1,838 instances, and DOC with 881 instances. The least
frequent SLiM types are TRG (304 instances), DEG (137 instances), and
CLV (30 instances) (Figure 3.2 B).
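The tally by SLiM type can be sketched as follows, assuming that the type is the prefix of the ELM class identifier before the first underscore (as in LIG_CtBP_PxDLS_1 or DOC_WW_Pin1_4). The function names are illustrative and this is not the analysis code used here. The last lines check that the reported per-type counts add up to the reported total of 6,057.

```python
from collections import Counter


def slim_type(elm_class):
    """Return the SLiM type prefix of an ELM class identifier."""
    return elm_class.split("_", 1)[0]


def count_slim_types(elm_classes):
    """Count how often each SLiM type occurs in a list of classes."""
    return Counter(slim_type(c) for c in elm_classes)


# ELM class identifiers mentioned in this chapter.
example_counts = count_slim_types(
    ["LIG_CtBP_PxDLS_1", "DOC_WW_Pin1_4", "LIG_WW_1", "LIG_WW_3"]
)

# Per-type counts as reported in the text for the ClinVar subset.
REPORTED_COUNTS = {
    "LIG": 2867, "MOD": 1838, "DOC": 881,
    "TRG": 304, "DEG": 137, "CLV": 30,
}
reported_total = sum(REPORTED_COUNTS.values())
```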
Figure 3.2: PPI network with predicted DMIs overlapped with ClinVar
mutations. (A) PPI network illustrating the mapped predicted high-confidence
DMIs with at least one pathogenic or VUS mutation overlapped. The blue nodes
represent proteins, and the edges indicate the predicted DMI. The colors represent
different SLiM types. (B) The bar plot illustrates the distribution of different SLiM
types across the PPI network illustrated in A. Each SLiM type .
3.3 The data-driven approach to select disease-
associated proteins and PPIs suitable for the
experimental validation of DMIs
To select PPIs suitable for the experimental validation of putative
DMIs, I employed a data-driven approach, annotating the PPIs in the
subset with experimentally relevant features such as the availability
of ORF sequences for the corresponding genes, which is important for
selecting candidates for experimental work.
For this, I queried our ORFeome collection database for the presence
of a clone for each gene. As it is essential to design an experiment
close to native biological conditions, I selected full-length ORFs.
Furthermore, the established pipeline requires clonal ORFs to achieve a
high success rate in cloning and sequence validation. Additionally, I
assessed the number and types of mutations present at each interface
mapped on PPIs.
Understanding the biological processes regulated by the proteins
encoded by these genes is also crucial. I therefore imported this
information from UniProt and annotated the PPIs in the subset. While
analyzing the candidates, I also checked how many partners each gene
has. Given this biological information, I manually assessed the
validity of the DMI predictions. Some DMIs, despite having high match
scores, did not align with current biological understanding. For
example, we predicted an interface (DMI match score 0.741) involving
WW domains and the DOC_WW_Pin1_4 motif between WWOX and MYOZ2. Pin1 is
a multidomain protein with both a WW domain and a PPIase domain that
work together to target specific sequences. The WW domain of Pin1
recognizes phosphorylated S/T-P motifs, while its PPIase activity
regulates various cellular processes.
However, this prediction might be inaccurate because, although both
Pin1 and WWOX contain WW domains and are involved in disease pro-
cesses, their functions are distinct. Pin1’s role as a PPIase with specific
substrate targeting and isomerization activity sets it apart from WWOX,
which does not perform isomerization but rather functions through
protein interactions. The possibility that a high-scoring DMI might
still be incorrect highlights the need for further refinement of the tool and
underscores the importance of experimental validation, which is the next
step in our proposed strategy.
As a result, I selected 31 annotated gene candidates. Applying the
same approach, I selected 105 gene partners. The selected candidates
and partners form a network of 117 protein-protein interactions
(Figure 3.3 A). In this PPI network, 86 PPIs are mapped with 88
predicted domain-motif interfaces, and 27 PPIs (found in HuRI) mediated
by 31 known DMIs previously studied and annotated in the ELM database
(see Chapter 1, section 1.2) serve as positive controls for DMI
validation. Since some candidates only had predicted interfaces, I
included 4 additional partners whose interactions (not found in HuRI)
are mediated by known DMIs reported in the literature. Additionally,
I included 78 partners that interact with the candidates via different
interfaces, which serve as negative controls.
3.3.1 Retesting of PPIs using the BRET assay
We first cloned and sequence-verified the selected candidates to confirm
protein expression after transfection into mammalian cells. Interacting
partners were cloned only for successfully expressed candidates, which
made the cloning step more efficient. Our prior experience with the
BRET assay indicated that proteins are better expressed when fused to
NanoLuc (NL) luciferase at the N-terminus. Therefore, the candidates
were genetically fused to the NL tag. Using the established cloning
pipeline from Aim 1 (see Chapter 1, section 1.5), I successfully
cloned ORFs for 19 candidate proteins. For the failed candidates, a
second round of cloning was attempted. However, 3 ORFs yielded no
growth in inoculation cultures, while sequencing of the remaining 8
showed either empty vectors or incorrect ORFs, possibly due to
cross-contamination. These results also show that the cloning step,
particularly the manual picking of colonies, can lead to false-positive
results.
For these successfully cloned candidates, 114 partner ORFs were
selected for N-terminal fusion to mCitrine, of which 96 were
successfully cloned, resulting in 96 PPIs available for testing in the
BRET assay. We obtained significant BRET signals for 46 (48%) of these
96 PPIs with proteins expressed above the cutoff (Figure 3.3 B). This
retest rate is notably higher than the retest rates of gold-standard
PPI datasets used in past benchmarking of various binary PPI assays,
including this BRET assay, highlighting the overall detectability of
PPIs from HuRI (Trepte et al. 2018; Braun et al. 2009; Choi et al.
2019).
Among these 46 PPIs, we selected 23 interactions involving 6 can-
didates (CTBP1, WWOX, PPP3CA, REPS1, SPOP, and IQCB1) for
validating the predicted interfaces (Figure 3.3 B). The remaining 24
PPIs were not selected for further analysis because they involved 8 can-
didates with incomplete data, either missing known DMIs or consisting
only of negative controls. For instance, PUF60 was detected with PPIs
mediated solely by known DMIs or through different interfaces.
Figure 3.3: Experimental validation of predicted DMIs on PPIs. (A) PPI
network illustrating selected DMI predictions and experimental retesting in BRET
assay. (B) cBRET, total luminescence and fluorescence for 96 PPIs, where 31 PPIs
have putative DMIs. Luminescence and fluorescence measurements indicate NL and
mCit fusion protein expression levels, respectively. Black horizontal lines indicate
expression level and PPI detection cutoffs. The gray vertical line separates the
detected (left) from undetected PPIs. Protein pairs in bold indicate those selected
for interface validation via site-directed mutagenesis. Error bars indicate STD of
three technical replicates.
To design mutants for the experimental DMI validation and for the
patient variants (see section 3.4), I used the AF-MM structures of the
predicted interfaces computed by my colleague. I visualized the
predicted structures with the protein structure visualization tool
PyMOL to guide the design. We manually designed single point mutations
at potential motif and domain sites of interacting protein pairs, along
with deletions of motifs or regions, resulting in 2-4 mutations per
motif and domain. In total, we designed 55 mutations that fall into the
predicted DMIs and likely disrupt the interaction.
Next, I cloned the designed mutations using site-directed mutagenesis
adapted to medium throughput (see Appendix 5.1.2) and successfully
cloned 44 mutants, 18 for domain and 27 for motif validation. The
expression of the mutated proteins was tested and compared to that of
the wild-type proteins (see Appendix, Figure 5.1). Mutants with low
expression (e.g. the motif deletion of LITAF) might interfere less or
not at all with the protein, potentially leading to false negative
results. Consequently, these low-expressing mutants were excluded from
further validation.
The successfully cloned and expressed candidate proteins and their
partners were then used for DMI validation in the BRET saturation
assay (Trepte et al. 2018; Lee et al. 2024). In this assay, I
performed a donor saturation experiment with the wild-type and mutated
constructs, in which a constant amount of NL-candidate ORF construct
(1 or 2 ng), encoding the NL-fused protein, was co-transfected with
increasing amounts of mCit-partner ORF construct (12.5, 25, 50, 100,
200 ng), encoding the mCitrine-fused protein, for a total of 6
measuring points. With increasing concentration of the acceptor
protein, the BRET signal should increase until it reaches a saturation
value called maximum BRET. This saturated BRET value is reached when
all donor molecules interact with acceptor molecules.
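The shape of such a saturation curve, and how a BRET50 value can be read from it, can be sketched as below. The sketch assumes a simple one-site hyperbolic model, BRET = BRETmax * r / (BRET50 + r), where r is the acceptor/donor expression ratio. The model choice, the function names, and the grid-search fit are assumptions for illustration and not the fitting procedure used in this work. The data are synthetic.

```python
def saturation(ratio, bret_max, bret50):
    """One-site hyperbolic saturation model (assumed, see text)."""
    return bret_max * ratio / (bret50 + ratio)


def fit_saturation(ratios, brets, k_lo=1e-3, k_hi=1e3, steps=2000):
    """Least-squares estimate of (bret_max, bret50).

    BRET50 is scanned on a log-spaced grid between k_lo and k_hi.
    For each candidate, the best bret_max follows in closed form
    because the model is linear in bret_max.
    """
    best = None
    for i in range(steps + 1):
        k = k_lo * (k_hi / k_lo) ** (i / steps)
        shape = [x / (k + x) for x in ratios]
        bmax = (
            sum(y * s for y, s in zip(brets, shape))
            / sum(s * s for s in shape)
        )
        sse = sum((y - bmax * s) ** 2 for y, s in zip(brets, shape))
        if best is None or sse < best[0]:
            best = (sse, bmax, k)
    return best[1], best[2]


# Synthetic, noise-free example (hypothetical values, not thesis data).
ratios = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
brets = [saturation(x, 0.5, 2.0) for x in ratios]
bret_max, bret50 = fit_saturation(ratios, brets)
```

Under this model, BRET50 is the acceptor/donor ratio at which half of the maximum BRET is reached, so a mutation that weakens binding would be expected to shift BRET50 to higher values.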
3.3.2 Testing the localization of the wild-type proteins and
mutants using Bioluminescence Imaging
As mentioned in section 1.5, the disruption of a protein-protein
interaction may result from mislocalization of the mutant rather than
from a direct effect of the mutation on the interaction. One of the
advantages of the BRET assay is that the tags used for interaction
testing can also be used to monitor protein localization within the
cell, and the BRET signal can even be visualized in live cells via
bioluminescence imaging (BLI) (Goyet et al. 2016; Kobayashi et al.
2019). It was also shown that BLI can be scaled up using high-content
screening (HCS) microscopy (J. Kim et al. 2016). Thus, with the
support of the microscopy core facility at IMB, we performed BLI in a
96-well plate format on an HCS microscope, the Opera Phenix.
To do this, we selected some of the mutants used for DMI validation
(TGIF1_24_28del, DMRTB1_21_25del, CPSF6_323_327del,
FAM167A_3_9del) as well as patient variants (DMRTB1_R25H,
WWOX_H37D, LITAF_Y61D, FAM167A_V8M) that showed an effect on the
binding affinity of the interactions compared to the wild-type (see
subsection 3.3.3). The selected mutants and variants, paired with
wild-type partners at a ratio of 10:10 ng, were transfected into
pre-seeded U2OS cells in a 96-well plate using Fugene as the
transfection agent. After transfection, cells were incubated for 24
hours. The following day, DRAQ5 and CellMask dyes were applied to stain
the nucleus and cytoplasm, respectively (data not shown), and the cells
were imaged immediately using the Opera Phenix system. Initially,
fluorescence was imaged in each well. To detect luminescence,
furimazine substrate (from the Nano-Glo kit) was then added to the
wells; NanoLuc luciferase oxidizes furimazine, producing the
luminescence that is detected.
Below, I will first discuss the results of validating predicted inter-
faces and microscopy data. For the negative controls, which lack re-
solved structures, we employed the AF-MM fragmentation approach
(see Chapter 2, Article II) to predict potential interfaces. This
method helps us infer interaction sites in the absence of structural data,
providing insights into the validity of our predictions and the reliability
of the negative controls.
3.3.3 Validation of DMI predictions
Experimental validation of interfaces involving CTBP1 inter-
actions
CTBP1 is a transcriptional co-repressor. Unlike many transcription fac-
tors, CTBP1 does not directly bind DNA (Filograna et al. 2024;
Valente et al. 2013). Instead, it interacts with transcription fac-
tors through a hydrophobic cleft in its substrate-binding domain, which
recognizes the PxDLS motif. This cleft is crucial for recruiting other
corepressor components such as histone deacetylases (HDACs), methyl-
transferases (HMTases), and additional transcriptional repressors neces-
sary for its repressor activity (Filograna et al. 2024; Valente et al.
2013). For the CTBP1 candidate, we cloned partners that share the same
interface class, LIG_CtBP_PxDLS_1: TGIF1 and IKZF1 with known DMIs
and DMRTB1 with a predicted interface. CTBP2 was cloned as a negative
control, as this interaction likely occurs through a domain-domain
interface (DDI).
CTBP1-TGIF1
CTBP1 binds to the PLDLS motif of the transcription factor TGIF1
(Figure 3.4 A). This interface has been functionally studied and
annotated in ELM (Melhuish 2000), but no crystal structure is
available. The predicted AF-MM structure, with a high confidence score
of 0.8, suggests that the proline at position 24 of TGIF1 fits well
into the hydrophobic pocket of CTBP1. Furthermore, the two leucines
contribute to beta augmentation, allowing the side chains of the motif
to enter a deep hydrophobic groove. In addition, a negatively charged
aspartate is in proximity to a phenylalanine, a non-polar hydrophobic
residue (Figure 3.4 B). This suggests that the aromatic ring of the
phenylalanine might be involved in a pi-stacking interaction that
stabilizes the interface.
I mutated residues A41 and C27 in the CTBP1 binding pocket, which
interact directly with the motif, as well as residue K54 (K54A), which
is located away from the motif and is unlikely to affect the binding
(Figure 3.4 B). Additionally, we deleted the motif from TGIF1, which
is expected to disrupt the interaction completely (Figure 3.4 A). The
BRET data showed that mutations A41D, C27E, and C27D in CTBP1
completely disrupted the interaction with TGIF1 (Figure 3.4 E),
whereas the K54A mutation did not disrupt the binding. The deletion of
the motif in TGIF1 led to the loss of interaction. The expression data
are shown in the Appendix (Figure 5.2 A). The microscopy data suggest
that the mutant with the removed motif is localized similarly to the
wild-type (Figure 3.5).
Figure 3.4: DMIs mediating CTBP1-centric PPIs. (A) Schematic illustration
of CTBP1 interactions mediated by LIG motif type DMIs (green), either predicted
(arrow end pointing to the motif) or known (half-circle end pointing to the motif),
and of the negative control interaction (gray) mediated by a different interface,
a DDI. (B-D) AF-MM-predicted interface structures of CTBP1 with TGIF1
(B, known DMI), IKZF1 (C, known DMI), and DMRTB1 (D, predicted DMI). The
interacting CTBP1 domain (gray) with highlighted residues (blue), mutated for do-
main validation, is shown. Motifs (green) and flanking regions (white) are indicated
for each interaction. (E-G) Experimental confirmation of known DMIs (E, CTBP1-
TGIF1), (F, CTBP1-IKZF1) and validation of putative DMI (G, CTBP1-DMRTB1)
using saturation assays, with BRET measured as a function of acceptor/donor ex-
pression ratio. The left panels show saturation curves for wild-type CTBP1 and
single-point mutants (A41D, C27E, C27D, K54A) for domain validation. The right
panels display binding curves for the wild-type partner proteins (TGIF1, IKZF1,
and DMRTB1) and their mutants with deleted motifs. (H-I) Predicted structure
of the negative control PPI using the AF-MM fragmentation approach (H), where
CTBP1 (gray) and CTBP2 (dark gray) interacting domains are shown, with CTBP1
residues (blue) mutated for domain validation and experimental validation of the
CTBP1 domain being part of the DDI interface between CTBP1 and CTBP2 (I).
CTBP1-IKZF1
Another notable interaction involves CTBP1 and the PEDLS motif of
the transcription factor IKZF1 (Figure 3.4 A). IKZF1 is a DNA-binding
protein that regulates transcription through association with
HDAC-dependent and HDAC-independent complexes. A previous study tested
whether the conserved PEDLS motif in IKZF1 is crucial for this
interaction by creating mutations that either deleted this sequence or
altered its core amino acids (Koipally 2000). The mutated IKZF1
proteins failed to bind CTBP1, confirming the importance of the PEDLS
motif for their interaction. Similar to TGIF1, the motif was predicted
to fit the cleft of the CTBP1 domain (Figure 3.4 C). However, the
negatively charged glutamate of the IKZF1 motif and a positively
charged lysine on CTBP1 form an electrostatic interaction. Therefore,
we expected that K54A might slightly affect the binding.
Validation with the same CTBP1 mutants and deletion of the IKZF1
motif showed only partial disruption of the interaction. This partial
perturbation suggests that while the PEDLS motif is crucial, other
factors may also contribute to binding stability (Figure 3.4 F). For
example, IKZF1 contains zinc finger domains essential for DNA binding
and dimerization (Figure 3.4 A), and it might still interact with
CTBP1 through these domains. This hypothesis needs to be tested in
additional downstream experiments, such as mutating the zinc finger
domains of IKZF1. The microscopy data indicate that the localization
of the IKZF1 mutant with the removed motif is similar to that of the
wild-type protein (Figure 3.5 B).
[Figure 3.5 panels, each shown as NL, mCit, and merged channels: (A) NL-CTBP1 with mCit-TGIF1 and mCit-TGIF1_153_157del; (B) NL-CTBP1 with mCit-IKZF1 and mCit-IKZF1_34_38del; (C) NL-CTBP1 with mCit-DMRTB1 and mCit-DMRTB1_174_178del]
Figure 3.5: The localization of wild-type and mutants. Bright-field mi-
croscopy image of U2OS cells showing luminescence (magenta) indicating the pres-
ence of NL-CTBP1 and fluorescence intensity (cyan) of mCit-TGIF1. The images
depict the localization of the wild-type proteins (top panel) and the mutant with
the removed motif (bottom panel) relative to the wild-type. Scale bar = 10 µm.
CTBP1-DMRTB1
The same domain of CTBP1 was predicted to bind a PLDLR motif of
DMRTB1 with a high score of 0.881. This motif is annotated in ELM, but
for murine HDAC9; the potential interface between CTBP1 and DMRTB1 has
not been reported so far (Figure 3.4 A). The AF-MM structure also
looks very promising. The PLDL part of the motif fits the hydrophobic
pocket of CTBP1 similar to the known SLiM instances mentioned earlier,
while the positively charged arginine residue of the motif and a
glutamate on the domain form salt bridges that contribute to the
stabilization of the interaction (Figure 3.4 D). The BRET data support
the prediction, with domain mutations significantly reducing the
binding and the deletion of the motif leading to the loss of
interaction (Figure 3.4 G). We also showed that the DMRTB1 mutant is
localized similarly to the wild-type protein (Figure 3.5 C), while the
expression data show that the expression of the DMRTB1 mutant is
slightly higher compared to the wild-type (see Appendix, Figure 5.2 A).
CTBP1-CTBP2
As a negative control, we confirmed the PPI of CTBP1 with CTBP2.
Although CTBP1 and CTBP2 proteins share 78% amino acid identity
and 83% similarity, there are slight differences in their sequences that
contribute to their distinct functions (Ding et al. 2020). For example,
CTBP2 has a nuclear localization signal at the N-terminus but lacks
a PDZ-binding domain. Previously, it was demonstrated that both
CTBP1 and CTBP2 contain an NADH-dependent homo- and hetero-
dimerization domain, which facilitates dimerization in response to in-
creased NADH levels (Figure 3.4 A). This dimerization further pro-
motes the nuclear retention of CTBP1.
Currently, there is no resolved structure available for the interac-
tion interface of CTBP1 and CTBP2. To address this, we employed
a fragmentation approach using AF-MM. The AF-MM prediction was
consistent with previous studies, suggesting that the domains of CTBP1
interact with CTBP2 to form a dimer (Figure 3.4 H). The BRET assay
indicated that single-point mutations in the PxDLS-binding cleft do not
disturb the binding, suggesting that this cleft is not essential for the
dimerization of CTBP1 and CTBP2, which is also in line with the pre-
dicted structural model (Figure 3.4 I).
Experimental validation of interfaces involving WWOX inter-
actions
WWOX is a putative oxidoreductase. It has two WW domains (WW-1 and
WW-2) that mediate many interactions, a nuclear localization signal
(NLS), and an SDR (short-chain dehydrogenase/reductase) domain involved
in metabolism. The DMI predictor predicted that WWOX binds via the
same interface types, the LIG_WW_1 and LIG_WW_3 SLiM classes, to LITAF
(two known DMIs), CPSF6 (three interfaces), and DAZAP2 (one DMI).
Additionally, HOXA1, CSNK2B, and SNRPC were used as negative control
partners.
WWOX-LITAF
LITAF plays a role in endosomal protein trafficking and targets proteins
for lysosomal degradation (Lee et al. 2011). It contains two short
PPSY motifs at the N-terminus and an SLD domain with a hydrophobic
cysteine-rich core region anchored to the lysosomal membrane
(Figure 3.6 A). Previously, it was found that the WW-1 domain binds
specifically to the PPSY motifs in LITAF, whereas the WW-2 domain does
not (Ludes-Meyers et al. 2004).
Using AF-MM, we predicted a high-confidence (0.8) structural model
of the interface between the first motif (20-23) and the tandem WW
domains. The structure suggests that this motif is recognized by WW-1
(Figure 3.6 B (i)). We also predicted the structure (with the same
confidence score) of the second known interface, between the second
PPSY motif (58-61) and the tandem WW domains. Similarly, the second
motif prefers interaction with the WW-1 domain (Figure 3.6 B (ii)).
The proline and tyrosine residues of the motif fit into the pocket of
WW-1, which contains tryptophan and tyrosine residues (Figure 3.6 B).
The prediction indicates that both motifs might interact with the
WW-1 domain and that they bind in the same manner, suggesting
multivalency, where multiple interactions occur between motifs of
identical sequence and one domain. To confirm these interfaces, we
designed mutations of residues on the WW-1 domain and on the motifs
(Figure 3.6 B). In addition, I generated deletions of each motif
separately and of the N-terminal part. However, their expression levels
were below the threshold (see Appendix Figure 5.2 B), and we excluded
these deletions from further study.
Experimental validation showed partial disruption of binding with
mutations Y33H, Y33D, and W44K in WW1, while mutations on pro-
lines and tyrosines in the motifs had varying effects. Replacement of ty-
rosines in motifs with aspartate completely disrupted binding (Figure
3.6 F).
In contrast, Ludes-Meyers et al. (2004) demonstrated that mutating
tyrosine to alanine in the first motif significantly reduces binding,
while mutating tyrosine to alanine in the second motif does not affect
binding. Alanine is a small, non-polar amino acid that lacks the
aromatic side chain of tyrosine. The loss of this aromatic interaction
might significantly reduce but not eliminate the binding, as observed
in the study by Ludes-Meyers et al.
Figure 3.6: DMIs mediating WWOX-centric PPIs (A) Schematic illustration
of PPIs mediated by DMIs. The edge ending points towards the predicted
motif, where the arrow implies a predicted DMI, while the half-circle
points to the known DMI and gray indicates an interaction mediated by a
different interface, where the question mark means that this interface
was predicted by AF-MM using the fragmentation approach. (Bi) Predicted
interface interaction structure of the WW1 domain
with the first PPSY motif in the WWOX-LITAF interaction. The structure high-
lights mutated residues on the domain (in blue) and on the motif (in green), with
arrows pointing to these residues. (Bii) Predicted interface interaction structure
of the WW-1 domain with the second PPSY motif of WWOX-LITAF interaction.
(Cii) Predicted structure illustrating the second motif on CPSF6 and tandem WW
domains as shown in A scheme. (Ciii) Predicted interface interaction structure of
the WW-1 domain with the third motif of WWOX-CPSF6 interaction. (D) The
putative model of the motif on DAZAP2 and tandem WW domains. (E) Predicted
interface of the negative control PPI. (F) Experimental confirmation of known DMIs
of WWOX-LITAF using BRET saturation assay. (G) Experimental validation of
predicted DMIs of WWOX-CPSF6 using BRET saturation assay. (H) Experimental
validation of putative DMI of WWOX-DAZAP2 using BRET saturation assay. (I)
Experimental validation of whether the domain is involved or not in the interface of
the negative control.
WWOX-CPSF6
DMI predictions identified three potential interfaces between WWOX
and CPSF6: one with the PPPY motif and two with the TPPRP and
FPPRP motifs located at the C-terminus of CPSF6 (Figure 3.6 A). One
study identified several novel interactions involving WW domains using
mass spectrometry and found that CPSF6 is associated with the WW-1
domain of WWOX. The authors further investigated whether specific
proline-based peptide motifs are present in proteins bound by WW
domains and found that CPSF6 contains a PPPY motif as a potential
interface between WWOX and CPSF6. However, this interface was not
validated in that study (Ingham et al. 2005).
AF-MM predictions indicated that the FPPRP motif binds to the
WW2 domain (Figure 3.6 C (ii)), while the PPPY motif is positioned
between the two WW domains (Figure 3.6 C (iii)). BRET experiments
showed that single-point mutations of residues in the WW-1 domain
significantly disrupted the binding (Figure 3.6 G, right panel).
Moreover, the mutant with the FPPRP motif removed from CPSF6
partially disrupted the binding, similar to the effect of the mutant
with the deleted third motif, PPPY (Figure 3.6 G, left panel). Given
our predictions and experimental data, one might speculate that a
mutant with all motifs removed might completely disrupt the binding.
WWOX-DAZAP2
The same DMI interface of the LIG_WW_1 class was predicted by the
DMI predictor tool, with a DMI match score of 0.9, for the tandem WW
domains of WWOX and the N-terminal PPAY motif in DAZAP2
(Figure 3.6 A). The AF-MM model of the interface proposes that the
PPAY motif fits well into the hydrophobic groove formed on the WW-1
domain. In the WW1 domain, the tryptophan residue (W44) and the
tyrosine residue (Y33) are positioned in a way that allows aromatic
stacking with the proline residues of the PPAY motif (Figure 3.6 D).
The involvement of the WW-1 domain in the predicted interface was
experimentally validated by the reduced binding of the domain mutants
(Figure 3.6 H, left panel). The deletion of the predicted motif in
DAZAP2 slightly affected the interaction with WWOX (Figure 3.6 H,
right panel).
WWOX-HOXA1
I used HOXA1 as a negative control, assuming the interaction is
mediated via a different interface (Figure 3.6 A). The DMI predictor
tool returned no DMI prediction for this interaction. HOXA1 does not
contain a PPxY (where x represents any amino acid), PPLP or xPPRx
motif recognized by WW domains. Using the fragmentation approach, my
colleague predicted a potential interface. The tandem WW domains of
WWOX are modeled with the disordered region 294-302 of HOXA1 with
moderate confidence (pLDDT 73) (Figure 3.6 E). Here, the region
294-300 (PISPATP) of HOXA1 matches the regex of LIG_SH3_3, a motif
class that binds to SH3 domains. According to the predicted structure,
two prolines at the C-terminus of the peptide stack nicely with
aromatic side chains (W and Y) from the WW domains, in a similar way
as motifs of the LIG_WW_1 class. However, BRET experiments showed
that WW1 mutants did not change the binding with HOXA1, meaning that
this domain might not be involved in the interaction (Figure 3.6 K).
These data also illustrate the limited specificity of AF-MM.
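The absence of WW-binding motifs in a sequence can be checked with a
simple pattern scan. The sketch below is only an illustration: the three
patterns are simplified renderings of the PPxY, PPLP and xPPRx motif
types named above, not the curated ELM regular expressions, and the
function name `scan_ww_motifs` is hypothetical.

```python
import re

# Simplified, illustrative patterns for the motif types named in the text.
# Assumption: these are NOT the curated ELM regular expressions.
WW_MOTIF_PATTERNS = {
    "PPxY": "PP.Y",
    "PPLP": "PPLP",
    "xPPRx": ".PPR.",
}


def scan_ww_motifs(sequence):
    """Return {motif type: [(1-based start, matched peptide), ...]}."""
    hits = {}
    for name, pattern in WW_MOTIF_PATTERNS.items():
        # The lookahead reports overlapping matches as well.
        found = [(m.start() + 1, m.group(1))
                 for m in re.finditer(f"(?=({pattern}))", sequence)]
        if found:
            hits[name] = found
    return hits


# Peptides quoted in the text
print(scan_ww_motifs("PPSY"))     # LITAF motif
print(scan_ww_motifs("FPPRP"))    # CPSF6 motif
print(scan_ww_motifs("PISPATP"))  # HOXA1 region 294-300
```

With these simplified patterns, the HOXA1 peptide PISPATP yields no hit,
whereas the LITAF, CPSF6 and DAZAP2 motifs quoted in this section are
assigned to the PPxY or xPPRx type.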
WWOX-SNRPC
Another negative control in BRET is WWOX and its partner SNRPC
(see Appendix). However, we had a DMI prediction that scored slightly
below the cutoff, in which the LIG_WW_3 motif within the proline-rich
region of SNRPC was predicted to bind to the WW1 domain. The
titration experiments showed that the mutations in the domain, the
deletion of the potential motif, and the deletion of the whole
proline-rich region all left the interaction intact. Under these
conditions, we could not validate this interface prediction (see
Appendix, Figure 5.3 C & D). We also tried the fragmentation
approach using AF-MM, where the only promising prediction that passed
the cutoff is an ordered-ordered pair. The prediction involves the Zn
finger of SNRPC and the C-terminal SDR domain of WWOX (see Appendix,
Figure 5.3 A), but we did not test this prediction experimentally.
Given these findings, the interface prediction returned by the DMI
predictor, suggesting that the GPPRP motif of the LIG_WW_3 class binds
to the WW domains, is likely wrong. This motif is recognized by group
III WW domains, whereas WWOX contains WW domains from group I. Our
structural data also showed that the predicted .PPR. motifs of the
LIG_WW_3 class are positioned away from the binding groove. These data
also point to the inability of the DMI predictor to discriminate
domain preferences within a domain class.
WWOX-CSNK2B
This PPI was also annotated as the negative control. Similar to HOXA1,
CSNK2B does not have proline-rich stretches and we did not have any
DMI predictions. AF-MM prediction was not done. BRET signals of
the mutant did not differ from the wild-type protein titration results
(see Appendix 5.3 E & F).
Experimental validation of interfaces involving IQCB1 inter-
actions
IQCB1 contains a tyrosine phosphorylation site, a coiled-coil region, and
three helical calmodulin-binding motifs. The calmodulin-binding motif
is a ligand type motif with the consensus [I,L,V]QxxxRGxxx[R,K] with
characteristic residues being a hydrophobic residue at position 1, highly
conserved glutamine at position 2, basic charges at positions 6 and 11,
and a variable glycine at position 7. Two of these motifs are known and
also annotated in the ELM database (321-336, 391-407) (X. Luo et al.
2005), whereas the third motif (298-314) was predicted by the DMI
predictor (Figure 3.7 A). The DMI tool predicted that these motifs
interact with the EF-hand repeat domains of the CALM1 and CALML3
proteins.
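The consensus given above translates directly into a regular
expression. The following sketch is illustrative only: the relaxed
pattern (any residue at position 7) is an assumption derived from the
description of the glycine as variable, and the toy sequence and the
function name `find_iq_motifs` are hypothetical.

```python
import re

# Consensus from the text: [I,L,V]QxxxRGxxx[R,K]
STRICT_IQ = re.compile(r"(?=([ILV]Q.{3}RG.{3}[RK]))")
# Relaxed variant (assumption): the glycine at position 7 is described
# as variable, so any residue is accepted at that position.
RELAXED_IQ = re.compile(r"(?=([ILV]Q.{3}R.{4}[RK]))")


def find_iq_motifs(sequence, strict=True):
    """Return (1-based start, 1-based end, peptide) for each match."""
    pattern = STRICT_IQ if strict else RELAXED_IQ
    return [(m.start() + 1, m.start() + len(m.group(1)), m.group(1))
            for m in pattern.finditer(sequence)]


# Hypothetical toy sequence, not an IQCB1 fragment
toy = "AAIQASFRGHITRKAALQKAARAEELKGG"
print(find_iq_motifs(toy))                # strict consensus
print(find_iq_motifs(toy, strict=False))  # glycine position relaxed
```

In the toy sequence, the strict pattern reports one match and the
relaxed pattern additionally reports a second one that lacks the
glycine at position 7.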
IQCB1-CALM1
The DMI predictor gave a high DMI match score of 0.9 and found
these motifs potentially interacting with the tandem EF-hand domains
of CALM1. Upon binding of four Ca2+ ions through its EF-hand motifs,
CALM1 changes its conformation from a closed form to an open one,
exposing a hydrophobic surface capable of interacting with different
target proteins. AF-MM predicted the interface of the third motif and
the tandem EF-hand domains (Figure 3.7 B). The predicted model
suggests that the IQCB1 motif is tightly wrapped and embedded within
the binding pocket of CALM1. Validation experiments were limited
because the IQCB1 clone corresponds to the non-canonical isoform 2,
which lacks the complete first and second motifs. We had one
successful mutant, E120H, for domain validation, which is predicted to
be located away from the IQCB1 motif (Figure 3.7 B (iiia)). However,
experimental data showed that E120H likely did not affect the binding
with IQCB1. In contrast, the deletion of the motif partially reduced
the interaction (Figure 3.7 E), suggesting that while the motif is
important, other factors or regions may also play a role in
maintaining the overall interaction between the proteins.
IQCB1-CALML3
Similar to the DMI mediating the IQCB1-CALM1 interaction, it was
predicted that the EF-hand domain-containing CALML3 likely recognizes
the same motifs of IQCB1 (Figure 3.7 A). The AF-MM prediction showed
a similar outcome (Figure 3.7 C). The BRET data support the
prediction, showing that E85K and the deletion of the motif weakened
the binding compared with the wild-type protein pair (Figure 3.7 F).
Figure 3.7: DMIs mediating IQCB1-centric PPIs (A) Schematic illustration
of PPIs mediated by DMIs. The edge ending points towards the predicted
motif, where the arrow implies a predicted DMI, while the half-circle
points to the known DMI and gray indicates an interaction mediated by a
different interface, where the question mark means that this interface
was predicted by AF-MM using the fragmentation approach. (Biii a&b)
Predicted interface interaction structure of the known DMI, tandem
EF-hand domain in contact with the third motif in the IQCB1-CALM1
interaction. The structure highlights mutated residues on the domain (in
blue) and on the motif (in green), with arrows pointing to these
residues. (Ciii a&b) Predicted interface interaction structure of the
known DMI, tandem EF-hand domain in contact with the third motif in the
IQCB1-CALML3 interaction. The structure highlights mutated residues on
the domain (in blue) and on the motif (in green), with arrows pointing
to these residues. (D) Predicted novel interface of the negative control
PPI using the AF-MM fragmentation approach. (E) Experimental
confirmation of known DMIs of IQCB1-CALM1 using BRET saturation assay.
(F) Experimental confirmation of known DMIs of IQCB1-CALML3 using BRET
saturation assay. (G) Experimental testing of whether the third motif is
involved or not in the interface of the negative control.
IQCB1-MNS1
One of the negative controls is the interaction of IQCB1 with MNS1.
MNS1 has no folded globular region; its monomeric structure shows that
the protein is composed of long helices. The DMI predictor tool
returned no putative DMI. To verify that the motif is not part of the
interface of the interaction with MNS1, I tested the IQCB1 mutant with
the removed motif in pair with wild-type MNS1. Interestingly, BRET
data showed that the deletion of the motif caused an increase in
BRET.
Using the AF-MM fragmentation approach, the predicted model suggests
that helices of IQCB1 potentially bind to the C-terminal disordered
region of MNS1, 292-332 (Figure 3.7 A and D). Despite a very high
predictive score (0.89), manual inspection of the predicted interfaces
of fragments from the same region shows that AF places the fragments
at different sites (not shown). Therefore, this putative interface
might be wrong. The expression data are shown in Appendix 5.4.
Experimental validation of interfaces involving PPP3CA in-
teractions
PPP3CA is a phosphatase of type PP3 (formerly known as calcineurin or
PP2B) that recognizes its substrates via DOC_PP2B motifs. There are
three catalytic subunits (PPP3CA, PPP3CB, PPP3CC) and two regulatory
subunits (PPP3R1, PPP3R2). Upon an increase in Ca2+ levels, it forms a
complex composed of calcineurin A (the catalytic subunit, which is
dependent on calmodulin) and a regulatory Ca2+-binding subunit
(calcineurin B).
PPP3CA-FAM167A
The motif of the DOC_PP2B_PxIxI_1 class in FAM167A (3-9 aa) is
predicted by our DMI predictor tool to bind to the calcineurin
(Metallophos) domain in PPP3CA (Figure 3.8 A). The putative AF-MM
structure predicts that the motif forms contacts along the edge of two
beta sheets in the calcineurin PPP3CA (Figure 3.8 C). The results
showed that mutations in the domain reduced the binding, while the
deletion of the motif of FAM167A completely disrupted the interaction;
the expression of the wild-type and mutants was above the cutoff
(Figure 3.8 E). Taken together, these results suggest that FAM167A
might be a substrate of PPP3CA.
PPP3CA-PPP3R2
PPP3CA interaction with PPP3R2 is mediated by different interfaces.
PPP3R2 is the regulatory subunit that binds calcium ions and modulates
the activity of PPP3CA in response to changes in intracellular calcium
levels. PPP3R2 contains EF-hand domains. When intracellular calcium
concentrations rise, calcium binds to these domain repeats in PPP3R2,
inducing conformational changes that activate PPP3CA (Figure 3.8
D).
This PPI serves as a negative control in this study (Figure 3.8
A). Single mutations in the PPP3CA domain did not change the BRET
signal, potentially meaning that these residues of the domain do not
contact PPP3R2 (Figure 3.8 F). Microscopy data suggest that the
deletion of the motif did not change localization compared to the
wild-type (Figure 3.8 B).
Figure 3.8: DMIs mediating PPP3CA-centric PPIs (A) Schematic illustration
of PPIs mediated by DMIs. The edge ending points towards the predicted
motif, where the arrow implies a predicted DMI, while the half-circle
points to the known DMI and gray indicates an interaction mediated by a
different interface, where the question mark means that this interface
was predicted by AF-MM using the fragmentation approach. (B) The
localization of wild-type and mutants. Bright-field microscopy
image of U2OS cells showing luminescence (magenta) indicating the presence of
NL-PPP3CA and fluorescence intensity (cyan) of mCit-FAM167A. The images de-
pict the localization of the wild-type proteins (top panel) and the mutant with
the removed motif (bottom panel) relative to the wild-type. Scale bar = 10 µm.
(C) Predicted interface interaction structure of the predicted DMI, tandem Met-
allophos domain in contact with the motif in the PPP3CA-FAM167A interaction.
The structure highlights mutated residues on the domain (in blue) and on the motif
(in green), with arrows pointing to these residues. (D) Predicted known interface of
the negative control PPI using AF-MM fragmentation approach. (E) Experimental
validation of putative DMIs of PPP3CA-FAM167A using BRET saturation assay.
(F) Experimental testing of whether the domain is involved or not in the
interface of the negative control.
Experimental validation of interfaces involving SPOP interac-
tions
SPOP is a component of the RING-based BCR (BTB-CUL3-RBX1) E3
ubiquitin-protein ligase complex that mediates ubiquitination of
target proteins, leading to their proteasomal degradation. It contains
two globular domains, MATH and BTB. The Cullin E3 ligase binds to the
BTB domain, while the MATH domain directly recruits the substrates of
the E3 ligase complex for ubiquitination. In complex with Cul3, the
binding of SPOP to the motif leads to the proteasomal degradation of
the substrate.
SPOP-RXRB
The DMI tool predicted that the MATH domain might bind to two motifs
of the DEG_SPOP_SBC_1 class in the N-terminal region of the RXRB
protein, with a DMI match score of 0.899. RXRB also contains four Zn
finger repeats and an LBD domain (Figure 3.9 A). There is no solved
structure of this interaction interface. The AF-MM model suggests that
the SPOP and RXRB interface is promising, with the motif docking into
a hydrophobic cleft on the SPOP domain (Figure 3.9 C).
BRET experiments involved testing four mutants in SPOP: G132Q
and F102V core mutants significantly reduced binding, whereas S119L
and R70T edge mutants did not affect the interaction (Figure 3.9 H).
Interestingly, both the deletion of the first motif and the deletion of
both motifs resulted in BRET signals similar to the wild-type interac-
tion, indicating that the interaction remained intact (Figure 3.9 H).
These findings indicate that the predicted motifs of RXRB could not be
verified by motif deletion, and that the prediction of this interface
is likely to be wrong.
Figure 3.9: DMIs mediating SPOP PPIs (A) Schematic illustration of PPIs
mediated by DMIs. The edge ending points towards the predicted motif,
where the arrow implies a predicted DMI, while the half-circle points to
the known DMI and gray indicates an interaction mediated by a different
interface, where the question mark means that this interface was
predicted by AF-MM using the fragmentation approach. (Bi)
Predicted interface interaction structure of the predicted DMI, where the domain
is in contact with the first motif in the SPOP-RXRB interaction. The structure
highlights mutated residues on the domain (in blue) and on the motif (in green), with
arrows pointing to these residues. (Biii) Predicted interface interaction structure of
the predicted DMI, where the domain is in contact with the second motif in the
SPOP-RXRB interaction. The structure highlights mutated residues on the domain
(in blue) and on the motif (in green), with arrows pointing to these residues. (C)
Predicted novel interface of the negative control PPI using AF-MM fragmentation
approach. (D) Experimental validation of putative DMIs of SPOP-RXRB using
BRET saturation assay. (E) Experimental testing of whether the domain and motif
are involved or not in the interface of the negative control.
SPOP-MYD88
Another interaction partner of SPOP is MYD88. This partner has two
globular domains, Death and TIR. A SLiM of the DEG_SPOP_SBC_1 class
was detected in region 12-19 (APVSSTSS) of MYD88, with a
DMIMatchScore of 0.55. This DMIMatchScore is below the 0.7 cutoff that
we set; therefore, this PPI is treated as a negative control
(Figure 3.9 A). However, overlapping fragments that cover the core
binding motif are repeatedly modeled at the same interface with high
confidence, making the interface very likely to be true. The core
motif is likely 13-16 (PVSS).
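The cutoff rule described above can be written as a small filter. This
is a minimal sketch: the scores are the DMI match scores quoted in this
section, the 0.7 cutoff is the one stated in the text, and the function
name `label_by_cutoff` as well as the handling of a score exactly at the
cutoff are assumptions.

```python
DMI_MATCH_SCORE_CUTOFF = 0.7  # cutoff stated in the text

# DMI match scores quoted in this section
scores = {
    "WWOX-DAZAP2": 0.9,
    "IQCB1-CALM1": 0.9,
    "SPOP-RXRB": 0.899,
    "REPS1-TRAPPC2L": 0.883,
    "SPOP-MYD88": 0.55,
}


def label_by_cutoff(score, cutoff=DMI_MATCH_SCORE_CUTOFF):
    """Label a PPI by its score (assumption: a score equal to the cutoff passes)."""
    return "predicted DMI" if score >= cutoff else "negative control"


labels = {ppi: label_by_cutoff(score) for ppi, score in scores.items()}
for ppi, label in labels.items():
    print(f"{ppi}: {scores[ppi]:.3f} -> {label}")
```

Only SPOP-MYD88 falls below the cutoff here, which is why it enters the
study as a negative control even though the structural models point to a
plausible interface.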
Considering the biological function of the proteins, we hypothesized
that this interface might be true. Therefore, we tested the previously
mentioned mutants for domain validation, and the core mutants
significantly perturbed the interaction. On the other hand, we removed
the motif and the N-terminal part of MYD88 and obtained unexpected
findings. The BRET experiments show that the deletion of the
N-terminal part led to a lower BRET signal but enhanced binding
affinity (Figure 3.9 I). Based on our observations, we hypothesize
that the deletion of the N-terminal part of MYD88 might have changed
the spatial arrangement of the proteins, increasing the distance
between the donor and acceptor tags or altering their orientation. An
increased distance between the tags is indicated by a lower BRET
signal. At the same time, this deletion might increase the
accessibility of the binding site for SPOP, leading to enhanced
binding affinity. Later, I found that a previous study had reported
co-IP and ubiquitination assays showing that the MyD88-VSSTS mutant
still binds to SPOP and can be ubiquitinated by SPOP at levels
comparable with those of wild-type MyD88. Moreover, the authors
reported that an SBC-like motif (146-VDSSV-150 aa) located in the
middle of MyD88 is indispensable for the MyD88–SPOP interaction and
SPOP-dependent ubiquitination (Li et al. 2020).
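How a lower BRET signal can coincide with a higher apparent affinity
can be illustrated with a hyperbolic saturation curve, BRET =
BRETmax * x / (BRET50 + x), where x is the acceptor/donor expression
ratio. This is a sketch under assumptions: the one-site hyperbolic model
is a commonly used description of BRET saturation data and is not
confirmed here as the fitting procedure of this work, and all parameter
values below are hypothetical, chosen only to show the two effects.

```python
def bret_saturation(ratio, bret_max, bret50):
    """Hyperbolic one-site saturation curve (assumed model)."""
    return bret_max * ratio / (bret50 + ratio)


# Hypothetical parameters, not fitted to any data from this work
wild_type = {"bret_max": 1.0, "bret50": 2.0}
deletion = {"bret_max": 0.5, "bret50": 0.5}  # lower plateau, lower BRET50

for ratio in (0.5, 2.0, 8.0, 32.0):
    wt = bret_saturation(ratio, **wild_type)
    de = bret_saturation(ratio, **deletion)
    print(f"ratio={ratio:5.1f}  WT={wt:.3f}  deletion={de:.3f}")
```

In this picture, the plateau (BRETmax) depends on the distance and
orientation of the tags, while the expression ratio at half-maximal
signal (BRET50) reflects the apparent affinity, so the two parameters
can change independently.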
Experimental validation of interfaces involving REPS1 inter-
actions
REPS1-NUMB
REPS1 contains tandem repeats of EH domains. EH domains are
exclusively found in proteins that function in endocytosis and
vesicular trafficking and are believed to regulate these processes.
They recognize proteins containing single or multiple NPF
(Asn-Pro-Phe) motifs, like NUMB (Figure 3.10 A). In ELM, the
canonical EH-binding peptide is a strongly conserved NPF motif. NUMB
also contains PID and NUMB domains in the N-terminal and middle parts
of the protein. This interface is known. According to the predicted
AF-MM structure, proline and phenylalanine fit the hydrophobic pocket
of the EH domain very well (Figure 3.10 C). Although BRET
experiments demonstrated that W275A did not significantly affect
binding (Figure 3.10 E), the L271D mutant was destabilised when
co-expressed with wild-type NUMB (Appendix, Figure 5.4 D).
Similarly, the single motif-deletion mutants were not expressed well
in my hands. Therefore, it was hard to draw any conclusions regarding
the interface.
Figure 3.10: DMIs mediating REPS1 interactions (A) Schematic illustrating
REPS1 and its partners and their interactions mediated by interfaces.
The edge ending points towards the predicted motif, where the arrow
implies a predicted DMI, while the half-circle points to the known DMI
and gray indicates an interaction mediated by a different interface,
where the question mark means that this interface was predicted by AF-MM
using the fragmentation approach. (B) Predicted interface interaction
structure of the predicted DMI, where the domain is in contact with the
second motif in the REPS1-TRAPPC2L interaction. The structure highlights
mutated residues on the domain (in blue) and on the motif (in green),
with arrows pointing to these residues. (C) Predicted interface
interaction structure of the known DMI, where the domain is in contact
with the second motif in the REPS1-NUMB interaction. The structure
highlights mutated residues on the domain (in blue) and on the motif (in
green), with arrows pointing to these residues. (D) Experimental
validation of putative DMI of REPS1-TRAPPC2L using BRET saturation
assay. (E) Experimental validation of known DMI of REPS1-NUMB using
BRET saturation assay.
REPS1-TRAPPC2L
It was predicted that the EH domains of REPS1 bind to the LIG_EH_1
motif at residues 112-116 of TRAPPC2L. We had only one prediction,
with a high DMI match score of 0.883. AF-MM modeled the NPF residues
of the motif fitting into the deep pocket of the domain (Figure 3.10
A). Although it is an interesting prediction and the motif is docked
as seen in the known structure 2JXC (not shown), the confidence score
was low (0.6) (Figure 3.10 B).
The L323D mutation in the REPS1 domain, located deeper in the pocket,
slightly reduced the interaction, while W327A, close to the contact
with the edge residues of the motif, did not affect the interface. Of
the mutations in the motif, P114G weakened the binding. However,
F115G and the deletion of the motif did not disrupt the binding
(Figure 3.10 D). It would be interesting to employ the fragmentation
approach to predict a novel interface.
To sum up, we could test 14 out of 20 selected DMIs across 12 PPIs.
Among the 14 tested DMIs, 7 were predicted DMIs: we confirmed both
binding regions for 5 of them (CTBP1-DMRTB1, WWOX-CPSF6 (ii),
WWOX-CPSF6 (iii), WWOX-DAZAP2, PPP3CA-FAM167A) and partially
confirmed 1 (SPOP-RXRB), for its domain region only. Of the 7 known
DMIs, we re-confirmed 5 (CTBP1-TGIF1, CTBP1-IKZF1, WWOX-LITAF (i),
WWOX-LITAF (ii), IQCB1-CALML3 (iii)) and confirmed only the motif
region for 1 (IQCB1-CALM1 (iii)). We also tested 7 negative controls
assumed to be mediated by different interfaces and showed that 5 PPIs
(CTBP2-CTBP2, WWOX-HOXA1, WWOX-CSNK2B, WWOX-SNRPC, PPP3CA-PPP3R2)
might indeed bind through different interfaces, while for the other 2
PPIs (SPOP-MYD88, IQCB1-MNS1) this assumption is likely to be wrong,
and further investigation is required to define the interfaces of
these interactions.
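The counts in this summary can be cross-checked with a short tally. The
sketch below only transcribes the numbers and names listed above; the
grouping labels and the helper `summarize` are hypothetical, and DMIs
that are not named in the summary are kept as a remainder rather than
assigned an outcome.

```python
# Outcomes as listed in the summary paragraph
predicted = {
    "confirmed": ["CTBP1-DMRTB1", "WWOX-CPSF6 (ii)", "WWOX-CPSF6 (iii)",
                  "WWOX-DAZAP2", "PPP3CA-FAM167A"],
    "domain region only": ["SPOP-RXRB"],
}
known = {
    "re-confirmed": ["CTBP1-TGIF1", "CTBP1-IKZF1", "WWOX-LITAF (i)",
                     "WWOX-LITAF (ii)", "IQCB1-CALML3 (iii)"],
    "motif region only": ["IQCB1-CALM1 (iii)"],
}
negative_controls = {
    "different interface": ["CTBP2-CTBP2", "WWOX-HOXA1", "WWOX-CSNK2B",
                            "WWOX-SNRPC", "PPP3CA-PPP3R2"],
    "needs further investigation": ["SPOP-MYD88", "IQCB1-MNS1"],
}
PREDICTED_TESTED = 7  # totals stated in the text
KNOWN_TESTED = 7


def summarize(groups, tested):
    """Count listed entries and the remainder not named in the summary."""
    listed = sum(len(names) for names in groups.values())
    return {"tested": tested, "listed": listed, "remaining": tested - listed}


print(summarize(predicted, PREDICTED_TESTED))
print(summarize(known, KNOWN_TESTED))
print(PREDICTED_TESTED + KNOWN_TESTED, "DMIs tested in total")
print(sum(len(v) for v in negative_controls.values()), "negative controls")
```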
3.4 The application of the strategy of the variant
effect on PPIs
Interaction profile for variants falling in WWOX
Comparative interaction profiling is challenging due to the scarcity
of pathogenic mutations in motifs and the difficulty of crystallizing
disordered regions for functional studies. Many mutations reported in
databases like ClinVar are not functionally validated, and their
classification relies on predictive tools like PolyPhen-2, which have
shown limitations in predicting variant effects (Sahni et al. 2015).
This makes the interpretation of PPI profiling uncertain. We propose
a PPI-centric strategy that incorporates domain-motif interface (DMI)
information and appears suitable to better prioritize and interpret
variants, providing a clearer understanding of their impact on protein
interactions and their contribution to disease. To showcase the
application of our strategy, we characterized the variants selected
for the study.
WWOX, a protein involved in neural development and cancer, was
chosen to explore the impact of specific mutations on its
interactions. Three mutants (two VUS and one pathogenic mutation) were
successfully cloned and experimentally tested (Figure 3.11 A). The
pathogenic mutation, E17K, is documented in ClinVar and was found
in patients with developmental and epileptic encephalopathy (DEE).
However, there is no evidence demonstrating its pathogenicity; the
prediction was made with PolyPhen-2. According to the AF-MM predicted
structures, this mutation in the interacting WW1 domain is not in
contact with the motif, suggesting that it should not disrupt the
interaction at this site (Figure 3.11 B). Indeed, experimental data
confirmed that the BRET signal for the WWOX-LITAF interaction remained
similar to the wild-type, indicating no significant impact on this PPI
(Figure 3.11 D (i)). E17K also did not affect the WWOX-HOXA1,
WWOX-CSNK2B and WWOX-SNRPC interactions (Figure 3.11 D (iv-vi)). The
effect on the interactions with DAZAP2 and CPSF6 could not be observed
due to instability of the mutant during co-expression (see Appendix,
Figure 5.5 B & C), necessitating further investigation (Figure 3.11
D (iii, v)). Overall, this pathogenic mutation did not disrupt the
interface, implying that its clinical impact may involve other
processes or that the mutation is not pathogenic as stated in ClinVar.
The VUS variant E17D demonstrated a similar interaction profile to
the pathogenic variant E17K, showing no significant impact on interac-
tions with CPSF6 and DAZAP2, though it was not tested with SNRPC
(Figure 3.11 D (ii, iii, v)).
[Figure 3.11, panels A-E (image not reproduced). Axis label in D:
acceptor/donor expr. Microscopy panels in E: NL-WWOX, mCit-LITAF,
Merged; rows: NL-WWOX-mCit-LITAF and NL-WWOX_H37D-mCit-LITAF.]
Figure 3.11: The effect of variants falling into the interface using interac-
tion profiling. (A) Schematic illustration of the functional regions within WWOX
with the location of variants, where the color indicates the pathogenic (red) and
VUS (gray). (B) i Predicted interface interaction structure of the WW1 domain
with the first PPSY motif in the WWOX-LITAF interaction. (B) ii Predicted
structure illustrating the second motif on CPSF6 and tandem WW domains. (B)
iii Predicted structure illustrating the third predicted motif on CPSF6 and tandem
WW domains, where the motif is in contact with the second WW domain. (B) iv
The putative model of the motif on DAZAP2 and tandem WW domains. (C) The
schematic PPIs illustrate interaction profiles of wild-type and mutated interaction.
(D) i Experimental assessment of the variants on known DMIs of WWOX-LITAF us-
ing BRET saturation assay. ii Experimental assessment of the variants on putative
DMIs of WWOX-CPSF6 using BRET saturation assay. iii Experimental assess-
ment of the variants on putative DMIs of WWOX-DAZAP2, iv on putative DMIs of
WWOX-HOXA1, v on putative DMIs of WWOX-CSNK2B using BRET saturation
assay. vi Experimental assessment of the variants on putative DMIs of
WWOX-SNRPC using BRET saturation assay. (E) The microscopy experiment
shows the localization of the H37D variant compared to the wild-type
WWOX. The intensity of NanoLuc luciferase-tagged WWOX (wild-type and
mutant) is shown inverted in magenta and that of mCitrine-tagged
wild-type LITAF is shown inverted in cyan. Scale bar = 10 µm.
In contrast, the VUS variant H37D in WWOX, found in patients with
developmental and epileptic encephalopathy and autosomal recessive
spinocerebellar ataxia 12, is located within the DMI interface with
DAZAP2, CPSF6, and LITAF. The substitution of histidine with a
negatively charged aspartate could disrupt interactions by interfering
with a tyrosine residue within the motifs (Figure 3.11 B).
Experimental data confirmed this, as H37D notably disrupted the
interactions with LITAF, DAZAP2, and CPSF6 mediated by this interface
(Figure 3.11 D (i-iii)), while interactions with proteins such as
HOXA1, SNRPC, and CSNK2B remained unaffected (Figure 3.11 D
(iv-vi)). WWOX binds to LITAF, a protein involved in mediating
inflammatory responses and apoptosis. The ability of the WW1 domain to
bind the motifs in these partners may contribute to the regulation of
signaling processes. LITAF is critical for controlling inflammation
and cell death. Such a disruption could hinder WWOX’s regulatory role,
leading to unchecked inflammatory responses or improper cell death
signaling, potentially contributing to disease pathology. DAZAP2 is
involved in RNA processing and signaling pathways that regulate
cellular differentiation and proliferation.
The interaction with WWOX may help modulate these pathways, ensur-
ing an appropriate cellular response to stress and developmental cues.
The disruption of this interaction might lead to altered RNA processing
or dysregulation of signaling pathways, affecting cellular homeostasis
and potentially contributing to developmental disorders. CPSF6 plays
a crucial role in RNA cleavage and polyadenylation, processes essential
for mRNA maturation. The binding of WWOX to CPSF6 could influ-
ence these processes by modulating RNA metabolism and gene expres-
sion regulation. The H37D variant could prevent proper binding of the
WW1 domain to CPSF6, potentially affecting the function of the CPSF
complex. This disruption might have widespread effects on gene expres-
sion, mRNA stability, and cellular response to DNA damage, which are
critical in neurodevelopmental and neurodegenerative diseases.
This result suggests that profiling variants based on shared interac-
tion disruption and DMI interface impact may be an informative ap-
proach to characterizing candidate disease-associated mutations. How-
ever, we cannot exclude the possibility that some expressed mutants
might be partially misfolded or disrupt PPIs by altering protein com-
partmentalization. To test this, we used microscopy to verify whether
mutant constructs alter localization compared to the wild-type protein
(Figure 3.11 E). Our findings indicate that the H37D variant remains
in the same cellular compartment as the wild-type WWOX, suggesting
that the observed interaction disruptions are not likely due to mislocal-
ization.
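The interaction profiles described above can be represented as a small
lookup table and screened for variants that affect only the
DMI-mediated partners. This is a minimal sketch: the outcomes are
transcribed from the text, with `None` marking interactions that were
not determined or not tested, and the label "interface-specific" and the
function name are hypothetical.

```python
DMI_PARTNERS = ("LITAF", "CPSF6", "DAZAP2")
CONTROL_PARTNERS = ("HOXA1", "CSNK2B", "SNRPC")

# Outcomes transcribed from the text; None = not determined / not tested.
# Partners not mentioned for a variant are simply left out.
profiles = {
    "E17K": {"LITAF": "unchanged", "CPSF6": None, "DAZAP2": None,
             "HOXA1": "unchanged", "CSNK2B": "unchanged",
             "SNRPC": "unchanged"},
    "E17D": {"CPSF6": "unchanged", "DAZAP2": "unchanged", "SNRPC": None},
    "H37D": {"LITAF": "disrupted", "CPSF6": "disrupted",
             "DAZAP2": "disrupted", "HOXA1": "unchanged",
             "CSNK2B": "unchanged", "SNRPC": "unchanged"},
}


def is_interface_specific(profile):
    """True if all DMI partners are disrupted and all controls unchanged."""
    dmi_lost = all(profile.get(p) == "disrupted" for p in DMI_PARTNERS)
    controls_kept = all(profile.get(p) == "unchanged"
                        for p in CONTROL_PARTNERS)
    return dmi_lost and controls_kept


for variant, profile in profiles.items():
    print(variant, is_interface_specific(profile))
```

Under this rule, only H37D is flagged, matching the observation that it
perturbs the interactions mediated by the WW1 interface while leaving
the negative control partners unaffected.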
Interaction profiles of variants found in IQCB1
Mutations found in IQCB1 are mainly associated with retinal disorders
such as Senior-Loken syndrome 5 and Leber congenital amaurosis 10
(LCA 10). Many variants in IQCB1 are of uncertain significance, and
the available evidence is currently insufficient to determine the
definitive role of these variants in the disease.
For example, the R404G variant has been identified in patients with
nephronophthisis and other inborn genetic diseases and is classified as
a variant of uncertain significance (VUS). Algorithms developed to
predict the effect of missense changes on protein structure and function,
such as PolyPhen-2 ("Probably Damaging"), do not consistently agree on
the potential impact of this missense change. This variant has not been
reported in the literature in individuals affected with IQCB1-related
conditions and is not present in population databases (e.g., ExAC shows
no frequency for this variant). The arginine residue (R404) is highly
conserved and is predicted to be positioned deep within the domain of
CALM1/CALML3 (Figure 3.12 B (i)). A change of the residue at this
position might affect binding affinity and specificity, disrupting
critical protein-protein interactions (PPIs). Another variant of
uncertain significance, N406Y (Figure 3.12 A), reported in ClinVar,
falls within the same motif and could disrupt PPIs similarly
(Figure 3.12 B (i)). As expected, the experimental data indicate a
slight reduction in binding affinity for the interactions with both
CALM1 and CALML3 (Figure 3.12 D (i)).
Interestingly, the perturbing effect of these mutants was more pro-
nounced when co-expressed with motif-binding CALML3 (Figure 3.12
D (i, top right)), suggesting a differential impact on binding efficiency
between the two calmodulin-like proteins. This differential impact may
reflect variations in the structural conformation or binding dynamics
between CALM1 and CALML3, which could influence the pathophysi-
ological consequences of these mutations.
IQCB1 is involved in several cellular processes, including cilia func-
tion and protein trafficking, which are crucial for maintaining pho-
toreceptor cell integrity in the retina. The disruption of interactions
with CALM1 and CALML3 due to mutations like R404G and N406Y
could impair calmodulin-mediated signaling pathways, leading to defec-
tive cilia assembly or maintenance. This disruption might contribute to
the pathology observed in retinal degenerative diseases such as Senior-
Loken syndrome 5 and Leber congenital amaurosis 10. Moreover, altered
calmodulin interactions could affect calcium homeostasis and cellular
stress response, further exacerbating disease progression in affected
individuals.
[Figure 3.12, panels A-D. Labeled variants: N406Y, R404G, M110L, E105K,
F90L, D94H, R87H, R58W, A89D. Plot axis labels: acceptor/donor expr, BRET.]
Figure 3.12: The effect of variants falling into the motif of IQCB1 using
interaction profiling. (A) Schematic illustration of the functional regions within
IQCB1 with the location of variants; VUS are shown in gray. (B) i Predicted
interface structure of the EF-hand domain of CALM1 in contact with the third
motif. The structure shows the predicted interface and VUS variants (gray).
ii Predicted structure illustrating the second motif on the EF-hand domain of
CALM1 in contact with the third motif. The structure shows the predicted
interface with pathogenic (red) and VUS (gray) variants. iii Zoomed-out view of
the predicted structure shown in ii. iv Predicted interface structure of the
EF-hand domain of CALML3 in contact with the third motif. The structure shows
the predicted interface and VUS variants (gray). (C) i Schematic of the
interaction profiles of wild-type and mutated IQCB1. ii Schematic of the
interaction profiles of wild-type and mutated CALM1 and CALML3. (D) i
Experimental assessment of the variants on known DMIs of IQCB1-CALM1 (left)
and IQCB1-CALML3 (right) using BRET saturation assay. ii Experimental
assessment of the CALM1 pathogenic (right) and VUS (left) variants on putative
DMIs of IQCB1-CALM1 using BRET saturation assay. iii Experimental assessment
of the CALML3 variants on putative DMIs of IQCB1-CALML3 using BRET
saturation assay.
Calmodulin is an essential calcium-sensing, signal-transducing protein.
Three calmodulin genes, CALM1, CALM2, and CALM3, have unique nucleotide
sequences but encode identical calmodulin proteins with four EF-hand
calcium-binding domains. Calcium-induced activation of calmodulin
regulates many calcium-dependent processes and modulates the function
of cardiac ion channels. F90L is a pathogenic variant found in patients
with long QT syndrome 14 and documented in ClinVar. The substitution
occurs at a highly conserved residue between EF-hand domains II and III.
Its pathogenicity has not been studied functionally. Consistent with the
position of the residue at the hydrophobic clutch of the EF-hand domains
(Figure 3.12 B (ii)), the experimental data showed that F90L disturbed
the binding (Figure 3.12 D (ii)), while the other pathogenic variant,
E105K, located outside of the interface (Figure 3.12 C (ii)), did not
have any effect (Figure 3.12 D (ii)). This variant occurred de novo in a
patient submitted for whole exome sequencing, and there is no functional
evidence for it. Although the expression of this mutant is very high
(see Appendix, Figure 5.7 A), the substitution might partially
destabilize the protein, causing the pathogenic effect. Alternatively,
the variant might not be pathogenic, and the in-silico analysis reported
in ClinVar might be incorrect. Although E105 is outside the direct
binding interface, it is part of the hydrophobic clutch that mediates
the interaction between the EF-hand domains. A disruption here could
impair the coordinated movement and proper orientation of these domains,
reducing the ability of calmodulin to expose the hydrophobic patches
needed to bind target proteins effectively.
We also tested three VUS variants: M110L in CALM1 was predicted to be
in the domain, whereas D94H and R87H were predicted to be outside of
the interface (Figure 3.12 D (iii, iv)). Surprisingly, M110L caused
only a slight reduction, while the D94H variant, found in patients with
catecholaminergic polymorphic ventricular tachycardia 4 and long QT
syndrome 14, significantly affected the IQCB1-CALM1 interaction
(Figure 3.12 D (ii)). The VUS A89D in CALML3 also showed an effect on
the interaction with IQCB1 (Figure 3.12 D (iii)).
Interaction profile for variants falling in SPOP
We also tested the effect of pathogenic variants located in the MATH
domain of SPOP on the interaction with its partners RXRB and MYD88
(Figure 3.13 B (i-ii)). As expected, the mutants perturbed the
interaction with the predicted motif on RXRB (Figure 3.13 D (i)).
Interestingly, the Y87C variant did not affect BRET with MYD88
(Figure 3.13 D (ii)). However, the predicted interface with MYD88 might
not be correct, as suggested previously in the literature and in this
study. In that case, these mutants would not lie on the true interface
with MYD88 and would be expected to have no effect on binding. In
agreement with this assumption, the VUS variants also did not change
the interaction with MYD88.
[Figure 3.13, panels A-D. Labeled variants: G132V, Y87C, P13R.
Plot axis labels: acceptor/donor expr, BRET.]
Figure 3.13: The effect of variants falling into the motif of SPOP using
interaction profiling. (A) Schematic illustration of the functional regions within
SPOP with the location of variants in the MATH domain; pathogenic variants
are shown in red. (B) i Predicted structure of the MATH domain of SPOP in
contact with the motif of RXRB. ii Negative control interaction SPOP-MYD88
with a novel interface predicted using the fragmentation AF-MM approach. The
predicted model shows the MATH domain bound to the N-terminal motif of MYD88,
with the pathogenic variants in the MATH domain. iii The same structure with
the VUS variant in the predicted motif of MYD88. (D) i Experimental assessment
of the variants on the known DMI of SPOP-RXRB using BRET saturation assay.
ii Experimental assessment of the effect of pathogenic SPOP variants on the
interaction with MYD88. iii Experimental assessment of the effect of VUS
variants in the motif of MYD88 on the interaction with SPOP.
The effect of variants sitting on motif
In addition, we tested successfully cloned mutants located in the motifs
of our candidate partner proteins LITAF, IKZF1, DMRTB1, FAM167A,
DAZAP2, CPSF6 and TRAPPC2L. VUS variants found close to the first
PPSY motif and in the second PPSY motif of LITAF (Figure 3.14 B (i-ii))
do not disrupt the interaction with WWOX (Figure 3.14 C (i-ii)). The
lack of effect could be due to the nature of the substitutions, which
may not be severe enough to change the binding affinity between the two
proteins and therefore cause no noticeable disruption of the
interaction. Further experiments could help clarify the extent of the
impact of these variants on different protein partners.
Additionally, this interaction is maintained by two PPSY motifs and
the WW-1 domain, which can compensate for the loss of a single contact
point, masking the effect of certain variants. It would be interesting
to test the perturbation effect of these VUS on interactions with other
partners that are mediated by the same DMI type but by only one
interface, to determine whether the variants disrupt those interactions
or leave them similarly intact. In addition, the BLI experiment showed
that the Y61D mutant was localized similarly to the wild-type
(Figure 3.14 D).
[Figure 3.14, panels A-D. Labeled variants: A19T, P17L, P58L, P59R, Y61D.
Plot axis label: acceptor/donor expr. Image panels: NL-WWOX, mCit-LITAF,
Merged, for NL-WWOX-mCit-LITAF and NL-WWOX-mCit-LITAF_Y61D.]
Figure 3.14: The effect of variants falling into the motif of LITAF using
interaction profiling. (A) Schematic illustration of the functional regions within
LITAF with the location of the VUS variants on the motifs. (B) i Predicted
structure of the WW-1 domain recognizing the first PPSY motif of LITAF, with
the VUS variants situated close to the motif. ii Predicted structure of the
WW-1 domain recognizing the second PPSY motif of LITAF, with the VUS variants
situated close to the motif. (C) i Experimental assessment of the VUS LITAF
variants at the known first motif using BRET saturation assay. ii Experimental
assessment of the VUS LITAF variants at the known second motif using BRET
saturation assay. (D) The BLI experiment tested the localization of the LITAF
Y61D variant compared to wild-type LITAF. The intensity of NanoLuc
luciferase-tagged WWOX is shown inverted in magenta, and the intensity of
mCitrine-tagged LITAF (wild-type and mutant) is shown inverted in cyan. The
images of both interacting proteins are merged.
The variants located in the flanking regions close to the predicted
motif of IKZF1 (M31V, S41L) showed a slight effect. VUS variants found
in the motifs of DMRTB1 and FAM167A, e.g. R178H (Figure 3.15 B (i-ii))
and V8M (Figure 3.15 C (i-ii)), showed lower BRET compared to
wild-type (Figure 3.15 B (iii) and C (iii)). On the other hand, VUS
located away from the motifs showed BRET results similar to the
wild-type interactions (Figure 3.16).
[Figure 3.15, panels A-D. Proteins: IKZF1, DMRTB1, FAM167A. Labeled
variants: R178H, D55, V8M. Plot axis label: acceptor/donor expr. Image
panels: NL-CTBP1, mCit-DMRTB1, Merged, for NL-CTBP1-mCit-DMRTB1 and
NL-CTBP1-mCit-DMRTB1_R25H; NL-PPP3CA, mCit-FAM167A, Merged, for
NL-PPP3CA-mCit-FAM167A and NL-PPP3CA-mCit-FAM167A_V8M.]
Figure 3.15: The effect of variants falling into the motif of IKZF1,
DMRTB1 and FAM167A using interaction profiling. (A) i Schematic illustration
of the functional regions within IKZF1 with the location of VUS (gray) variants on
motifs. ii Predicted structure of the CTBP1 domain recognizing the first
PEDLS motif in IKZF1, with the VUS variants situated close to the motif.
iii Experimental assessment of the VUS variants on a known motif in IKZF1 using
BRET saturation assay. (B) i Schematic illustration of the functional regions within
DMRTB1 with the location of the VUS (gray) variant on the motif. ii Predicted
structure of the CTBP1 domain recognizing the first PLDLR motif of DMRTB1,
with the VUS variant situated close to the motif. iii Experimental assessment
of the VUS variant in a putative motif in DMRTB1 using BRET saturation assay.
(C) i Schematic illustration of the functional regions within FAM167A with the
location of the VUS (gray) variant on the motif. ii Predicted structure of the
CTBP1 domain recognizing the first PLDLR motif of DMRTB1, with the VUS
variant situated close to the motif. iii Experimental assessment of the VUS
variant on a putative motif in FAM167A using BRET saturation assay.
Here we evaluated the effect of variants on the PPIs mediated by
domain-motif interfaces. We showed that variants located within the
motif region can disrupt interactions, potentially altering the function
of these interactions and contributing to disease development. Moreover,
mutations near the motif region may also slightly affect the interface,
potentially disrupting the biological processes mediated by these inter-
actions.
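The distinction drawn here between variants inside a motif, variants in
its flanking regions, and variants elsewhere can be sketched in a few
lines of Python. This is only an illustration: the class Motif, the
function classify_variant, the flank width of five residues, and the
example coordinates are assumptions made for the sketch and are not
taken from the DMI predictor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Motif:
    """1-based, inclusive residue range of a predicted motif (hypothetical)."""
    start: int
    end: int


def classify_variant(position: int, motif: Motif, flank: int = 5) -> str:
    """Return 'motif', 'flanking' or 'outside' for a variant position.

    `flank` is an assumed number of residues on each side of the motif
    that is treated as flanking region; it is not a value from the thesis.
    """
    if motif.start <= position <= motif.end:
        return "motif"
    if motif.start - flank <= position <= motif.end + flank:
        return "flanking"
    return "outside"


# Illustrative coordinates only: a four-residue motif at positions 58-61.
example_motif = Motif(start=58, end=61)
for pos in (59, 55, 17):
    print(pos, classify_variant(pos, example_motif))
```

In practice the flank width would have to be chosen per motif class,
since the text above only indicates that nearby residues can have a
slight effect.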
[Figure 3.16, panels A-C. Proteins: DAZAP2, CPSF6, TRAPPC2L. Labeled
variants: P383A, Y46C. Plot axis labels: acceptor/donor expr, BRET.]
Figure 3.16: The effect of variants falling into the motif of DAZAP2,
CPSF6 and TRAPPC2L using interaction profiling. (A) i Schematic illustration
of the functional regions within DAZAP2 with the location of the VUS (gray)
variant close to the motif. ii Predicted structure of the WW-1 domain
recognizing the predicted motif in DAZAP2, with the VUS variant situated close
to the motif. iii Experimental assessment of the VUS variant on a predicted motif in
DAZAP2 using BRET saturation assay. (B) i Schematic illustration of the functional
regions within CPSF6 with the location of the VUS (gray) variant close to the third
motif. ii Predicted structure of the WW-1 domain recognizing the third motif of
CPSF6, with the VUS variant situated close to the motif. iii Experimental
assessment of the VUS variant in a putative motif in CPSF6 using BRET saturation
assay. (C) i Schematic illustration of the functional regions within TRAPPC2L with
the location of VUS (gray) variants close to the motif. ii Predicted structure
of the EH domain in REPS1 recognizing the putative motif in TRAPPC2L, with
the VUS variant situated close to the motif. iii Experimental assessment of the
VUS variants on a putative motif in TRAPPC2L using BRET saturation assay.
However, not all mutations within the interface necessarily disrupt
the interaction. Some residues, even when mutated, may not
significantly alter the interface if the substitution does not
substantially change the binding strength. Additionally, some
interactions may be stabilized by multiple interfaces, which can
compensate for the loss of a single contact point, masking the effect
of certain variants (e.g. LITAF-WWOX). On the other hand, the
disruption of an interaction might be caused by partial misfolding or
mislocalization of the mutant rather than by direct interference with
the binding interface. Therefore, additional studies are needed to
confirm these possibilities and to determine whether observed
disruptions are due to changes in protein structure or localization.
Overall, our findings indicate that integrating interaction disruption
profiles with DMI interface information can enhance our understanding
of variant effects in the context of PPIs.
This combined approach allows for a more nuanced characterization
of variants, potentially leading to better identification of
disease-associated mutations and providing deeper mechanistic insights
into their role in disease pathology. However, this strategy can be
further refined to achieve more accurate and controlled results by
taking into account the complexity and number of interfaces that
mediate interactions, the diverse biological processes they influence,
structural conformations, the specific properties of each amino acid at
the contact sites, and the residue it is changed to.
Chapter 4
Conclusion and future perspectives
4.1 Deciphering protein interaction interfaces using
DMI predictor tool
The development of the DMI tool and its application to HuRI annotated
about 3200 protein-protein interactions (PPIs) with high-confidence
putative DMI interfaces (see Chapter 3, section 3.3), providing
valuable insights into the mechanistic functions of these interactions. This
advancement has greatly enhanced our ability to understand how spe-
cific mutations might disrupt interactions, aiding in the characteriza-
tion of variants found in patients. By analyzing how a variant perturbs
PPIs, we can hypothesize its potential contribution to the development
of disease symptoms or aetiology. Such hypotheses can then be tested
through downstream experiments, which is crucial for the advancement
of precision medicine.
Despite these advancements, there is still room for improvement in
the performance of the DMI tool. One issue is its inability to distinguish
between repetitive tandem domains (e.g. RR1 and RR2), which often
appear sequentially within proteins and may serve different functions in
mediating interactions. Incorporating domain-specific annotations and
functional classifications could help differentiate between tandem
repeats by considering their unique roles and sequence patterns.
Advanced pattern recognition methods and contextual analysis could
further refine the sequence analysis.
Another limitation identified during manual analysis is that some
predicted DMIs did not meet the cutoff due to low IUPred scores, despite
the motifs being disordered. This issue is likely due to the window-based
nature of IUPred, where regions adjacent to folded segments are often
predicted as folded. Therefore, optimizing the window size or
incorporating additional prediction tools, for example AF-MM, could
improve the tool's ability to detect likely true motifs.
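The window effect described above can be illustrated with a toy moving
average. The sketch below is not IUPred's algorithm: window_average is a
hypothetical stand-in for any window-based predictor, and the raw scores
(0.1 for ordered residues, 0.9 for a six-residue disordered motif) are
made up for illustration.

```python
def window_average(scores, window=21):
    """Smooth per-residue scores with a centered moving average.

    Toy stand-in for a window-based predictor, not IUPred itself.
    Windows are truncated at the sequence ends.
    """
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical raw propensities: 40 ordered residues, a 6-residue
# disordered motif, then 40 ordered residues.
raw = [0.1] * 40 + [0.9] * 6 + [0.1] * 40
motif = slice(40, 46)

for window in (21, 5):
    smoothed = window_average(raw, window)
    mean_motif = sum(smoothed[motif]) / 6
    print(f"window={window}: mean motif score = {mean_motif:.2f}")
```

With these made-up numbers the motif averages about 0.33 for the wide
window and about 0.74 for the narrow one, so a short disordered motif
flanked by ordered sequence can fall below a fixed cutoff purely because
of averaging.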
Our findings also showed that variants found in the flanking regions
of a motif can slightly affect interactions, suggesting the potential
involvement of these regions in maintaining the interface of a PPI.
This insight can be integrated into the definition of potentially
functional regions and into variant effect characterization, helping to
refine our understanding of how flanking regions contribute to
interface stability and how variants in these regions affect protein
interactions (Luck et al. 2012).
Furthermore, with the recent update of the ELM database, which has
enriched SLiM classes with new instances, re-running the DMI tool using
this updated dataset could significantly enhance prediction accuracy and
outcomes. In addition, our predicted and experimentally validated data
can
4.2 The application of DDI predictor and Al-
phaFold to map the PPI data with interaction
interfaces
There is an overwhelmingly large number of PPIs that are not mapped
to any known interface, indicating that many interface types remain
undiscovered, especially those involving motifs (Rolland et al.
2014; Tompa et al. 2014).
To detect these interfaces, the AF-MM approach can be employed
to identify novel interfaces, which can then be mapped onto PPIs and
overlapped with mutation data, as demonstrated in Chapter 2, Article II.
All in all, using AF-MM to discover novel interfaces holds great potential
as it bypasses the need for a reference list of interface types for interface
searching. However, scaling this approach for higher throughput will
require further development.
Another type of interface that was not mapped in this study is the
domain-domain interface (DDI). Given the more stable nature of folded
domains and the interactions they mediate, structural information on
DDIs is more abundant than on DMIs. For example, the 3did
database extensively catalogs DDIs (Geist et al. 2024). Our lab has
assessed the quality of DDIs in 3did, providing important insights
into features that can aid in scoring predicted DDIs for their ability
to mediate PPIs (Geist et al. 2024). Incorporating these insights
can improve the mapping of the PPI dataset with DDIs, which helps to
interpret the effect of variants on protein function.
4.3 Enhancing Predictive Accuracy of Variant Ef-
fects and Mutation Design through Positioning
on Predicted AF-MM Interface Structures
The application of AF-MM to predict novel interfaces, for which the
resolved structures are not available, significantly aided in understand-
ing how well the putative motif fits into the binding pocket through
visualization. This process allows for the detection of residues in close
contact, the assessment of the structural location of mutated residues,
and the design of mutants for experimental validation. While the struc-
tural information can give insight into the predicted interfaces and help
in variant characterization, the manual inspection of predicted
structures and the localization of variants is time-consuming. To address
this limitation, my colleague is currently working on applying AF-MM
to the entire set of DMIs that overlap with pathogenic or VUS variants
in order to analyze and incorporate the structural information.
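Detecting residues in close contact, as described above, can in
principle be automated. The following is a minimal sketch under stated
assumptions: the Atom record, the function interface_residues, the
5 angstrom cutoff, and the coordinates are all hypothetical and do not
reproduce the pipeline mentioned in the text. Parsing of real model
files is deliberately left out.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    chain: str    # chain identifier, e.g. "A" for the domain, "B" for the motif
    resnum: int   # residue number
    xyz: tuple    # (x, y, z) coordinates in angstroms


def interface_residues(atoms, chain_a, chain_b, cutoff=5.0):
    """Return residue numbers of chain_a and chain_b that have at least
    one atom pair within `cutoff` angstroms (cutoff value is assumed)."""
    a_atoms = [a for a in atoms if a.chain == chain_a]
    b_atoms = [a for a in atoms if a.chain == chain_b]
    res_a, res_b = set(), set()
    for a in a_atoms:
        for b in b_atoms:
            if math.dist(a.xyz, b.xyz) <= cutoff:
                res_a.add(a.resnum)
                res_b.add(b.resnum)
    return sorted(res_a), sorted(res_b)


# Made-up coordinates for illustration, not taken from a real model.
atoms = [
    Atom("A", 37, (0.0, 0.0, 0.0)),
    Atom("A", 90, (20.0, 0.0, 0.0)),
    Atom("B", 61, (3.0, 0.0, 0.0)),
    Atom("B", 70, (40.0, 0.0, 0.0)),
]
domain_contacts, motif_contacts = interface_residues(atoms, "A", "B")
print(domain_contacts, motif_contacts)  # [37] [61]
```

A variant position can then be checked against the returned residue
lists to decide whether it sits in the predicted interface.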
4.4 Improvement of the BRET assay to validate the
predicted interfaces
While the medium-throughput cloning pipeline and BRET assay developed
in this study have been valuable for validating predicted interaction
interfaces, several steps within the pipeline could be optimized to
enhance both efficiency and accuracy. The manual picking of colonies
is a bottleneck in the current plate-based medium-throughput pipeline,
requiring substantial time and labor to select individual colonies for
inoculation. Implementing automated colony pickers could address this
issue by handling multiple colonies simultaneously with higher
precision, thereby speeding up the workflow and reducing the risk of
contamination or human error.
Using the BRET assay, we detected about 50% of the tested
protein-protein interactions from HuRI. Although this detection rate is
in line with or even higher than that of previous studies, there is
still room for improvement. Initially, we cloned fusion proteins
exclusively with N-terminal tags, based on the observation that
expression is better with N-terminal fusions (Trepte et al. 2018).
However, Trepte et al. have demonstrated that testing protein pairs in
various configurations increases detection rates while maintaining low
false detection rates (Trepte et al. 2018; C. Trepte S. et al. 2021).
They also showed that tagging the proteins close to the interaction
interface might improve PPI detection. While cloning tags in different
configurations or close to the interface could enhance the detection of
PPIs, this approach might also increase the time required for cloning.
Additionally, choosing a more sensitive fusion tag can enhance the
detection capability of the BRET assay regardless of the tag's position
relative to the interaction interface. Tags with higher quantum yields
or better resonance energy transfer efficiencies can lead to stronger
and more reliable BRET signals. For example, using mNeonGreen as an
acceptor fluorophore in BRET assays significantly increased the dynamic
range and sensitivity compared to traditional GFP derivatives
(Shaner et al. 2013). Such tags could improve detection sensitivity
even if the tag is not in the optimal position relative to the
interaction interface.
The NL and mCit fusions used in the BRET assay allowed us to
monitor the expression levels of wildtype and mutant constructs, which
is important to rule out loss of binding because of a destabilization of the
protein. However, we cannot exclude the possibility that some expressed
mutants might still be partially unfolded or mislocalized and thus, some
loss of binding detected in our study could be unspecific and not the
result of a specific perturbation of the predicted interface (Lacoste et
al. 2023).
Advanced imaging techniques could be scaled up and integrated into
the workflow to assess whether mutant proteins are mislocalized. This
approach would help determine whether the observed binding loss is due
to the mutant proteins being in the incorrect cellular compartment
rather than to a direct effect on the interaction interface. I
attempted to implement BRET-based bioluminescence imaging (BLI) to test
whether the localization of a mutant is changed compared to the
wild-type protein, but faced challenges in the experimental setup, which
needs to be optimized for robust quantitative analysis. This
optimization involves finding the optimal number of cells for seeding,
the DNA ratio for more efficient transfection, and the concentration of
the transfection agent. It also involves the downstream analysis, in
which regions of interest (ROIs) for specific cellular compartments are
defined, using either manual methods or automated segmentation
algorithms. With compartment markers, this allows assessing whether the
mutant proteins overlap with these markers and determining any shifts
in localization. Further statistical analysis would ensure that any
observed differences are significant and not due to random variation.
Implementing these strategies will help confirm whether localization
changes contribute to the observed effects, thereby providing a more
accurate interpretation of the impact of mutations on protein
function.
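One way to quantify the overlap between a protein signal and a
compartment marker inside an ROI is a Pearson correlation of pixel
intensities. The sketch below uses this coefficient as one common
choice; it is an assumption and not the analysis used in this thesis.
The function pearson_in_roi and the tiny example images are hypothetical.

```python
import math


def pearson_in_roi(channel_a, channel_b, roi_mask):
    """Pearson correlation of two intensity images restricted to an ROI.

    Images and mask are equally shaped nested lists; mask entries are
    truthy for pixels inside the region of interest.
    """
    a = [v for row, m_row in zip(channel_a, roi_mask)
         for v, m in zip(row, m_row) if m]
    b = [v for row, m_row in zip(channel_b, roi_mask)
         for v, m in zip(row, m_row) if m]
    n = len(a)
    if n < 2:
        raise ValueError("ROI must contain at least two pixels")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("constant intensity inside ROI")
    return cov / math.sqrt(var_a * var_b)


# Made-up 2x3 images; the last column lies outside the ROI.
marker = [[1, 2, 9], [3, 4, 9]]
wild_type = [[2, 4, 0], [6, 8, 0]]
mutant = [[8, 6, 0], [4, 2, 0]]
roi = [[1, 1, 0], [1, 1, 0]]

print(pearson_in_roi(marker, wild_type, roi))  # 1.0
print(pearson_in_roi(marker, mutant, roi))     # -1.0
```

Comparing such coefficients between wild-type and mutant across many
cells would then be the input to the statistical analysis mentioned
above.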
Along with mislocalization studies, BRET-based imaging can be
used for the detection of BRET in single cells (Dragulescu-
Andrasi et al. 2011; Kobayashi et al. 2019). The determination
of BRET per cell might enhance the precision of interaction studies by
providing detailed insights into how individual cells contribute to the
overall interaction dynamics. Quantifying BRET signals per cell allows
for a more granular analysis of the interactions, potentially revealing
variations in PPI strength and localization that may be masked in bulk
measurements. Moreover, BRET-based microscopy can be applied not
only in mammalian cells but also in tissues and in vivo animal models
(Dragulescu-Andrasi et al. 2011). Kobayashi et al. (2019) demon-
strated the use of BRET-based imaging to monitor protein interactions
and subcellular localization in live animal tissues. Their study empha-
sized that BRET, with its enhanced dynamic window due to reduced
background signals, is particularly effective for detecting subtle changes
in protein interactions. They also illustrated the quantification of BRET
signals, including the dissociation of protein complexes and redistribu-
tion within cellular compartments. For instance, they used manual seg-
mentation and pixel-by-pixel analysis to quantify BRET signals from
specific subcellular regions, revealing significant changes in protein in-
teractions upon receptor activation. This quantitative approach enabled
precise measurements of BRET signal changes, facilitating detailed in-
sights into dynamic biological processes such as receptor endocytosis and
protein localization in vivo. However, this was done on a small scale.
A scalable, plate-format BRET-based BLI pipeline remains to be
developed.
4.5 General outlook
This thesis has proposed a strategy driven by prediction and experimen-
tal validation of domain-motif interfaces and integrating this information
to interpret the effect of uncharacterized variants on protein function.
In doing so, we have gained profound insights into the intricate inter-
play between different functional modules, such as domains and motifs
in proteins that facilitate their interactions. Moreover, using this strat-
egy we provided experimental evidence and structural information on
the effect of variants falling into DMIs mediating protein-protein inter-
actions. This information can be explored in future studies aimed at
delineating potential molecular mechanisms causing disease.
Given the useful mechanistic insights that prediction tools like the
DMI predictor tool can provide, I expect the optimization and
application of these tools (DDI predictor and AF-MM) in mapping PPIs
with interfaces to bring us closer to a fully structurally annotated
human protein interactome. Moreover, I anticipate greater inclusion of
interface information in experimental workflows, where it will help
generate hypotheses to guide experiments and aid in variant
characterization.
Binary interaction assays like BRET have proven to be suitable
tools for validating PPI interfaces, but there are still several ways to
further enhance their capabilities in characterizing variant effects on
PPIs (Dragulescu-Andrasi et al. 2011; Kobayashi et al. 2019). In
addition to expanding the power to systematically assess the effects of
variants on PPIs, it is crucial to implement systematic downstream
steps (e.g., reporter assays, cell proliferation, apoptosis assays) to
gain deeper insights into how these variants impact biological
processes. By integrating these additional steps, researchers can move
beyond identifying whether a variant disrupts a specific interaction
and start understanding the functional consequences of these
disruptions within the context of cellular pathways and networks. I
anticipate that the assay will advance in this direction.
Chapter 5
Appendix
5.1 Protocols
5.1.1 The medium-throughput cloning protocol
 
Medium-throughput GATEWAY cloning protocol 
 
 
Data organization in MySQL DB cloning_data 
Every HTP cloning project should have a bioinformatician assigned to it who helps with putting the 
data in the tables. Everything that the experimentalist can do on his/her own should not be done by 
the bioinformatician. 
 
table project_descr:
column_name                content
project_id                 e.g. CL01
experimenter               e.g. Christian
bioinformatician           e.g. Eric
descr                      e.g. cloning project for XL-MS project, more PRS pairs, every ORF cloned in
                           N-ter NL and N-ter mCit vector
date_started               e.g. 2022-10-11

table orf_pairs:
column_name                content
project_id                 e.g. CL01
orf_a                      e.g. 49583 → if the ORFs of a pair are in no particular order => put smaller
                           orf_id as orf_a, otherwise describe order of ORFs in descr column in
                           project_descr table
orf_b                      e.g. 98584

table entry_clone_info:
column_name                content
orf_id                     e.g. 49583
orf_len_nt                 e.g. 1980
entry_plate_id             e.g. GDEh81001 → plate name from ORFeome
entry_well_id              e.g. A01 → well ID from plate from ORFeome
entry_inoc_plate_id        e.g. CL01GEh_01
entry_inoc_well_id         e.g. C10
pcr_amplicon               0 or 1 → 1 if PCR product looks good on gel, 0 otherwise
comments                   space to leave additional comments for an ORF if needed

table expr_clone_info:
column_name                content
orf_id                     e.g. 49583
expr_plasmid_id            e.g. KL_11
expr_plasmid_name          e.g. pcDNA3.1 cmyc-NL-GW
LR_plate_id                e.g. CL01LR_01.1
LR_plate_well              e.g. C10
colonies                   0 or 1 → 1 if there were colonies selected for picking, 0 otherwise
expr_plate_id              e.g. CL01GExh_01.1
expr_plate_well            e.g. B05
MP_elu_plate_id            this and the next 3 columns only need to be filled if rearray occurred
MP_elu_plate_well
expr_dil_plate_id
expr_dil_plate_well
DNA_conc_ng_ul             e.g. 235
theor_DNA_conc_dil         e.g. 100
seq_confirmed_bb_fw        0 or 1
seq_confirmed_bb_rv        0 or 1
seq_confirmed_full_length  0 or 1
comments                   space to leave additional comments for an ORF if needed
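
The first two tables could be set up roughly as follows. This is a
sketch only: SQLite (via Python's sqlite3 module) is used as a stand-in
for the MySQL database cloning_data, the column types and keys are
assumptions, and the helper add_orf_pair is hypothetical. Only the table
and column names follow the protocol.

```python
import sqlite3

# In-memory SQLite as a stand-in for the MySQL DB cloning_data.
# Column types and keys are assumptions; names follow the protocol.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_descr (
    project_id       TEXT PRIMARY KEY,   -- e.g. CL01
    experimenter     TEXT,
    bioinformatician TEXT,
    descr            TEXT,
    date_started     TEXT                -- ISO date, e.g. 2022-10-11
);
CREATE TABLE orf_pairs (
    project_id TEXT REFERENCES project_descr(project_id),
    orf_a      INTEGER,
    orf_b      INTEGER
);
""")


def add_orf_pair(conn, project_id, orf_x, orf_y):
    """Store a pair with the smaller ORF ID as orf_a, as the protocol
    asks for pairs that have no particular order."""
    orf_a, orf_b = sorted((orf_x, orf_y))
    conn.execute(
        "INSERT INTO orf_pairs (project_id, orf_a, orf_b) VALUES (?, ?, ?)",
        (project_id, orf_a, orf_b),
    )


conn.execute(
    "INSERT INTO project_descr VALUES (?, ?, ?, ?, ?)",
    ("CL01", "Christian", "Eric", "example project", "2022-10-11"),
)
add_orf_pair(conn, "CL01", 98584, 49583)
print(conn.execute("SELECT orf_a, orf_b FROM orf_pairs").fetchall())
# [(49583, 98584)]
```

Pairs with a meaningful order would bypass the sorting step and be
documented in the descr column instead, as described in the table above.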
 
 
Material List: 
In separate excel sheet with calculator for amounts 
- you can find this checklist on the:
group drive → HTP_cloning_data → templates → CL00_checklist_consumables
Prior to start of cloning 
Computational part 
- 2 weeks in advance: get cloning project ID by checking on group drive in HTP_cloning_data
folder what the last cloning ID was, increment by 1 count, i.e. if last cloning ID was CL01 →
make CL02
- 2 weeks in advance: design and discuss with Katja and bioinformatician for cloning project 
plate layout for ORF inoculation plates for Day 1 and plate layout for inoculation plates of 
picked LR transformants for Day 4 (the way the plates should be organized, code can be 
written but the rearray is only at day 4 possible) 
- consider leaving a well free for the water control for the PCR and decide whether and
which controls you would like to have for LR
- Design labels for your plates → to see an example for how plate labels should be designed
and for which plates labels are needed → take a look at the file
HTP_cloning_plate_labels_schematic.pdf → then start from the plate labels from a previous
cloning project by making a copy of the plate_labels.txt file from a previous cloning project into
the folder of your new cloning project on the group drive and modify accordingly → if you are
not sure, PLEASE, ask → if plate labels are wrong, computational as well as experimental
steps of your cloning project can go wrong
- Print the plate labels with the help of Mareen (a template for the labels can be found on group 
drive, HTP_cloning_data, templates, HTP_plate_labels; please also read the explanation how 
to print these labels) 
- Get plate layouts with the help of a bioinformatician -> the scripts to design the plate layouts 
ideally need information about the plate labels 
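
The ID increment described in the first point of this list is simple enough to write down. This is a minimal sketch, assuming project IDs always have the form "CL" followed by digits; the function name is hypothetical and the group drive lookup itself is not covered.

```python
import re


def next_cloning_id(last_id):
    """Increment a cloning project ID such as 'CL01' to 'CL02'."""
    match = re.fullmatch(r"(CL)(\d+)", last_id, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unexpected cloning ID format: {last_id!r}")
    _prefix, number = match.groups()
    # Keep the zero padding of the previous ID (at least two digits).
    width = max(2, len(number))
    return f"CL{int(number) + 1:0{width}d}"
```

For example, `next_cloning_id("CL01")` gives `"CL02"`. The new ID is then the prefix for all plate labels of the project.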
 
 
Experimental part: 
- About 4 weeks in advance prepare competent DH5α E. coli cells -> talk to Mareen, the 
preparation itself takes 1 week 
- 2-4 weeks in advance do maxi, midi or miniprep of empty expression vectors and plasmids for 
LuTHy assay (KL_01, KL_02, KL_03, KL_06, KL_07, KL_11, KL_247) 
- At least 2 weeks in advance, make a copy of the excel sheet with the list of reagents and 
save it in your cloning folder on the group drive, calculate your amounts and check that 
everything is available 
- 2 weeks in advance check the amount of  
- PCR plates (order no.: 781352 from Brand) 
- PCR foil 
- Reservoir for LB medium (order no.: HT69.1 from Carl Roth) 
- Costar plates (order no.: 3799 from Corning) 
- Microplate aluminum sealing tape (order no.: 6570 from Corning) 
- Adhesive gas permeable seals (order no.: AB-0718 from Thermo Scientific) 
- Combitip advanced 1ml (order no.: 0030089.430 from Eppendorf)  
- Qtray with lid and divider (square plates for Agar; order no.: MLDVX6029 from VWR 
international GmbH) 
- E-Gel 96 1% Agarose (GP) (check the expiry date; order no.: G700801 from 
Invitrogen) 
- E-Gel 96 High range DNA marker (order no.: 12352019 from Invitrogen) 
- Sterile/autoclaved 2ml Deepwell plates, 96 round wells (order no.: E2896-2110 from 
Starlab) 
- Qiaprep 96 Plus MiniPrep Kit (order no.: 27291 from Qiagen) 
- QIAvac 96 (vacuum system needed for the MiniPrep; order no.: 19504)  
- 1250 µL (blue) integra grip tips for a digital multichannel pipette 
- 125 µL (yellow) integra grip tips for a digital multichannel pipette 
- 12,5 µL (pink) integra grip tips for a digital multichannel pipette 
- 1 week in advance - get familiar with  
- The digital multichannel pipette  
- All other equipment you will need 
- The excel sheet to calculate the amounts 
- SQL database 
- LuTHy assay transfection template 
- The scripts for the different steps 
- 1 week in advance place all needed consumables on your bench or at -20°C 
- 1 week in advance - check the amount of: 
- 40% glycerol (sterile) 
- Proteinase K (2µg/µl) 
- HF PCR polymerase and buffer (from Protein Production CF) 
- dNTPs (NEB freezer at IMB) 
- 96 gel loading buffer (Homemade, recipe:10mM Tris-HCl, 1mM EDTA, 0,005% 
bromophenol blue) 
- LR clonase (from Protein Production CF, should be stored at -80°C or better -150°C) 
- SOC medium (~8ml/plate) 
- Needed antibiotic  
- Ampicillin (100mg/ml) 
- Kanamycin (30mg/ml)  
- Spectinomycin (50mg/ml) 
- LB medium 
- LB-Agar (250ml/square plate) 
- Sterile glass plating beads 
- Sterile toothpicks 
- SOC medium  
- At least 1 week in advance, order sequencing barcodes for the plates (Starseq) 
- Between 1 and at most 5 days in advance, prepare square plates with agar 
- At least 1 day in advance, sort ORFeome plates in new rack  
- Do this step with one additional person as helper 
- work with blocks of 7 plates, because they fit as one block in the rack 
- Presort your ORF plates into a new rack; you will need ~1h per 10 plates (including 
time to let the -80 come back to temperature): Take out rack from -80 freezer, close 
freezer, sort out the plates needed in a box with dry ice. Put the rack back in the 
freezer and sort the plates into a new rack according to the order you will pick from 
them. Let the freezer get back to -80˚C before you go for the next batch of plates. 
- fill PCR protocol for X-reactions with calculations  
- Stock SOC medium  
- Cut small pieces of Alu foil for resealing of plates 
- Aliquot expression vectors in PCR strips 
- Dilute primer for PCR in a 1,5ml Eppendorf tube 
- Dilute and aliquot primer for sequencing in PCR strips  
- after LR we are sequencing with forward and reverse primer at the same time 
- after running the sequencing pipeline you will see for which ORF you need to design 
primers for full length sequencing 
- Forward primer:  
- For N-terminal NL-fusion: primer #44 NanoLuc-398fwd 
(GAACGGCAACAAAATTATCGAC) 
- For N-terminal mCit-fusion: primer #47 mCitrine-547fwd 
(AGCAGAATACGCCCATCG) 
- Reverse primer: 
- If there is no C-terminal fusion: 
primer #51 pEXP_rev (GGCAACTAGAAGGCACAGTC) 
 
 
Overview of the plates 
 
Step                     Plate label (example)                   Type of plate 
Inoculation plate        CL01GDEh_01                             Costar plate (#3799) 
PCR plate                CL01PCR_01                              PCR plate (#781352) 
Gel plate                CL01Gel_01                              PCR plate (#781352) 
LR plate                 CL01LR_01.1*, CL01LR_01.2*              PCR plate (#781352) 
Transformation plate     CL01TR_01.1*, CL01TR_01.2*              PCR plate (#781352) 
Agar plate               CL01TR_01.1a / CL01TR_01.1b,            Qtray (#MLDVX6029) 
                         CL01TR_01.2a / CL01TR_01.2b 
Deepwell inoc plate      CL01GExDW_01.1a / CL01GExDW_01.1b,      Deepwell plate (#E2896-2110) 
                         CL01GExDW_01.2a / CL01GExDW_01.2b 
Glycerolstock plate      CL01GEx_01.1, CL01GEx_01.2              Costar plate (#3799) 
MiniPrep elution         CL01GExMP_01.1, CL01GExMP_01.2          Costar plate (#3799) 
DNA dilution plate       CL01GExDil_01.1, CL01GExDil_01.2        PCR plate (#781352) 
DNA Database plate       CL01GExSt_01.1, CL01GExSt_01.2          PCR plate (#781352) 
DNA sequencing plate     CL01GExSF_01.1, CL01GExSF_01.2,         PCR plate (#781352) 
(forward and reverse)    CL01GExSR_01.1, CL01GExSR_01.2 
 
All plates should be labeled at the left side (having A1 top left corner) 
*where applicable plates labelled x.1 contain NL fusion constructs, plates labelled x.2 contain mCit 
fusion constructs 
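
The naming scheme in the overview (project ID, step code, plate number, then .1 for NL and .2 for mCit) can be generated rather than typed by hand. The sketch below is an illustration under that reading of the table, covering only a subset of the steps; the function name and dictionary keys are hypothetical and this is not the lab's label script.

```python
def plate_labels(project_id, plate_no, fusions=(1, 2)):
    """Build example labels for one ORF plate, following the overview table."""
    base = f"{plate_no:02d}"
    labels = {
        "inoculation": [f"{project_id}GDEh_{base}"],
        "pcr": [f"{project_id}PCR_{base}"],
        "gel": [f"{project_id}Gel_{base}"],
    }
    # Steps after the LR reaction exist once per fusion (.1 = NL, .2 = mCit).
    per_fusion = {
        "lr": "LR",
        "transformation": "TR",
        "glycerol_stock": "GEx",
        "miniprep_elution": "GExMP",
        "dna_dilution": "GExDil",
    }
    for step, code in per_fusion.items():
        labels[step] = [f"{project_id}{code}_{base}.{f}" for f in fusions]
    # Each transformation plate is spread over two 48-field agar plates (a, b).
    labels["agar"] = [
        f"{label}{half}" for label in labels["transformation"] for half in "ab"
    ]
    return labels
```

Calling `plate_labels("CL01", 1)` returns, among others, the LR labels CL01LR_01.1 and CL01LR_01.2 and the four agar plate labels listed in the table.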
 
 
 
 
Day 1 Picking and inoculation of ORFs (~1.5h just picking) 
 
Checklist: 
● 70% EtOH and tissues - to sterilize the plates from the ORFeome collection 
● 50 mL falcon tube - to prepare mix of LB medium with corresponding antibiotic  
● Tips - for picking up the ORF from the collection plate  
● Alu foil cut in small pieces (size of one well) to close the opened wells with ORF  
● 50 mL serological pipette 
● Pipette boy  
● 100 µL pipette 
● Pipette tips  
● 96-well, costar plate (Corning,#3799) - for the inoculation of the ORFs 
● Adhesive gas permeable seals (order no.: AB-0718 from Thermo Scientific) 
● Multichannel pipette with tips  
● LB medium (200 µL per well) 
● Antibiotic (Kanamycin or Streptomycin - 0,2µl/well) 
● Dry ice box - for keeping the plates from the ORFeome collection while picking the ORFs 
 
 
Do the following steps with one, better two additional people as helpers! 
 
steps: 
1. Use aseptic bench working technique 
2. Label the inoculation plate (CL01GDEh_01) 
3. Prepare a master mix of LB medium and antibiotic in a 50ml Falcon and vortex 
a. 200µl LB medium/well 
b. Pay attention to which ORF needs which antibiotics 
c. 1:1000 mixing of antibiotic to LB medium (i.e. 1µL into 1000µL of LB medium) 
4. Prepare the reservoir for the LB medium 
5. Pour the antibiotic LB mix in the reservoir 
6. Use the 300µl multichannel pipette to distribute 200µL of antibiotic LB mix to each well in the 
96-well plate 
7. Take out the box with dry ice and put the first 7 plates with the needed ORFs on it  
a. make small stacks to keep cold 
8. Work as fast as possible on dry ice here  
(working with 3 people simplifies the process, 1. person is taking the plates out, 2. 
person is picking the ORFs, 3. person is controlling) 
a. Disinfect the alu foil of the ORFeome plate 
b. With the tip/toothpick make hole in the selected well 
c. Take another tip and scratch the ORF 
d. Then put it into the well of the inoculation plate with LB/Antibiotic medium, stir for a 
few seconds and discard the tip 
e. Immediately close the hole with the pre-cut alu foil pieces  
9. Repeat steps 8b - 8e for each well to pick from a plate 
10. Move to the next plate until done with the first batch then go back to step 7 
11. Seal the inoculation plate with the air permeable adhesive seal  
12. Incubate the plate overnight at 37 ˚C, 190rpm,  
a. cover with a paper box (to reduce evaporation) 
b. Incubator in the Niehrs lab 
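
The volumes for the LB/antibiotic master mix in step 3 follow directly from 200 µL per well and the 1:1000 dilution. A minimal sketch of that calculation is given here; the function name is hypothetical and the optional excess for pipetting loss is an assumption, not part of the protocol.

```python
def lb_antibiotic_mix(n_wells, lb_per_well_ul=200.0, dilution=1000, excess=0.10):
    """Return (LB volume in mL, antibiotic volume in µL) for n_wells wells."""
    # The excess for pipetting loss is an assumption, not part of the protocol.
    lb_ul = n_wells * lb_per_well_ul * (1 + excess)
    antibiotic_ul = lb_ul / dilution  # 1:1000, i.e. 1 µL per 1000 µL LB
    return lb_ul / 1000, antibiotic_ul
```

Without excess, a full 96-well plate needs 19.2 mL of LB and 19.2 µL of antibiotic, which fits in one 50 mL Falcon.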
 
 
Day 2 PCR, Glycerolstock, E-Gel and LR reaction 
 
Checklist PCR: 
● 96-well skirted PCR plate - for PCR reactions 
● 50 mL falcon tube - to prepare PCR master mix, 50ml because of multipette 
● Alu foil - to cover the glycerolstock plate 
● PCR foil - to cover the PCR plate 
● Multichannel pipette 
● Multipette and combitips 1ml 
● 100 µL (yellow) integra pipette tips special for a digital multichannel pipette 
● 10 µL (pink) integra pipette tips special for a digital multichannel pipette 
● 10ml reservoir - to pipette the reaction and transfer to the PCR plate 
● Ice block - to keep PCR components cold 
● 40% glycerol stock ( = 10 mL) 
● PCR components  
● PCR plate containing inoculated ORFs - for PCR 
● E-Gel 
 
Checklist for E-Gel: 
● PCR plate 
● Combitip advanced 2,5ml 
● 96-well E-Gel 
● 96 gel loading buffer (Homemade, recipe:10mM Tris-HCl, 1mM EDTA, 0,005% bromophenol 
blue) 
● DNA marker E-Gel  
● 50 µL manual multichannel pipette for loading the gel 
● 200 µL tips 
● BioRad Detection Machine 
 
Checklist LR reaction: 
● Cold PCR block: thermomixer block that has been kept in the cold 
● 2 PCR plates for mCit and NL fusions 
● PCR plate (CL01PCR_01) with PCR products 
● DNA for expression vectors (KL_11 & KL_247), diluted to 200 ng/µl, aliquoted to PCR stripes 
with 20µl each 
● Ice box 
● Autoclaved water 
● 12.5 µL multichannel pipette (Tick) 
● 125 µL multichannel pipette (Trick) 
● 100 µL (yellow) integra pipette tips special for a digital multichannel pipette 
● 10 µL (pink) integra pipette tips special for a digital multichannel pipette 
● Rack for eppi tubes for LR clonase (the distance of the small racks work with the digital 
multichannel pipette) 
 
 
 
 
PCR  
PCR program: 
Temperature Time Repeat Step 
98˚C 30s 1x Initial denaturation 
98˚C 10s 30x Denaturation 
55˚C 30s 30x Primer annealing 
72˚C 3min 30x Extension 
72˚C 5min 1x Final extension 
16°C ∞ 1x Hold 
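
To plan the day it helps to know how much of the PCR run is programmed hold time. The sketch below only sums the times in the table above; ramping between temperatures depends on the cycler and is not included, so the result is a lower bound rather than the actual run time.

```python
# (temperature in °C, seconds, repeats) taken from the PCR program table.
# The 16 °C hold step is left out because it has no fixed duration.
PCR_PROGRAM = [
    (98, 30, 1),    # initial denaturation
    (98, 10, 30),   # denaturation
    (55, 30, 30),   # primer annealing
    (72, 180, 30),  # extension
    (72, 300, 1),   # final extension
]


def programmed_minutes(program):
    """Sum the programmed hold times in minutes, ignoring ramping."""
    return sum(seconds * repeats for _, seconds, repeats in program) / 60


print(programmed_minutes(PCR_PROGRAM))  # 115.5
```

The programmed steps add up to just under 2 hours; the remainder of the run is ramping time.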
 
Master Mix 
PCR components Per 1 reaction (=1 well) Per 100 reactions  
Primer #48 pENT-F 10µM 2.5 µL 250 µL 
Primer #49 pENT-R 10µM 2.5 µL 250 µL 
dNTPs (10mM each dNTP) 1 µL  100 µL 
10x High fidelity polymerase buffer 5 µL 500 µL  
High fidelity DNA polymerase  0.55 µL 55 µL 
H2O 34.45 µL 3445 µL (= 5x 689 µL) 
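
For a number of reactions other than 100, the table can be scaled per component. This is a minimal sketch using the per-reaction volumes from the table; no excess is added, so any extra volume for pipetting loss has to be included in the number of reactions passed in.

```python
# Per-reaction volumes in µL, copied from the master mix table.
PER_REACTION_UL = {
    "Primer #48 pENT-F 10 µM": 2.5,
    "Primer #49 pENT-R 10 µM": 2.5,
    "dNTPs (10 mM each)": 1.0,
    "10x High fidelity polymerase buffer": 5.0,
    "High fidelity DNA polymerase": 0.55,
    "H2O": 34.45,
}


def master_mix(n_reactions):
    """Scale each component to n_reactions (no excess added)."""
    return {
        name: round(volume * n_reactions, 2)
        for name, volume in PER_REACTION_UL.items()
    }


mix = master_mix(100)
```

The per-reaction volumes sum to 46 µL, which matches the 46 µL of master mix pipetted per well before 4 µL of culture is added.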
 
Steps PCR: 
1. Label the PCR plate (CL01PCR_01) 
2. Once the PCR components start to thaw, vortex each PCR reagent 
3. Prepare a master mix of all PCR components (see table) 
a. In 50mL falcon tube  
4. Pipette 46 µl of the master mix in each well of the PCR plate  
a. Using the multipette and combitip 1ml (on ice/cold block) 
5. One well should be used as control (master mix without ORF) 
6. Remove the airpore seal of the inoculation plate 
7. Close the inoculation plate with aluminum foil 
8. Vortex the inoculation plate 
9. Carefully remove aluminum foil 
10. Transfer 4µL of the inoculated ORF culture to the PCR plate  
a. With the manual 10µL multichannel pipette 
b. The ORF layout is the same 
c. Always use new tips 
11. Close the PCR plate with PCR foil 
12. Vortex the plate briefly 
13. Centrifuge briefly 
14. Run the PCR ( ~3 hours) 
 
 
Steps glycerolstock: 
1. Check two wells how much bacteria culture is left 
2. Then remove enough culture from each well so that 100µl of bacteria culture is left in the inoculation plate 
3. Add 100 µL of sterile 40% glycerol to each well of the inoculation plate (1:1 ratio) 
4. Close the plate with alu foil 
5. shake 45 sec at 800 rpm on the thermomixer 
6. Store at -80°C (rack 8) 
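
The 1:1 mix of culture and 40% glycerol gives a final glycerol concentration of 20%. If the remaining culture volume differs from 100 µL, the final concentration can be checked with a small calculation; the function name below is hypothetical.

```python
def final_glycerol_percent(culture_ul, glycerol_ul, stock_percent=40.0):
    """Final glycerol concentration (% v/v) after mixing culture and stock."""
    return stock_percent * glycerol_ul / (culture_ul + glycerol_ul)
```

`final_glycerol_percent(100, 100)` returns 20.0; with 150 µL of culture left in the well the same 100 µL of glycerol would only give 16%.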
 
Validation of the PCR product with E-gel   
- Info:  
- PCR products can be stored at 4°C for 48h, for longer time freeze PCR products 
- Document all wells that do not look ok on gel -> this info needs to go into MySQL 
table, send info to bioinformatician 
Steps:  
1. Label the E-Gel plate (i.e. CL01Gel_01) 
2. Pipette 25 µl of blue 96 gel loading buffer in the E-Gel plate 
a. Using the multipette and 2,5ml Combitip 
b. Can be done while PCR is running 
3. Add 6 µl of PCR product to each well  
a. Using the 10µl multichannel pipette 
4. Install 96 well E-gel to the motherbase 
5. Load 20µl PCR/buffer mix to each well 
a. Using the 50µl multichannel pipette 
6. Load 20µl of E-Gel 96 High range DNA marker 
7. All empty wells must also be filled with 20µl 
a. With buffer or loading dye 
8. Insert the plug into the socket 
9. Run gel for 12 min  
a. Program EG 
10. Take picture with GelDoc Station  
11. Analyze gel picture with the E-Editor 2.0 software  
a. On the desktop PC in the technical room  
b. Realign the bands and save it in your cloning project folder 
c. The software is pretty self-explanatory and has a manual available under the help 
button. Ask Katja for help. 
12. Decide if PCR was successful and whether it is worth proceeding 
13. Document all wells that did not look ok  
a. Add this information to the MySQL table entry_clone_info 
 
LR reaction 
 
Components Per 1 reaction (1 well) 
H2O 5,5 µL 
Destination vector (200ng/µl) 1 µL  
PCR product 1 µL  
4x LR clonase 2,5 µL 
 
 
1. Label the LR plates (CL01LR_01.1, CL01LR_01.2) 
2. Decide if you want to include controls for the LR reaction 
a. E.g. no clone (only water), no LR clonase 
3. Take out the destination vectors and put them on the bench to thaw 
a. Prepare a PCR stripe with 8x 20µl of KL_11 (NL-GW) 
b. Prepare a PCR stripe  with 8x 20 µl KL_247 (mCit-His3C-GW) 
4. In a clean reservoir pour ~ 2 mL of autoclaved water 
5. Add 5,5µl water into each well 
a. Use 125 µL multichannel pipette (Trick) 
b. Aspirate 66 µl water and distribute 12x 5.5µl 
c. Repeat for the second plate. 
6. Add 1µl of PCR product 
a. Use 12.5 µL multichannel pipette (Tick) 
b. Aspirate 2 µL of PCR products 
c. load 1 µl to each PCR plate for LR reaction. 
7. Add 1µl of the NL expression vector KL_11 
a. Into the plates, which will contain NL fusions in the end (i.e. CL01LR_01.1) 
b. Use multichannel pipette 
8. Add 1 µL of the mCit expression vector KL_247 
a. Into the plates, which will contain mCit fusions in the end (i.e. CL01LR_01.2) 
b. Use multichannel pipette 
9. Take 8 tubes 4x LR clonase out and put on a rack  
a. Info: if the LR clonase is still very cold - it is difficult to pipette, LR clonase will be 
outside of the tip and the resuspension step gets difficult 
b. better: for 1 plate you will need 8 tubes of LR clonase (each tube contains 40µl) - 
vortex the tubes, centrifuge LR clonase, wait until LR clonase is easy to pipette 
(~2min) and then start, the leftover of the LR clonase should be discarded 
10. Vortex each LR clonase twice for 2 seconds and put back to the rack 
11. Add 2.5 µL LR clonase 
a. Use 12.5 µL multichannel pipette (Tick) 
b. Program HTP_LR 
c. Resuspend and discard the tips 
12. Repeat until all wells of both plates have received LR clonase 
13. Cover LR-plates with alu foil  
14. Incubate overnight at 25°C (in PCR machine) 
15. Close the plate with the PCR reaction with alu foil and store the PCR products at -20°C 
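
The volumes in the steps above can be cross-checked with a few lines of arithmetic. The sketch assumes that each of the 8 LR clonase tubes feeds one channel of the multichannel pipette, i.e. one row of 12 wells; this is an interpretation of step 9, not something the protocol states explicitly.

```python
WELLS_PER_PLATE = 96
CLONASE_PER_WELL_UL = 2.5
TUBES_PER_PLATE = 8        # assumption: one tube per plate row (one per channel)
TUBE_VOLUME_UL = 40.0

wells_per_tube = WELLS_PER_PLATE // TUBES_PER_PLATE        # 12 wells
used_per_tube_ul = wells_per_tube * CLONASE_PER_WELL_UL    # 30.0 µL
leftover_per_tube_ul = TUBE_VOLUME_UL - used_per_tube_ul   # 10.0 µL

# Water is dispensed the same way: one 66 µL aspiration covers a full row.
water_aspirate_ul = wells_per_tube * 5.5                   # 66.0 µL

# Total LR reaction volume per well: water + vector + PCR product + clonase.
reaction_volume_ul = 5.5 + 1.0 + 1.0 + CLONASE_PER_WELL_UL  # 10.0 µL
```

Under this assumption about 10 µL remains in each tube after one plate, which is the leftover that step 9b says to discard.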
 
Alternative 
 
1. Prepare all needed components for LR 
2. Prepare a master mix of your expression vector and water (can be done several days before) 
3. Aliquot 6,5µl of water/expression vector mix  into both LR plates 
a. with the multipette  
4. Add 1µl of PCR product  
a. with 10µl multichannel pipette 
5. Take 8 tubes of 4xLR clonase  
6. Vortex each LR clonase twice for 2 seconds and put back to a rack 
7. Add  2.5 µL LR clonase 
a. Use 12.5 µL multichannel pipette (Tick)  
b. Program HTP_LR 
c. Resuspend and discard the tips 
8. Repeat until each well of both plates has received LR clonase. 
9. Cover LR plates with alu foil  
10. Incubate LR plates overnight at 25°C (PCR machine, Thermoblock) 
11. Close the plate with the PCR reaction with alu foil and store the PCR products at -20°C 
 
Stop point: LR plates can be stored at -20°C until proceeding with transformation 
 
 
 
 
Preparing square agar plate (should be done at least the day before needed) 
 
Check list 
● LB Agar (250ml/square plate) 
● Square plates and divider 
● Microwave 
 
 
1. Take LB-Agar (250ml) from IMB media lab 
2. Use aseptic bench working technique 
3. Heat Agar in the microwave (program: soften/melt, 2= melt dark chocolate, 100 = 5,5 min; 
after 3x the agar is liquid) 
4. Let it cool down (i.e. add a clean stirrer to the agar and place the bottle on the magnetic 
stirrer, adjust the temperature to 50°C and 250rpm) 
5. Add antibiotic (250µl) when the agar is cooled down sufficiently and you are ready to pour the 
plates 
6. Take out the plate from the plastic protection 
7. Add agar to the plate (pop bubbles with a pipette tip or move them to the side) 
8. Take out the grid from the plastic protection 
9. Add the grid in the square plate with agar 
--> the grid does not stay down - weigh down the grid with something (i.e. a 250ml bottle) 
10. Let the agar solidify 
11. Store at 4°C (upside down) 
 
 
Day 3  Proteinase K digestion, Transformation and plating 
 
Check list 
● Proteinase K (2µg/µl) 
● Ice box for 2 PCR plates with LR reaction 
● Ice box for 2 PCR plates with competent DH5α cells 
● Thermoblock/PCR machine for heat shock  
● Thermoblock/PCR machine for 2 plates for recovery step 
● 2 racks for 2 PCR plates 
● 10ml reservoir to pour SOC medium 
● SOC medium (8ml/plate) 
● DH5α  (2 PCR plates with aliquots of 30 µL) 
● Square plates with agar (48 wells, 4 plates needed for 1 inoculation plate) 
 
 
Proteinase K digestion and Transformation: (~3h) 
 
1. Take out SOC medium (for one well = 80 µL, for 1 plate = 8 mL) and let it thaw at room 
temperature  
a. 50ml takes long time to thaw, could be placed at 4°C the afternoon before 
2. Use aseptic bench working technique  
3. Take out DH5α from -80°C  
a. Put them immediately on the ice 
b. Let them thaw 
c. Label the plate (i.e. CL01TR_01.1, CL01TR_01.2) 
4. Take out the LR plates from the incubation  
5. Briefly centrifuge the LR plates 
6. Add 1µl of Proteinase K into all wells 
a. Take out 8 tubes 
b. Use multichannel pipette 
7. Vortex briefly 
8. centrifuge briefly 
9. Incubate at 37°C for 10min 
10. Transfer the plates on ice 
11. Transfer 10µL of each LR reaction into the DH5α plate  
a. Use a multichannel pipette 
b. It is difficult to get the whole 10µl out (~7µl) 
c. No resuspension, no vortex when adding the LR reaction into the DH5α 
d. Close the plate with alu foil 
12. Incubate for 30 minutes on ice (bacteria with LR product) 
13. Meanwhile: set the thermoblock to 42°C for the heat shock and set thermoblock for 2 plates 
to 37°C 
14. 45sec at 42°C (heat shock) 
a. One plate after the other 
15. Immediately move the plate on ice for 2 minutes 
16. Pour SOC medium to the reservoir 
17. Transfer 80 µL of SOC medium to each well 
a. Using a multichannel pipette 
b. Discard tips after each column 
18. Transfer the plate to thermoblock/PCR machine set to 37°C 
19. Incubate for 1 hour shaking at 300rpm (no shaking is also working) 
20. Repeat the heat shock for all PCR plates with transformed cells 
21. After 1 hour of incubation, proceed with plating 
 
 
Plating bacteria (~ 1 h) 
 
1. Take the agar plates out of 4°C and let them dry (at the latest after the heat shock) 
2. Label the plates (i.e. CL01TR_01.1a / CL01TR_01.1b & CL01TR_01.2a / CL01TR_01.2b) 
a. The square plates have 48 wells -> 2 square plates for 1x 96 well plate needed 
3. Place the agar plate on a paper grid with numbers and letters 
a. You will know better which grid field corresponds to which plate field 
4. Add the glass beads to the grid fields (between 4-12 glass beads/ field is ok) 
5. Add 70µl of the transformation to each field 
a. If you are slow it is better to work column by column 
i. Add glass beads, add bacteria, shake 
ii. You can use the lid as protection that the glass beads don’t “jump” in the 
other column 
6. Shake the plate  
a. Hold and shake the plates with both hands  
b. Check that all beads in all wells are moving  
c. Do not shake too long  
7. Press the lid on the agar plate and turn the plate over 
8. Take the bottom of the agar plate away 
9. Transfer the glass beads in a big glass beaker 
10. Clean the lid with 70% Ethanol 
11. Cover the agar plate with the lid 
12. Repeat steps 4-11 for all plates / columns 
13. Incubate overnight at 37°C upside down 
14. Add 70% ethanol to the glass beads, wash with water, transfer into a dry glass bottle and 
send them for autoclaving 
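
Because one 96-well transformation plate is spread over two 48-field square plates (a and b), it is easy to lose track of which field belongs to which well. The sketch below shows one possible mapping. The split used here (columns 1-6 on plate a, columns 7-12 on plate b) is purely an assumption for illustration; use whatever layout is printed on the paper grid for your project.

```python
def agar_field(well):
    """Map a 96-well position (e.g. 'C10') to (agar plate suffix, field)."""
    row, col = well[0].upper(), int(well[1:])
    if row not in "ABCDEFGH" or not 1 <= col <= 12:
        raise ValueError(f"not a 96-well position: {well!r}")
    # Assumption: columns 1-6 go to plate 'a', columns 7-12 to plate 'b'.
    if col <= 6:
        return "a", f"{row}{col}"
    return "b", f"{row}{col - 6}"
```

With this mapping, well C10 of CL01TR_01.1 would be plated on field C4 of CL01TR_01.1b.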
 
 
 
 
Day 4 Colony picking and inoculation (~ 2-3 h) 
 
 
Check list 
● LB medium (1,5 ml per well, 150ml per plate) 
● Toothpicks for picking 
● Deepwell plates (Deepwell plates that are round on top and bottom, Starlab # E2896-2110)  
● 1250µl digital multichannel pipette (Track) with tips 
 
 
Steps: 
The steps are best done with one or two additional people checking that the right well is 
picked and put into the correct well in the deepwell plate 
 
 
1. The experimenter takes the agar plates and uses the computer script to enter which wells have 
colonies (i.e. A1 - yes, A2 - no)  
a. Name of the script: script_B_picking_script.bat 
b. Can be run on lab desktop PC or via remote desktop from personal computer 
c. Takes ~ 1 hour 
d. possible break point, leave the agar plates at 4°C over the weekend 
2. Use the script that makes the rearray for your experiment to create a new plate layout 
a. Name of the script: 
b. Make sure that the rearray information is saved in the expr_clone_info MySQL DB 
table 
3. Use aseptic bench working technique 
4. Label the deepwell plates (i.e. CL01GExDW_01.1a / CL01GExDW_01.1b; 
CL01GExDW_01.2a / CL01GExDW_01.2b) 
5. Fill 1,5 ml LB-Medium in the wells  
a. Use the 1250µl digital multichannel pipette  
6. Pick one colony from the first well 
a. Using a toothpick 
b. If you want to prepare 2 identical plates: stir in the corresponding well of the deep-
well for a few seconds, then pick the same colony with the same toothpick into the 
second pick plate 
c. With the new 96 MiniPrep Kit you should get enough DNA with one deepwell plate  
d. You can leave the toothpick in the deepwell until you are done with one column 
7. Continue with the next well 
8. Repeat until all clones are picked 
9. Cover the deepwell plate with breathable foil 
10. Incubate @ 37˚C at 700rpm in the incumixer for 24h  
a. These conditions are important for successful MiniPrep 
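
The colony record from step 1 ends up as 0/1 values in the colonies column of expr_clone_info. The following sketch shows how notes written in the "A1 - yes, A2 - no" style could be turned into that encoding. It is a hypothetical helper for illustration and not the picking script named in step 1a.

```python
def parse_colony_record(text):
    """Parse entries such as 'A1 - yes, A2 - no' into {well: 0 or 1}."""
    record = {}
    for entry in text.split(","):
        well, _, answer = entry.partition("-")
        answer = answer.strip().lower()
        if answer not in ("yes", "no"):
            raise ValueError(f"cannot interpret entry: {entry.strip()!r}")
        record[well.strip().upper()] = 1 if answer == "yes" else 0
    return record
```

`parse_colony_record("A1 - yes, A2 - no")` returns `{"A1": 1, "A2": 0}`; any entry that is neither yes nor no raises an error instead of being silently skipped.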
 
 
 
 
Day 5 Glycerol stock, Miniprep (~ 2 hours per plate) 
 
prepare glycerol stock before miniprep! 
 
Material needed: 
● 40% glycerol, sterile (50µl per well, 5ml per plate) 
● Costar plates for glycerol stock 
● Alu foil to cover glycerol stock 
● 1250µl digital multichannel pipette (Track) with tips 
● Qiagen 96 well Miniprep kit 
● Plate inserts for big centrifuge 
● Big glass beaker 
● Multipette with 5ml tips 
● Alu foil for resuspension 
● Costar plate for elution 
● Vacuum (pump set to 300 mbar) 
● Waste tray = square reservoir (can be autoclaved) 
 
 
Steps: 
1. Work under aseptic bench working conditions 
2. Get deepwell plates from the incubator 
3. Prepare the glycerol stock plates (i.e. CL01GEx_01.1 & CL01GEx_01.2)  
a. By adding 50 µl of 40% glycerol to all required wells of a new costar plate 
b. Check if the bacteria are in suspension - if not, vortex (cover with alu or plastic foil 
before vortexing) 
c. Add 50 µl of the incubated bacteria culture to the corresponding wells and close the 
plate with alu-foil.  
d. Shake 30sec at 750rpm on the Thermomixer 
e. Freeze @ -80˚C 
4. Centrifuge the deepwell plate @ 2100 xg for 5 min. 
5. During centrifugation: Prepare the Qiavac Multiwell with Turbo filter 96 plate and S-Block  
QIAvac 96: 
 
 
a. Seal unused wells with additional tape  
b. Note for those using the unused well from a used plate: Because there are many 
vacuum steps in the procedure and the air flows better through previously-used wells 
(now empty) than the wells that are in use now, make sure that you tape the 
previously-used wells so that the airflow passes through the wells that you want. 
Otherwise, the air will tend to flow through the previously-used wells and reduce the 
efficacy of vacuum suction. 
6. Pour out medium into beaker, tap dry the plate surface with paper towel   
a. If you have 2 identical deepwell plates: add the content of the second deepwell plate 
(CL01GExDW_01.1b) to the corresponding wells of the first plate (using digital 
multichannel to reduce the number of pipetting steps). Centrifuge @ 2100rpm for 5 
min. 
b. Pour out the medium into a beaker 
c. Tap the plate on a paper towel to empty completely 
7. Add 300 µl of buffer P1 to each well  
a. Using the multipette or digital multichannel 
8. Close plate with alu foil   
9. Vortex to completely resuspend the bacteria 
10. Remove foil 
11. Add 300 µl of buffer P2 to each well  
a. Using the multipette or digital multichannel 
12. Close the plate with the plastic foil from the kit  
13. Invert 6-8 times 
14. Incubate 5min at room temperature.  
a. Do not let the lysis take longer than 5 min  
b. Count in time from first well having received the lysis buffer 
15. Remove foil  
16. Tap dry the plate top 
17. Add 300 µl of buffer S3 to each well  
a. Using the multipette or multichannel  
18. Close the plate with the plastic foil from the kit 
19. Invert 6-8 times 
20. Remove foil and tap dry the plate top 
21. Transfer content of each well in the corresponding well in the Turbo Filter 96 plate 
a. Using the digital multichannel (set to 1000µl) 
22. Apply vacuum  
a. Pump set to 300 mbar  
b. To suck liquid in the S-block 
c. Make sure all liquid has passed the filter plate 
23. Close vacuum 
24. Remove filter-plate from assembly 
25. Discard the filter plate 
26. Remove S-Block  
a. DNA is here 
27. Install waste tray in the assembly 
28. Install Plasmid Plus 96 plate in the assembly 
29. Seal and label unused wells with tape 
30. Add 300 µl of buffer BB to each well in S-Block  
a. Using the multipette or digital multichannel 
31. Close the S-Block with the plastic foil from the kit 
32. Invert 1-3 times 
33. Remove foil 
34. Tap dry the S-Block on top 
35. Transfer content of each well in the corresponding well in the Plasmid Plus 96 plate 
a. Using the digital multichannel (set to 1250µl) 
36. Apply vacuum  
a. Pump set to 300 mbar 
b. To suck liquid in the waste tray 
c. Make sure all liquid has passed the plate 
37. Close vacuum 
38. Transfer 900 µl of buffer PE in each well in the Plasmid Plus plate 
a. Using the digital multichannel 
39. Apply vacuum  
a. Pump set to 300 mbar 
b. To suck liquid in the waste tray 
c. Make sure all liquid has passed the plate 
40. Close vacuum 
41. Empty waste tray  
42. Pat dry the nozzles of the Plasmid Plus plate until no liquid can be seen on the paper towel 
43. Put back the waste tray and assemble 
44. Apply vacuum for 10 min 
a. Pump set to 300 mbar 
b. To dry the filter 
45. Close vacuum 
46. Lift the top plate from the base - but not the Plasmid Plus plate from the top plate! 
47. Vigorously tap the top plate on a stack of absorbent paper until no more drops come out 
a. Blot the nozzles of the Plasmid Plus plate with clean absorbent paper 
48. Remove the waste tray 
49. Place 2 “old” costar plates (one with lid and one without lid) in the assembly  
a. To reach the required height, the nozzles should reach the wells of the costar plate 
50. Place your elution plate (i.e. CL01GExMP_01.1) in the assembly and reassemble 
51. Add 70µl of water/EB-buffer to the center of each well of the Plasmid Plus plate  
a. Using a manual multichannel 
52. Let stand for 3 min 
53. Apply vacuum for 1 min 
54. Close vacuum 
55. Disassemble the Qiavac Multiwell to get your DNA 
 
Stop point. DNA can be frozen @ -20˚C and stored. 
 
 
Nanodrop measurement 
 
1. Using part 1 of script C, create a template for the Nanophotometer and save it in the 
Nanophotometer folder on the group drive (i.e. /imb-luckgr/NanoPhotometer/HTP_data/CL100/) 
2. Thaw plates with DNA (i.e. CL01GExMP_01.1 & CL01GExMP_01.2)  
3. Centrifuge 3min @ 3000g in the big centrifuge  
4. Load the correct measuring template to the NanoPhotometer  
a. On the NanoPhotometer, click ‘Nucleic Acid’, then swipe right and click the top right 
button that looks like a barcode. Click ‘Sample’ and then click ‘Import’. Select 
‘Network_Groupdrive’ to find the NanoPhotometer folder in the group drive mentioned 
in point 1. There you can find your measuring templates and load them into the 
Nanophotometer for measurement 
5. Measure the DNA concentration 
6. Save the data to the group drive in the corresponding folder 
a. Save the measurement in the same folder so that you can access it through the 
groupdrive too 
b. If the folder ‘Network_Groupdrive’ does not appear on the Nanophotometer, try 
restarting it 
7. If needed you can concentrate your DNA: 
a. Place the plate (without lid) in the desiccator  
b. Turn on vacuum and let evaporate until the desired volume/concentration is reached 
c. ~36h for 20-25µl reduction in volume 
d. Ask Christian for help, if needed 
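
How much the concentration rises for a given volume reduction follows from the DNA mass staying constant. A minimal sketch, assuming only solvent evaporates and no DNA is lost; the function name is hypothetical.

```python
def concentration_after_evaporation(conc_ng_ul, start_volume_ul, removed_ul):
    """New DNA concentration after evaporating removed_ul of solvent."""
    remaining_ul = start_volume_ul - removed_ul
    if remaining_ul <= 0:
        raise ValueError("cannot remove the whole volume")
    # The DNA mass stays constant; only the volume shrinks.
    return conc_ng_ul * start_volume_ul / remaining_ul
```

For example, a well with 80 ng/µl in 50 µl reaches 100 ng/µl once 10 µl has evaporated.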
 
 
Day 6  DNA dilution and sequencing 
 
The first sequencing is done with both backbone primers (forward and reverse), full coverage 
sequencing for inserts is done after results come back for those that need it 
 
1. Use part 2 of script C to calculate the dilutions needed 
2. Make sure that the measured DNA concentrations are uploaded to the expr_clone_info 
MySQL DB table 
3. Prepare the dilutions (i.e. CL01GExDil_01.1 & CL01GExDil_01.2)  
a. according to the template you created 
b. DNA concentration should be around 100 ng/µl 
4. For the expression test you will need to dilute the NL plate once more 
a. Option 1. 1:10 (you take 1µl for expression test) 
b. Option 2. 1:25 (you take 2µl for expression test) 
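
The dilution to about 100 ng/µl and the two options for the expression test are plain C1·V1 = C2·V2 calculations. The sketch below is not script C; the function names and the choice of final volume are assumptions for illustration.

```python
def dilution_volumes(stock_ng_ul, final_volume_ul, target_ng_ul=100.0):
    """Return (DNA volume, water volume) in µL to reach target_ng_ul."""
    if stock_ng_ul <= target_ng_ul:
        # Too dilute to dilute further: use the stock as it is.
        return final_volume_ul, 0.0
    dna_ul = final_volume_ul * target_ng_ul / stock_ng_ul
    return dna_ul, final_volume_ul - dna_ul


def ng_in_expression_test(dilution_factor, volume_ul, start_ng_ul=100.0):
    """DNA mass taken for the expression test after a further 1:x dilution."""
    return start_ng_ul / dilution_factor * volume_ul
```

Starting from 100 ng/µl, option 1 (1:10, 1 µl) delivers 10 ng and option 2 (1:25, 2 µl) delivers 8 ng; the 1:25 dilution corresponds to 4 ng/µl.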
 
DNA stock 
1. Label PCR plates with labels for DNA stock (i.e. CL01GExSt_01.1 & CL01GExSt_01.2) 
2. Pipette 10µl of the undiluted DNA (CL01GExMP_01.1 & CL01GExMP_01.2) to the stock 
plates 
3. Close plates with alu foil and give to Mareen for storage 
 
Sequencing 
 
Each plate has to be submitted individually to StarSeq. You will get a zip file containing one .ab1 and 
.seq file for each sample submitted in the plate. You can use the plate barcodes for plates with more 
than 78 samples or you can submit individual barcodes for plates with <78 samples. If you are 
sending a plate to Starseq you have to have at least 48 samples on the plate 
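
The sample-number rules above can be summarised as a small decision function. This is a sketch of the rules as written in this protocol, not of StarSeq's own ordering logic; the return strings are hypothetical and the case of exactly 78 samples is not covered by the text.

```python
def starseq_submission(n_samples):
    """Suggest how to submit n_samples, following the rules stated above."""
    if n_samples > 78:
        return "plate barcode"
    # The text gives no rule for exactly 78 samples; treating it like the
    # '<78' case is an assumption.
    if n_samples >= 48:
        return "plate with individual barcodes"
    return "too few samples for a plate (minimum 48)"
```

A full plate of 96 samples therefore goes with a plate barcode, while a plate with 60 samples needs individual barcodes.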
 
Steps: 
 
1. Prepare an Excel file (one for each sequencing run) with the file names of your sequencing 
samples in 96-well format. Suggested file names: e.g. mCit-[ORF ID]-F for the mCit construct 
and forward read. The layout should correspond to what you have generated after picking the 
colonies (i.e. CL01GExh_01.1) 
2. Label the PCR plates for sequencing (i.e. CL01GExSF_01.1 & CL01GExSF_01.2; 
CL01GExSR_01.1 & CL01GExSR_01.2). 
3. Add 1µl of the corresponding primer to the plates 
a. primer # 44 NanoLuc-398fwd - for N-terminal NL fusion 
b. primer # 47 mCitrine-547fwd for N-terminal mCit fusion 
c. primer # 51 pEXP_rev for no C-terminal fusion 
d. Using the multipette and combitip 1ml 
e. Alternatively, one can also aliquot the primers into PCR tubes and use digital 
multichannel to distribute the primers into the wells 
4. Add 6 µl of the diluted DNA to the sequencing plate 
a. I.e. from CL01GExDil_01.1 & CL01GExDil_01.2 
b. Using manual multichannel pipette 
5. Close the sequencing plate using the alu foil 
6. Order the sequencing on the StarSeq webpage 
a. Use the Excel file created in step 1 to copy paste the plate layout into their web form 
7. Pack plate together with paperwork in a padded envelope 
a. To avoid the foil getting pierced 
b. Submit each plate as an individual sequencing run 
c. When submitting multiple plates, results will likely not come back all by next morning 
but over the next 24-36h 
8. Process the sequencing results with the Sanger seq processing pipeline  
a. Instructions can be found in labfolder under templates 
9. Make sure to update results accordingly in the expr_clone_info MySQL DB table 
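
The 96-well layout of sample names from step 1 can also be generated with a short script instead of being typed by hand. The following is only a minimal Python sketch: the function names, the example ORF IDs and the tab-separated output are assumptions made for illustration and are not part of the lab's Sanger seq processing pipeline. Only the naming pattern mCit-[ORF ID]-F is taken from the protocol.

```python
import string

ROWS = string.ascii_uppercase[:8]   # plate rows A-H
COLS = range(1, 13)                 # plate columns 1-12


def sequencing_sample_names(orf_by_well, tag="mCit", read="F"):
    """Return {well: sample name} using the pattern tag-[ORF ID]-read.

    orf_by_well maps wells such as "A1" to ORF IDs (hypothetical input format).
    Wells without an ORF ID are skipped.
    """
    names = {}
    for row in ROWS:
        for col in COLS:
            well = f"{row}{col}"
            orf_id = orf_by_well.get(well)
            if orf_id is not None:
                names[well] = f"{tag}-{orf_id}-{read}"
    return names


def as_plate_grid(names):
    """Lay the names out as 8 tab-separated rows of 12 columns for copy-pasting."""
    lines = []
    for row in ROWS:
        lines.append("\t".join(names.get(f"{row}{col}", "") for col in COLS))
    return "\n".join(lines)


# Example with made-up ORF IDs
example = sequencing_sample_names({"A1": "1234", "A2": "5678", "B1": "91011"})
grid = as_plate_grid(example)
```

The grid string can be pasted into a spreadsheet; empty wells stay empty so that the layout still matches the picked plate.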
 
 
Day 7 Transfection 
 
Expression test 
 
CS notes:  
- I recently got 6 x 10^6 HEK293 cells out of one T-25 flask. 
- I found it more convenient to do the triplicates in separate plates. 
- I did not mix the DNA with Lipofectamine beforehand, only when I added the DNA to the final incubation 
plate. 
- While I did NL and mCit the same day, I pipetted them separately as it is very hard to handle 
6 plates at the same time. 
- The volumes I put here (most of the time) depend on your transfection ratio and DNA 
concentrations used. I did NL-constructs 4ng/µl, mCit 100ng/µl, pcDNA 200ng/µl; 2:50 ratio 
 
Steps: 
 
1. Prepare the layout of your plate with the controls  
a. controls: NL-stop + pcDNA, mCit-stop + pcDNA, well with only pcDNA, well with only 
cells 
b. The controls you put depend on your experiment and space you have on the plate. If 
you have doubts, talk to Katja 
2. Prepare the DNA for your controls, PA-mCit-Stop, NL-Stop and pcDNA3.1 
a. can be prepared in PCR strips - then you can later use the multichannel pipette 
3. Prepare an additional dilution of the NL-constructs to 4ng/µl (if you haven’t already) 
4. Take a PCR plate 
5. Add the pcDNA (3µl) to the wells first.  
a. Using multipette or multichannel pipette  
b. Doing the pcDNA first allows you to do everything with one tip. Try to get the DNA to 
the bottom of the plate. 
6. Add the NL-Stop (for the mCit-constructs) or mCit-Stop (for the NL-constructs)  
a. 2µl to the wells 
b. Using the multipette or multichannel pipette  
c. for multipette: using one tip is possible for this as the only possible contamination 
would be with pcDNA which can be avoided by putting the DNA at the wall of the 
wells away from the pcDNA 
7. Add your diluted construct DNA (mCit or NL)  
a. 2µl if you are using the DNA concentrations written on top 
b. Using the multichannel pipette 
8. Add the DNA for the controls to the wells 
9. Tap plate to mix all DNA in the bottom of the well 
10. Add 100µl Optimem to each well 
11. Prepare the Lipofectamine-Optimem mixture in a 15ml Falcon tube 
a. You do not need to do quadruples here. This saves some lipofectamine 
b. Example:  
78 wells/plate x 0.5µl Lipo/well x 3 plates = 117µl Lipo 
78 wells/plate x 25µl Optimem/well x 3 plates = 5.85ml Optimem 
Now add some for the reservoir: 
=> 120µl Lipofectamine + 6ml Optimem 
12. Label the plates for incubation (i.e. LuXXXrXX) 
13. Add Lipo-Opti mixture to a 10ml reservoir by pipetting  
a. Decanting is suboptimal as it leaves some residual mixture in the Falcon tube 
14. Add 25µl Lipo-Opti mixture to each well of the incubation plates (white 96well plate for LuTHy) 
a. Using the multichannel pipette 
15. Take out cells, wash and add trypsin 
16. While the trypsinization is ongoing:  
a. Transfer the DNA-Opti mixture to the incubation plates 
b. You can use a digital multichannel pipette (Track) to speed it up (aspirate 75µl, dispense 
3x25µl) 
c. A predispense step is needed to get accurate amounts for the first dispense of the 
multi-dispense 
d. The 20min time limit starts now 
17. Quench trypsin, resuspend cells, count cells, centrifuge and adjust concentration to 2.67 x 10^5 
cells/ml in phenol-red free DMEM medium. 
18. Decant the cells in a 25ml reservoir 
19. Add 150µl cell suspension to the plates 
a. Using the digital multichannel 
b. aspirate 450µl, then dispense 3x150µl, doing the triplicate without changing tips 
c. Use the program called "LUTHY CELLS" on the 1250µl digital multichannel pipette (Track). 
The program first resuspends the cells multiple times (called ‘Mix’ in the program), and 
then aspirates 450µl for the repeat dispense of 3x150µl 
20. Incubate for 48h 
21. Proceed with measurement as usual for the LuTHy assay 
22. For the LuTHy processing scripts to be able to process your data, KL numbers have to be 
generated for all the constructs on your plate.  
23. Make sure the KL numbers are generated and saved in the LUCK_DB.Luck_lab_plasmids 
table along with all available information.  
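
The volume example in step 11 and the cell density in steps 17 to 19 are simple multiplications that can be checked with a few lines of code. The sketch below is a hedged illustration only: the function names are hypothetical, and the default volumes (0.5µl Lipofectamine and 25µl Optimem per well) are the ones given in the example above. The extra volume for the reservoir is not computed; the protocol simply rounds up to 120µl Lipofectamine + 6ml Optimem.

```python
def lipo_optimem_mix(wells_per_plate, plates,
                     lipo_ul_per_well=0.5, optimem_ul_per_well=25.0):
    """Return (Lipofectamine in µl, Optimem in µl) without reservoir excess."""
    n_wells = wells_per_plate * plates
    return n_wells * lipo_ul_per_well, n_wells * optimem_ul_per_well


def cells_per_well(cells_per_ml, volume_ul):
    """Number of cells dispensed into one well for a given suspension volume."""
    return cells_per_ml * volume_ul / 1000.0


# Values from the protocol: 78 wells/plate, 3 plates (triplicates)
lipo_ul, optimem_ul = lipo_optimem_mix(78, 3)
print(f"Lipofectamine: {lipo_ul} µl, Optimem: {optimem_ul / 1000} ml")

# 150 µl of a 2.67 x 10^5 cells/ml suspension per well
print(f"Cells per well: {cells_per_well(2.67e5, 150)}")
```

With the protocol's numbers this gives 117µl Lipofectamine and 5.85ml Optimem, and roughly 40,000 cells per well.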
 
 
Make sure to update the LUCK_DB.Luck_lab_plasmids table according to new sequencing and 
other experimental results you obtain, i.e. enter whether the plasmid is full-length or partially 
sequenced and add mutation information. Let Katja know if an ORF turned out to be a different ORF and 
which ORFs need a new ORF ID. Also let Katja know about KL numbers that should be deleted 
because the insert could not be confirmed. 
 
 
5.1.2 The medium-throughput site-directed mutagenesis
Site-directed mutagenesis (without Kit)
Day 0 Primer design
Criteria for mutagenesis primers:
- Primer length should be 32-36 nt. If it is shorter, the mutation might not be cloned properly!
- GC content of the primer should be between 40 and 60%
- Difference in melting temperature between the forward and reverse primers should ideally be
less than 5ºC. (use NEB Tm calculator: https://tmcalculator.neb.com/#!/main and select
Phusion as the product group for the melting temperature of primer)
- The annealing temperature of the PCR reaction should be set 5ºC lower than the lowest
melting temperature among the primers
- The 3' end of the primer should ideally be C or G
- The annealing temperature should be below 70°C if possible
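
The criteria above can be checked automatically before ordering. The following Python sketch is an illustration under stated assumptions: the function names and the example primer sequence are hypothetical, and the melting temperatures are not computed here but are expected as input from the NEB Tm calculator.

```python
def gc_content(seq):
    """GC content of a primer in percent."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def check_primer(seq):
    """Check one primer against the length, GC content and 3' end criteria."""
    seq = seq.upper()
    return {
        "length_32_36_nt": 32 <= len(seq) <= 36,
        "gc_40_60_percent": 40.0 <= gc_content(seq) <= 60.0,
        "three_prime_end_c_or_g": seq[-1] in ("C", "G"),
    }


def check_pair(tm_fwd, tm_rev):
    """Check a primer pair; Tm values (°C) come from the NEB Tm calculator."""
    annealing = min(tm_fwd, tm_rev) - 5.0   # 5°C below the lowest Tm
    return {
        "tm_difference_below_5": abs(tm_fwd - tm_rev) < 5.0,
        "annealing_temperature": annealing,
        "annealing_below_70": annealing < 70.0,
    }


# Made-up 34 nt primer, for illustration only
primer = "GATCCTGAAGCTGGAGGAGCTGAAGACCATGCTG"
print(check_primer(primer))
print(check_pair(68.0, 66.0))
```

Any criterion that comes back False marks a primer that should be redesigned or at least looked at again by hand.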
Primer order info if you have 24 or more primers (IDT)
If you want to order primers in plates
Price-wise:
The prices for DNA oligos in tubes or in plates can be seen on our website. The prices are
usually a bit lower for oligos in plates; however, one should look at the final cost of the whole order. For
example, when ordering in plates there is a minimum of 24 oligos that should occupy the plate. So in
the long run, plates are not always the cheaper option.
dry or wet primers
When ordering in plates you can choose your oligos to be normalized to a certain amount. That can
be either as a pellet (dry) or resuspended (wet). In that way you will avoid needing higher volume than
the capacity of the well. These can be adjusted via the "Plate Specifications" button when ordering the
plate oligos. I would say that there are no pros and cons for primers in pellet or in solution in terms of
primer performance, stability and so on. It is more a matter of experimental needs and set up. Some
researchers prefer to receive their oligos ready to use whereas others want to resuspend them in a
certain buffer or in a specific dilution. When automation and robot handling is included, people prefer
having plates than tubes. On the other hand, when having the oligos in plates and manual pipetting is
done the chances for contamination or spillage can be higher.
Design of a point mutation (non-kit way)
- Design forward and reverse primers that overlap at the site of mutation. Try to locate the
mutation to be at the middle of the overlapping region so that the mutation is flanked by
complementary sequences. The overlapping part (that contains your mutation) should be
15-22 nt long.
- Here is an example to mutate L152E in Benchling. To find out which codon codes for E, right-click
on L152 and select ‘Change amino acid’. Remember to change Organism to Homo sapiens.
There you can find the codons that code for the amino acid that you want to mutate to, and
the best codon change to achieve that amino acid substitution. Take into consideration the
number of bases that need to be changed for the amino acid substitution and the codon
frequency to ensure optimal mutagenesis.
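
The first of the two considerations, the number of bases that need to be changed, can be illustrated with a small script. This is a minimal sketch and not a replacement for the Benchling dialog: it uses the standard genetic code, the function name is hypothetical, the starting codon CTG for L152 is an assumed example (the actual codon depends on your ORF), and codon frequency in Homo sapiens is deliberately not modelled.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rank_codons(original_codon, target_aa):
    """List codons for target_aa, sorted by the number of base changes needed."""
    original_codon = original_codon.upper()
    ranked = []
    for codon, aa in CODON_TABLE.items():
        if aa == target_aa:
            changes = sum(x != y for x, y in zip(original_codon, codon))
            ranked.append((codon, changes))
    return sorted(ranked, key=lambda item: (item[1], item[0]))


# Assumed example: leucine encoded by CTG, mutated to glutamate (E)
print(rank_codons("CTG", "E"))
```

For this assumed starting codon, GAG needs two base changes and GAA needs three; the final choice should still be weighed against the codon frequency shown in Benchling.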
Design a deletion (non kit way)
- The forward and reverse primers have to overlap at the overhang so that the synthesized
strands can circularize after amplification. The non-overhang region should have ~20 nt and
the overhang (overlap) region should have 15 nt.
- Here is a schematic showing how the primers with overhang should be designed (the
scheme and explanation are retrieved from Takara:
https://www.takarabio.com/learning-centers/cloning/applications-and-technical-notes/mutagenesis-with-in-fusion-cloning).
Design a Deletion with Q5 kit
- Design forward and reverse primers that exclude the deletion site.
- Here is the schematic.
Design sequencing primer
- use primer design tool
- follow the instruction of the primer designer tool
- in labfolder → “templates” → “instructions” → “How to design primers with
PrimerDesigner”
Primer order from Sigma
1. design the primer
2. order Oligos in solution (water)
a. the price is the same and it will save you time
3. import all needed data into the Excel file from Sigma
a. there are 2 excel files: 1x for single oligo order and 1x to order oligos in plate
b. can be found on the group drive (Primer_AG_Luck → Sigma_order_template or
DNAPLATE_96well_8ch_template_sigma)
c. the excel sheet to order single oligos is also saved on the intranet (Administration →
Purchasing → Oligo Ordering → Sigma Oligo template)
d. the link to order oligos in plate: [DNA-Oligos in Platten
(sigmaaldrich.com)](https://www.sigmaaldrich.com/DE/de/configurators/plate?product
=dnaplate&activeLink=sequenceUpload)
e. Oligos in plate:
i. ask for quote for your order
ii. after uploading the file you need to set the “scale”, “purification” and “format”
iii. scale = 0,025; purification = desalt; format = in solution (water)
4. upload the excel sheet to the order manager under service agreements
a. oligos are ordered every Tuesday and Thursday after 2pm
b. it might take up to 1 week to receive the oligos; oligos in plates take ~2 days longer
than oligos in tubes
5. after receiving the oligos you have to dilute them
Primer order from IDT
1. design the primer
2. register with IDT
3. go to “Products and Service” → “DNA and RNA” → “Custom DNA oligos” → “DNA Oligos” →
“Single-stranded DNA” (you can choose between tubes and plates) → press the button “order
now”
4. order primer in tubes:
a. enter all required information in the fields
b. you can use the button “bulk input” if you have several oligos to order
i. Scale: 25nmole DNA oligo
ii. Formulation: you can choose “None” = dry or “LabReady (100µM in IDTE,
pH8,0)”
iii. Purification: Standard desalting
iv. Sequence:
5. order primer in plates:
a. choose plates
b. 25nmole → order
c. download the excel sample ordering template (under upload plates)
d. fill out the excel sheet
e. upload the excel sheet
f. check the upload
g. if necessary make changes
6. add to order
7. order the oligos via internet, add the email of the purchase department
“einkauf@imb-mainz.de” in the distribution list for order confirmations
8. enter your IDT order in the Order Manager
Primer dilution in plates
1. take a new PCR plate
2. label the plate (i.e. MU01PrDilF_01 and MU01PrDilR_01)
3. add 90µl water in each well
4. add 10µl primer in the corresponding well
5. close the plate
6. mix/vortex
7. freeze until needed
Preparation of template DNA 10ng/µl (i.e. MU01TD_01)
1. take a new PCR plate
2. label the plate (i.e. MU01TD_01)
3. add 9µl water in each well
4. add 1µl template DNA in the corresponding well
a. take the template DNA from your diluted MiniPrep 100ng/µl
5. close the plate
6. mix/vortex
7. freeze until needed
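
Both dilutions above are 1:10 dilutions. The sketch below shows the underlying calculation (C1 x V1 = C2 x V2) in Python. The function name is hypothetical, and the 100µM primer stock is an assumption based on the IDT "LabReady" formulation mentioned above; if your oligos arrive at a different concentration, the result changes accordingly.

```python
def diluted_concentration(stock_conc, stock_volume_ul, diluent_volume_ul):
    """Concentration after mixing stock and diluent (same unit as stock_conc)."""
    total_volume_ul = stock_volume_ul + diluent_volume_ul
    return stock_conc * stock_volume_ul / total_volume_ul


# Primer working solution: 10 µl primer (assumed 100 µM stock) + 90 µl water
primer_um = diluted_concentration(100.0, 10.0, 90.0)

# Template DNA: 1 µl MiniPrep DNA (100 ng/µl) + 9 µl water
template_ng_per_ul = diluted_concentration(100.0, 1.0, 9.0)

print(f"Primer working solution: {primer_um} µM")
print(f"Template DNA: {template_ng_per_ul} ng/µl")
```

Under these assumptions the primer working solution is 10µM and the template DNA is 10ng/µl, which are the concentrations used in the PCR reaction on Day 1.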
Day 1 PCR, DPN1 digestion and E-Gel
Checklist PCR, DPN1 digestion and E-Gel:
96-well skirted PCR plates (3x)
5ml tube (Axygen, # SCT-5ml-S) or 50 mL falcon tube - to prepare PCR master mix, 50ml
because of multipette
PCR foil
multichannel pipette 10µl
10µl pipette tips (4 boxes)
multichannel pipette 50µl
100µl tips
multipette
combitips 1ml (to add the PCR Master Mix)
combitips 0,1ml (to add DPN1)
100µl pipette
100µl pipette tips
1000µl pipette
1000µl pipette tips
ice block - to keep PCR components in cold
PCR machine or Thermomixer
PCR components
DPN1
E-Gel 96 1% Agarose (GP) (invitrogen, # G700801)
E-Gel 96 High range DNA marker
PCR program:
temperature   time                 cycles   step
98°C          2 min                1x       initial denaturation
98°C          30 s                 25x      denaturation
__ °C *       15 s                 25x      primer annealing
72°C          5 min (1 min/1 kb)   25x      extension
72°C          5 min                1x       final extension
16°C          ∞                    1x
* The temperature depends on the primers; try to keep it below 70°C when designing the
primers.
If the melting temperature of one primer is below 69°C, then annealing temperature = 55°C; if higher, 63°C.
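
The two-level annealing rule in the note above can be written as a small function. This is a sketch of one possible reading of the rule: "one primer" is taken to mean the primer with the lower melting temperature, and a melting temperature of exactly 69°C is treated as "higher". Both points are assumptions, as the protocol does not spell them out, and the function name is hypothetical.

```python
def annealing_temperature(tm_fwd, tm_rev, threshold=69.0):
    """Pick 55°C or 63°C depending on the lower of the two primer Tm values."""
    if min(tm_fwd, tm_rev) < threshold:
        return 55.0
    return 63.0


print(annealing_temperature(66.0, 72.0))   # one primer below 69°C
print(annealing_temperature(70.0, 71.0))   # both primers at or above 69°C
```

Using only two annealing temperatures makes it possible to sort all reactions of a plate into at most two PCR runs.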
PCR reaction (50µl total)
PCR components                  1x (1 well)   x 100
primer (10µM)                   2.5 µL        250 µL
primer (10µM)                   2.5 µL        250 µL
template DNA (10ng)             1 µL          100 µL
dNTPs (10mM)                    1 µL          100 µL
10x HF Buffer                   10 µL         1000 µL
High fidelity DNA polymerase    0.5 µL        55 µL
H2O                             32.5 µL       3250 µL (= 5x 650 µL)
Master Mix
PCR components                  1x (1 well)   x 100
dNTPs (10mM)                    1 µL          100 µL
10x HF Buffer                   10 µL         1000 µL
High fidelity DNA polymerase    0.55 µL       55 µL
H2O                             32.5 µL       3250 µL (= 5x 650 µL)
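
For plates with a different number of wells, the master mix volumes can be scaled with a few lines of Python. The sketch below is only an illustration: the function name and the optional excess factor are hypothetical, and the per-well volumes are taken from the PCR reaction table (with 0.5 µL polymerase per well). Note that the table lists 55 µL polymerase for 100 wells, which would match a 10% excess for that component; this is an interpretation and is not stated in the protocol.

```python
# Per-well master mix volumes in µl, taken from the PCR reaction table
MASTER_MIX_PER_WELL_UL = {
    "dNTPs (10mM)": 1.0,
    "HF Buffer": 10.0,
    "High fidelity DNA polymerase": 0.5,
    "H2O": 32.5,
}


def scale_master_mix(per_well_ul, wells, excess=0.0):
    """Scale per-well volumes to a number of wells, with optional excess (0.1 = 10%)."""
    factor = wells * (1.0 + excess)
    return {name: volume * factor for name, volume in per_well_ul.items()}


per_well_total = sum(MASTER_MIX_PER_WELL_UL.values())   # master mix per well
totals_100 = scale_master_mix(MASTER_MIX_PER_WELL_UL, 100)

print(f"Master mix per well: {per_well_total} µl")
for name, volume in totals_100.items():
    print(f"{name}: {volume} µl")
```

The per-well sum is 44 µl, which together with 2 x 2.5 µl primer and 1 µl template DNA gives the 50 µl reaction volume.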
Steps:
1. Label the PCR plate (i.e. MU01PCR_01)
2. Once the PCR components have started to thaw, vortex each PCR reagent
3. Prepare the master mix
a. In 5ml tube or 50mL falcon tube
4. Pipette 44 µl of the master mix in each well of the PCR plate (on ice/cold block)
a. 44,5µl is not possible with multipette
b. Using the multipette and combitip 1ml
5. Add 2,5µl of each primer to the PCR plate
a. Using the multichannel pipette
b. Pipette from the primer working solution plate (i.e. MU01PrDilF_01, MU01PrDilR_01)
6. Add 1µl of purified template DNA (~10ng)
a. Use multichannel pipette
b. Pipette from the template DNA plate (i.e. MU01TD_01)
7. One well should be used as control (master mix without ORF)
8. Close the PCR plate with PCR foil
a. be sure to close every column and row using the grey plastic “card”
9. Vortex the plate briefly
10. Centrifuge briefly
11. Run the PCR ( ~3 hours)
If the melting temperature of one primer is below 69°C, then annealing temperature = 55°C;
if higher, annealing temperature = 63°C
DpnI digestion (using commercial DPN1)
Steps:
1. Prepare a new PCR plate with DPN1
a. Can be done while the PCR is running
b. Label the plate (i.e. MU01Dpn_01)
c. Add 2µl DPN1 with the multipette to the DPN1 plate
i. The multipette must touch the PCR plate while pipetting to ensure that the
2µl of DPN1 enters into each well
2. Add 50µl of PCR product to the plate with DPN1
a. Using the multichannel pipette 50µl
3. Incubate for 1h at 37°C (PCR machine or thermomixer)
4. Incubate for 20 min at 65°C (to stop the DPN1 reaction)
Validation of the PCR product with E-gel
- Info:
- PCR products can be stored at 4°C for 48h, for longer time freeze PCR products
- Document all wells that do not look ok on the gel
Steps:
1. Label the E-Gel plate (i.e. MU01Gel_01)
2. Pipette 25 µl of blue 96 gel loading buffer in the E-Gel plate
a. Using the multipette and 2,5ml Combitip
b. Can be done while PCR is running
3. Add 6 µl of PCR product to each well
a. Using the 10µl multichannel pipette
4. Install 96 well E-gel to the motherbase
5. Load 20µl PCR/buffer mix to each well
a. Using the 50µl multichannel pipette
6. Load 20µl of E-Gel 96 High range DNA marker
7. All empty wells must also be filled with 20µl
a. With buffer or loading dye
8. Insert the plug into the socket
9. Run gel for 12 min
a. Program EG
10. Take picture with GelDoc Station
11. Analyze gel picture with the E-Editor 2.0 software
a. On the desktop PC in the technical room
b. Realign the bands and save it in your cloning project folder
c. The software is pretty self-explanatory and has a manual available under the help
button. Ask Katja for help.
d. explanation how to do it by john
12. Decide if PCR was successful and whether it is worth proceeding
13. Document all wells that did not look ok
a. Add this information to the respective MySQL table
Preparing square agar plate (should be done at least the day before needed)
Check list
LB Agar (250ml/square plate)
Square plates and divider
Microwave
1. Take LB-Agar (250ml) from IMB media lab
2. Use aseptic bench working technique
3. Heat Agar in the microwave (program: soften/melt, 2= melt dark chocolate, 100 = 5,5 min;
after 3x the agar is liquid)
4. Let it cool down (i.e. add a clean stirrer to the agar and place the bottle on the magnetic
stirrer, adjust the temperature to 50°C and 250rpm)
5. Add antibiotic (250µl) when the agar is cooled down sufficiently and you are ready to pour the
plates
6. Take out the plate from the plastic protection
7. Add agar to the plate (pop bubbles with a pipette tip or move them to the side)
8. Take out the grid from the plastic protection
9. Add the grid in the square plate with agar
--> the grid does not stay down - weigh down the grid with something (i.e. a 250ml bottle)
10. Let the agar solidify
11. Store at 4°C (upside down)
Day 2 Transformation and plating
Checklist Transformation and plating:
48 well square plates with agar and antibiotic (2 plates are needed for 96 well plate)
SOC medium (8ml/plate)
10ml reservoir
DH5α (30µl)
multichannel pipette 50µl
100µl pipette tips
multichannel pipette 300µl
300µl pipette tips
200µl pipette
200µl pipette tips
glass beads
70% Ethanol
Thermomixer/PCR machine at 42°C and 37°C
Ice box
Transformation:
1. Take out SOC medium (for one well = 80 µL, for 1 plate = 8 mL) and let it thaw at room
temperature
a. 50ml takes long time to thaw, could be placed at 4°C the afternoon before
2. Use aseptic bench working technique
3. Take out DH5α from -80°C
a. Put them immediately on the ice
b. Let them thaw
c. Label the plate (i.e. MU01_TR01)
4. Take the plate after Dpn1 digestion (i.e. MU01Dpn_01)
5. Transfer the plates on ice
6. Transfer 3 µL of the digested PCR product into the DH5α plate
a. Use a multichannel pipette
b. No resuspension, no vortexing when adding the PCR product to the DH5α
c. Close the plate with alu foil
7. Incubate for 30 minutes on ice (bacteria with PCR product)
8. Meanwhile: set the thermoblock to 42°C for the heat shock and set another thermoblock to
37°C
9. 45sec at 42°C (heat shock)
10. Immediately move the plate on ice for 2 minutes
11. Pour SOC medium to the reservoir
12. Transfer 80 µL of SOC medium to each well
a. Using a multichannel pipette
b. Discard tips after each column
13. Transfer the plate to thermoblock to 37°C
14. Incubate for 1 hour shaking at 300 rpm (it also works without shaking)
15. After 1 hour of incubation, proceed with plating
Plating bacteria (~ 1 h)
1. Take the agar plates out of 4°C and let them dry (at the latest after the heat shock)
2. Label the plates (i.e. MU01_TR_01a, MU01TR_01b)
a. The square plates have 48 wells → 2 square plates for 1x 96 well plate needed
3. Place the agar plate on a paper grid with numbers and letters
a. You will know better which grid field corresponds to which plate field
4. Add the glass beads to the grid fields (between 4-12 glass beads/ field is ok)
5. Add 70µl of the transformation to each field
a. 70µl needs a bit longer to dry - do not turn immediately after shaking
b. If you are slow it is better to work column by column
i. Add glass beads, add bacteria, shake
ii. You can use the lid as protection that the glass beads don’t “jump” in the
other column
6. Shake the plate
a. Hold and shake the plates with both hands
b. Check that all beads in all wells are moving
c. Do not shake too long
7. Press the lid on the agar plate and turn the plate over
8. Take the bottom of the agar plate away
9. Transfer the glass beads in a big glass beaker
10. Clean the lid with 70% Ethanol
11. Cover the agar plate with the lid
12. Repeat steps 4-11 for all plates / columns
13. Incubate overnight at 37°C upside down
14. Add 70% ethanol to the glass beads, wash with water, transfer into a dry glass bottle and
send them for autoclaving
Day 3 Colony picking and inoculation (~ 2-3 h)
Check list
LB medium (1,5 ml per well, 150ml per plate)
Toothpicks for picking
Deepwell plates (Deepwell plates that are round on top and bottom, Starlab # E2896-2110)
1250µl digital multichannel pipette (Track) with tips
Steps:
The steps are best done with one or two additional people checking that the right well is
picked and put into the correct well in the deepwell plate
1. The experimenter takes the agar plates and uses the computer script to enter which wells have
colonies (i.e. A1 - yes, A2 - no)
a. Name of the script: script_B_picking_script.bat
b. Can be run on lab desktop PC or via remote desktop from personal computer
c. Takes ~ 1 hour
d. possible break point, leave the agar plates at 4°C over the weekend
2. Use the script that makes the rearray for your experiment to create a new plate layout
a. Name of the script:
b. Make sure that the rearray information is saved in respective MySQL table
3. Use aseptic bench working technique
4. Label the deepwell plates (i.e. MU01DW_01)
5. Fill 1,5 ml LB-Medium in the wells
a. Use the 1250µl digital multichannel pipette
6. Pick one colony from the first well
a. Using a toothpick
b. If you want to prepare 2 identical plates: stir in the corresponding well of the
deep-well for a few seconds, then pick the same colony with the same toothpick into
the second pick plate
c. With the new 96 MiniPrep Kit you should get enough DNA with one deepwell plate
d. You can leave the toothpick in the deepwell until you are done with one column
7. Continue with the next well
8. Repeat until all clones are picked
9. Cover the deepwell plate with breathable foil
10. Incubate @ 37˚C at 700rpm in the incumixer for 24h
a. These conditions are important for a successful MiniPrep
Day 4 96 well Miniprep and Nanodrop measurement
For the MiniPrep please use the protocol “Miniprep_96well_plate” in labfolder
Day 5 DNA dilution and Sanger sequencing
5.1.3 Figures
Expression profiles
Figure 5.1: Expression of wild-type proteins and mutants. (A) The expression
levels of wild-type (WT), mutants, and patient variants fused to NanoLuc were
measured. The x-axis represents the names of the wild-type, mutants, and variants,
while the y-axis indicates the luminescence intensity for each protein. Each protein
was co-expressed with an empty mCit control to verify expression. (B) The expression
levels of wild-type (WT), mutants, and patient variants fused to mCit were
assessed. The x-axis represents the names of the wild-type, mutants, and variants,
while the y-axis shows the fluorescence intensity for each protein. To verify
expression, each protein was co-expressed with an empty NL control.
Figure 5.2: Expression of wild-type and mutant protein pairs
during BRET saturation assay. (A) The bar plots indicate the luminescence
intensity for NL-CTBP1 wild-type and mutants, and the fluorescence intensity for
the mCit-fused partners, wild-type and mutants. (B) The bar plots indicate the
luminescence intensity for NL-WWOX wild-type and mutants, and the fluorescence
intensity for the mCit-fused partners, wild-type and mutants.
Figure 5.3: The validation of the predicted interface of the WWOX-SNRPC
interaction and the variant effect on this PPI. (A) Schematic representation of
the WWOX-SNRPC interaction and putative interface. The protein containing the
predicted interacting motif is shown in green, and the domain-containing protein is
in grey. The question mark indicates that the AF-MM fragmentation approach was
used to predict the potential interface. (B) AF-MM predicted interface structural
models: (Bi) The WWOX WW domain (grey) with highlighted mutated residues
(blue) for domain validation and the motif (green). (Bii) The same predicted
structure illustrating pathogenic (red) and VUS (grey) variants in the WWOX WW
domain. (C-D) BRET saturation assay data and expression profiles: (C) BRET
saturation curves showing the effects of WW domain mutants (see legend), motif
deletion, and N-terminal truncation of SNRPC on binding affinity. The effects of
pathogenic (red) and VUS (grey) variants on the interaction are also shown. (D)
Expression profiles of wild-type and mutant interactions, with color coding
corresponding to panel (C). (E-F) Validation of the WWOX-CSNK2B interaction: (E)
BRET saturation curves showing the effects of WW domain mutants (see legend)
and pathogenic (red) and VUS (grey) variants on the interaction with CSNK2B. (F)
Expression profiles of wild-type and mutant interactions in the WWOX-CSNK2B
interaction, color-coded as in panel (E).
Figure 5.4: Expression of wild-type and mutant protein pairs
during BRET saturation assay. (A) The bar plots indicate the luminescence
intensity for NL-IQCB1 wild-type and mutants, and the fluorescence intensity for
the mCit-fused partners, wild-type and mutants. (B) The bar plots indicate the
luminescence intensity for NL-SPOP wild-type and mutants, and the fluorescence
intensity for the mCit-fused partners, wild-type and mutants. (C) The bar plots indicate
the luminescence intensity for NL-PPP3CA wild-type and mutants, and the
fluorescence intensity for the mCit-fused partners, wild-type and mutants. (D) The bar plots
indicate the luminescence intensity for NL-REPS1 wild-type and mutants, and the
fluorescence intensity for the mCit-fused partners, wild-type and mutants.
Figure 5.5: Expression of wild-type and variant protein pairs during
BRET saturation assay. (A) The bar plots indicate the luminescence intensity
for NL-WWOX wild-type and variants, and the fluorescence intensity for
mCit-LITAF wild-type and variants. (B) The bar plots show the expression levels
of WWOX-DAZAP2 wild-type and variant interactions: the luminescence intensity
for NL-WWOX wild-type and mutants, and the fluorescence intensity for
mCit-DAZAP2 wild-type and VUS Y46C. (C) The bar plots indicate the luminescence
intensity for NL-WWOX wild-type and variants, and the fluorescence intensity for
mCit-CPSF6 wild-type and VUS. (D) The bar plots indicate the luminescence
intensity for NL-WWOX wild-type and variants, and the fluorescence intensity for
mCit-HOXA1 wild-type. (E) The bar plots indicate the luminescence intensity for
NL-WWOX wild-type and variants, and the fluorescence intensity for mCit-SNRPC
wild-type. (F) The bar plots indicate the luminescence intensity for NL-WWOX
wild-type and variants, and the fluorescence intensity for mCit-CSNK2B wild-type.
Figure 5.6: Expression of wild-type and variant protein pairs during
BRET saturation assay. (A) The bar plots indicate the luminescence intensity
for NL-IQCB1 wild-type and variants, and the fluorescence intensity for the mCit-fused
partners, wild-type and variants. (B) The bar plots indicate the luminescence
intensity for NL-CTBP1 wild-type and variants, and the fluorescence intensity for
the mCit-fused partners, wild-type and variants.
Figure 5.7: Expression of wild-type and variant protein pairs during
BRET saturation assay. The bar plots indicate the luminescence intensity for
NL-REPS1 wild-type and variants of TRAPPC2L, the luminescence intensity for
NL-SPOP wild-type and variants, and the fluorescence intensity for mCit-MYD88
partners wild-type and variants.
Bibliography
Adzhubei, Ivan A, Steffen Schmidt, Leonid Peshkin, Vasily E Ramensky, Anna
Gerasimova, Peer Bork, Alexey S Kondrashov, and Shamil R Sunyaev (2010).
“A method and server for predicting damaging missense mutations.” In: Nature
Methods 7.4, pp. 248–249. doi: 10.1038/nmeth0410-248.
Akdel, Mehmet, Douglas E V Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O
Zalevsky, Bálint Mészáros, Patrick Bryant, Lydia L Good, Roman A Laskowski,
Gabriele Pozzati, Aditi Shenoy, Wensi Zhu, Petras Kundrotas, Victoria Ruiz
Serra, Carlos H M Rodrigues, Alistair S Dunham, David Burke, Neera Borkakoti,
Sameer Velankar, Adam Frost, Jérôme Basquin, Kresten Lindorff-Larsen, Alex
Bateman, Andrey V Kajava, Alfonso Valencia, Sergey Ovchinnikov, Janani
Durairaj, David B Ascher, Janet M Thornton, Norman E Davey, Amelie Stein,
Arne Elofsson, Tristan I Croll, and Pedro Beltrao (2022). “A structural
biology community assessment of AlphaFold2 applications.” In: Nature Structural &
Molecular Biology 29.11, pp. 1056–1067. issn: 1545-9993. doi:
10.1038/s41594-022-00849-w.
Alberts, Bruce, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and
Peter Walter (2002). Molecular Biology of the Cell. Garland Science. isbn: 0-
8153-3218-1, 0-8153-4072-9.
Apic, G, J Gough, and S A Teichmann (2001). “Domain combinations in archaeal,
eubacterial and eukaryotic proteomes.” In: Journal of Molecular Biology 310.2,
pp. 311–325. doi: 10.1006/jmbi.2001.4776.
Arimura, T, T Nakamura, S Hiroi, M Satoh, M Takahashi, N Ohbuchi, K Ueda, T
Nouchi, N Yamaguchi, J Akai, A Matsumori, S Sasayama, and A Kimura (2000).
“Characterization of the human nebulette gene: a polymorphism in an
actin-binding motif is associated with nonfamilial idiopathic dilated cardiomyopathy.”
In: Human Genetics 107.5, pp. 440–451. doi: 10.1007/s004390000389.
Babu, M Madan, Richard W Kriwacki, and Rohit V Pappu (2012). “Structural
biology. Versatility from protein disorder.” In: Science 337.6101, pp. 1460–1461.
doi: 10.1126/science.1228775.
Babu, M Madan, Robin van der Lee, Natalia Sanchez de Groot, and Jörg Gsponer
(2011). “Intrinsically disordered proteins: regulation and disease.” In: Current
Opinion in Structural Biology 21.3, pp. 432–440. doi: 10.1016/j.sbi.2011.03.011.
Bagowski, Christoph P., Wouter Bruins, and Aartjan J. W. Te Velthuis (2010). “The
nature of protein domain evolution: shaping the interaction network”. In: Current
Genomics 11.5, pp. 368–376. doi: 10.2174/138920210791616725.
Berg, J M and H A Godwin (1997). “Lessons from zinc-binding peptides.” In: Annual
review of biophysics and biomolecular structure 26, pp. 357–371. doi:
10.1146/annurev.biophys.26.1.357.
Björklund, Asa K, Diana Ekman, Sara Light, Johannes Frey-Skött, and Arne
Elofsson (2005). “Domain rearrangements in protein evolution.” In: Journal of
Molecular Biology 353.4, pp. 911–923. doi: 10.1016/j.jmb.2005.08.067.
Blake, C. C. F., D. F. Koenig, G. A. Mair, A. C. T. North, D. C. Phillips, and V. R.
Sarma (1965). “Structure of Hen Egg-White Lysozyme: A Three-dimensional
Fourier Synthesis at 2 Å Resolution”. In: Nature 206, pp. 757–761. doi:
10.1038/206757a0.
Braun, Pascal, Murat Tasan, Matija Dreze, Miriam Barrios-Rodiles, Irma Lemmens,
Haiyuan Yu, Julie M Sahalie, Ryan R Murray, Luba Roncari, Anne-Sophie de
Smet, Kavitha Venkatesan, Jean-François Rual, Jean Vandenhaute, Michael E
Cusick, Tony Pawson, David E Hill, Jan Tavernier, Jeffrey L Wrana, Frederick
P Roth, and Marc Vidal (2009). “An experimentally derived confidence score
for binary protein-protein interactions.” In: Nature Methods 6.1, pp. 91–97. doi:
10.1038/nmeth.1281.
Bulman, D E, S B Gangopadhyay, K G Bebchuck, R G Worton, and P N Ray
(1991). “Point mutation in the human dystrophin gene: identification through
western blot analysis.” In: Genomics 10.2, pp. 457–460. doi:
10.1016/0888-7543(91)90332-9.
Bystroff, Christopher and Anders Krogh (2008). “Hidden Markov Models for
Prediction of Protein Features”. In: Methods in Molecular Biology. Vol. 413. MIMB.
Humana Press, pp. 173–198. doi: 10.1007/978-1-59745-582-4_12.
Campbell, Iain Donald and Martin Baron (1991). “The structure and function of
protein modules”. In: Philosophical Transactions of the Royal Society of London.
Series B: Biological Sciences 332.1263, pp. 199–203. issn: 0962-8436. doi:
10.1098/rstb.1991.0045.
Chaisson, Mark J P, Ashley D Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky,
Tobias Rausch, Eugene J Gardner, Oscar L Rodriguez, Li Guo, Ryan L Collins, Xian
Fan, Jia Wen, Robert E Handsaker, Susan Fairley, Zev N Kronenberg, Xiangmeng
Kong, Fereydoun Hormozdiari, Dillon Lee, Aaron M Wenger, Alex R Hastie,
Danny Antaki, Thomas Anantharaman, Peter A Audano, Harrison Brand,
Stuart Cantsilieris, Han Cao, Eliza Cerveira, Chong Chen, Xintong Chen,
Chen-Shan Chin, Zechen Chong, Nelson T Chuang, Christine C Lambert, Deanna M
Church, Laura Clarke, Andrew Farrell, Joey Flores, Timur Galeev, David U
Gorkin, Madhusudan Gujral, Victor Guryev, William Haynes Heaton, Jonas
Korlach, Sushant Kumar, Jee Young Kwon, Ernest T Lam, Jong Eun Lee, Joyce
Lee, Wan-Ping Lee, Sau Peng Lee, Shantao Li, Patrick Marks, Karine Viaud-
Martinez, Sascha Meiers, Katherine M Munson, Fabio C P Navarro, Bradley
J Nelson, Conor Nodzak, Amina Noor, Sofia Kyriazopoulou-Panagiotopoulou,
Andy W C Pang, Yunjiang Qiu, Gabriel Rosanio, Mallory Ryan, Adrian Stütz,
Diana C J Spierings, Alistair Ward, AnneMarie E Welch, Ming Xiao, Wei Xu,
Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley, Ernesto Lowy, Sergei
Yakneen, Steven McCarroll, Goo Jun, Li Ding, Chong Lek Koh, Bing Ren, Paul
Flicek, Ken Chen, Mark B Gerstein, Pui-Yan Kwok, Peter M Lansdorp, Gabor T
Marth, Jonathan Sebat, Xinghua Shi, Ali Bashir, Kai Ye, Scott E Devine, Michael
E Talkowski, Ryan E Mills, Tobias Marschall, Jan O Korbel, Evan E Eichler, and
Charles Lee (2019). “Multi-platform discovery of haplotype-resolved structural
variation in human genomes.” In: Nature Communications 10.1, p. 1784. issn:
2041-1723. doi: 10.1038/s41467-018-08148-z.
Chen, C F, S Li, Y Chen, P L Chen, Z D Sharp, and W H Lee (1996). “The nuclear
localization sequences of the BRCA1 protein interact with the importin-alpha
subunit of the nuclear transport signal receptor.” In: The Journal of Biological
Chemistry 271.51, pp. 32863–32868. doi: 10.1074/jbc.271.51.32863.
Chen, Siwei, Robert Fragoza, Lambertus Klei, Yuan Liu, Jiebiao Wang, Kathryn
Roeder, Bernie Devlin, and Haiyuan Yu (2018). “An interactome perturbation
framework prioritizes damaging missense mutations for developmental disor-
ders.” In: Nature Genetics 50.7, pp. 1032–1040. issn: 1061-4036. doi: 10.1038/
s41588-018-0130-z.
Cheng, Jun, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Ap-
plebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant,
Rosalia G Schneider, Andrew W Senior, John Jumper, Demis Hassabis, Push-
meet Kohli, and Žiga Avsec (2023). “Accurate proteome-wide missense variant ef-
fect prediction with AlphaMissense.” In: Science 381.6664, eadg7492. issn: 0036-
8075. doi: 10.1126/science.adg7492.
Chien, C T, P L Bartel, R Sternglanz, and S Fields (1991). “The two-hybrid system: a method to
identify and clone genes for proteins that interact with a protein of interest”. In:
Proceedings of the National Academy of Sciences U S A 88.21, pp. 9578–9582.
doi: 10.1073/pnas.88.21.9578.
Choi, Soon Gang, Julien Olivet, Patricia Cassonnet, Pierre-Olivier Vidalain, Katja Luck, Luke
Lambourne, Kerstin Spirohn, Irma Lemmens, Mélanie Dos Santos, Caroline De-
meret, Louis Jones, Sudharshan Rangarajan, Wenting Bian, Eloi P Coutant,
Yves L Janin, Sylvie van der Werf, Philipp Trepte, Erich E Wanker, Javier De
Las Rivas, Jan Tavernier, Jean-Claude Twizere, Tong Hao, David E Hill, Marc
Vidal, Michael A Calderwood, and Yves Jacob (2019). “Maximizing binary inter-
actome mapping with a minimal number of assays.” In: Nature Communications
10.1, p. 3907. doi: 10.1038/s41467-019-11809-2.
ClinVar Miner (2024). ClinVar Miner. Accessed: 2024-09-05.
1000 Genomes Project Consortium, Adam Auton, Lisa D Brooks, Richard M
Durbin, Erik P Garrison, Hyun Min Kang, Jan O Korbel, Jonathan L Marchini,
Shane McCarthy, Gil A McVean, and Gonçalo R Abecasis (2015). “A global
reference for human genetic variation.” In: Nature 526.7571, pp. 68–74. issn:
0028-0836. doi: 10.1038/nature15393.
Copley, Richard R, Tobias Doerks, Ivica Letunic, and Peer Bork (2002). “Protein do-
main analysis in the era of complete genomes.” In: FEBS Letters 513.1, pp. 129–
134. doi: 10.1016/s0014-5793(01)03289-6.
Davey, Norman E, M Madan Babu, Martin Blackledge, Alan Bridge, Salvador
Capella-Gutierrez, Zsuzsanna Dosztanyi, Rachel Drysdale, Richard J Edwards,
Arne Elofsson, Isabella C Felli, Toby J Gibson, Aleksandras Gutmanas, John M
Hancock, Jen Harrow, Desmond Higgins, Cy M Jeffries, Philippe Le Mercier,
Balint Mészáros, Marco Necci, Cedric Notredame, Sandra Orchard, Christos A
Ouzounis, Rita Pancsa, Elena Papaleo, Roberta Pierattelli, Damiano Piovesan,
Vasilis J Promponas, Patrick Ruch, Gabriella Rustici, Pedro Romero, Sirarat
Sarntivijai, Gary Saunders, Benjamin Schuler, Malvika Sharan, Denis C Shields,
Joel L Sussman, Jonathan A Tedds, Peter Tompa, Michael Turewicz, Jiri Von-
drasek, Wim F Vranken, Bonnie Ann Wallace, Kanin Wichapong, and Silvio C
E Tosatto (2019). “An intrinsically disordered proteins community for ELIXIR.”
In: F1000Research 8. doi: 10.12688/f1000research.20136.1.
Davey, Norman E, Martha S Cyert, and Alan M Moses (2015). “Short linear motifs -
ex nihilo evolution of protein regulation.” In: Cell Communication and Signaling
13, p. 43. doi: 10.1186/s12964-015-0120-z.
Davey, Norman E, Niall J Haslam, Denis C Shields, and Richard J Edwards (2011).
“SLiMSearch 2.0: biological context for short linear motifs in proteins.” In: Nu-
cleic Acids Research 39.Web Server issue, W56–60. doi: 10.1093/nar/gkr402.
Davey, Norman E, Kim Van Roey, Robert J Weatheritt, Grischa Toedt, Bora Uyar,
Brigitte Altenberg, Aidan Budd, Francesca Diella, Holger Dinkel, and Toby J
Gibson (2012). “Attributes of short linear motifs.” In: Molecular Biosystems 8.1,
pp. 268–281. doi: 10.1039/c1mb05231d.
Dhanoa, Bajinder S, Tiziana Cogliati, Akhila G Satish, Elspeth A Bruford, and
James S Friedman (2013). “Update on the Kelch-like (KLHL) gene family.” In:
Human genomics 7, p. 13. doi: 10.1186/1479-7364-7-13.
Dill, Ken A and Justin L MacCallum (2012). “The protein-folding problem, 50 years
on.” In: Science 338.6110, pp. 1042–1046. doi: 10.1126/science.1219021.
Ding, Boxiao, Fang Yuan, Priyadarshan K Damle, Larisa Litovchick, Ronny Drapkin, and
Steven R Grossman (2020). “CtBP determines ovarian cancer cell fate through
repression of death receptors.” In: Cell death & disease 11.4, p. 286. doi: 10.
1038/s41419-020-2455-7.
Doolittle, Russell F. (1995). “The Multiplicity of Domains in Proteins”. In: Annual
Review of Biochemistry 64, pp. 287–314. doi: 10.1146/annurev.bi.64.070195.
001443.
Dosztányi, Zsuzsanna, Veronika Csizmók, Péter Tompa, and István Simon (2005). “IUPred: web server for
the prediction of intrinsically unstructured regions of proteins based on esti-
mated energy content.” In: Bioinformatics 21.16, pp. 3433–3434. doi: 10.1093/
bioinformatics/bti541.
Dosztányi, Zsuzsanna (2018). “Prediction of protein disorder based on IUPred.” In:
Protein Science 27.1, pp. 331–340. doi: 10.1002/pro.3334.
Dragulescu-Andrasi, Anca, Carmel T Chan, Abhijit De, Tarik F Massoud, and Sanjiv S
Gambhir (2011). “Bioluminescence resonance energy transfer (BRET) imaging
of protein-protein interactions within deep tissues of living subjects.” In: Pro-
ceedings of the National Academy of Sciences of the United States of America
108.29, pp. 12060–12065. issn: 1091-6490. doi: 10.1073/pnas.1100923108.
Dunker, A. Keith, Celeste J. Brown, and Zoran Obradovic (2002). “Identification
and functions of usefully disordered proteins”. In: Advances in Protein Chemistry
62, pp. 25–49. doi: 10.1016/S0065-3233(02)62004-3.
Dyson, H Jane and Peter E Wright (2005). “Intrinsically unstructured proteins and their functions.”
In: Nature Reviews. Molecular Cell Biology 6.3, pp. 197–208. doi: 10.1038/nrm1589.
Edwards, Richard J and Nicolas Palopoli (2014). “Computational Prediction of Short Linear
Motifs from Protein Sequences”. In: Computational Peptidology. Vol. 1268. Meth-
ods in Molecular Biology. Humana Press, pp. 89–141. doi: 10.1007/978-1-
4939-2285-7_5.
Felli, Isabella C. and Roberta Pierattelli (2015). Intrinsically Disordered Proteins
Studied by NMR Spectroscopy. Springer. isbn: 978-3-319-20197-9. doi: 10.1007/
978-3-319-20198-6.
Fields, Stanley and Ok-kyu Song (1989). “A novel genetic system to detect protein–protein interactions”.
In: Nature 340, pp. 245–246. doi: 10.1038/340245a0.
Filograna, Angela, Stefano De Tito, Matteo Lo Monte, Rosario Oliva, Francesca Bruzzese,
Maria Serena Roca, Antonella Zannetti, Adelaide Greco, Daniela Spano, In-
maculada Ayala, Assunta Liberti, Luigi Petraccone, Nina Dathan, Giuliana
Catara, Laura Schembri, Antonino Colanzi, Alfredo Budillon, Andrea Rosario
Beccari, Pompea Del Vecchio, Alberto Luini, Daniela Corda, and Carmen Va-
lente (2024). “Identification and characterization of a new potent inhibitor tar-
geting CtBP1/BARS in melanoma cells.” In: Journal of Experimental & Clinical
Cancer Research 43.1, p. 137. doi: 10.1186/s13046-024-03044-5.
Finn, Robert D, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne E Pollington,
O Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristoffer Forslund, Liisa
Holm, Erik L L Sonnhammer, Sean R Eddy, and Alex Bateman (2010). “The
Pfam protein families database.” In: Nucleic Acids Research 38.Database issue,
pp. D211–22. doi: 10.1093/nar/gkp985.
Finn, Robert D, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y Eber-
hardt, Sean R Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina
Mistry, Erik L L Sonnhammer, John Tate, and Marco Punta (2014). “Pfam:
the protein families database.” In: Nucleic Acids Research 42.Database issue,
pp. D222–30. doi: 10.1093/nar/gkt1223.
Forbes, Simon A, Nidhi Bindal, Sally Bamford, Charlotte Cole, Chai Yin Kok, David Beare,
Mingming Jia, Rebecca Shepherd, Kenric Leung, Andrew Menzies, Jon W
Teague, Peter J Campbell, Michael R Stratton, and P Andrew Futreal (2011).
“COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mu-
tations in Cancer.” In: Nucleic Acids Research 39.Database issue, pp. D945–50.
doi: 10.1093/nar/gkq929.
Fouassier, L, C C Yun, J G Fitz, and R B Doctor (2000). “Evidence for ezrin-radixin-
moesin-binding phosphoprotein 50 (EBP50) self-association through PDZ-PDZ
interactions.” In: The Journal of Biological Chemistry 275.32, pp. 25039–25045.
doi: 10.1074/jbc.C000092200.
Fragoza, Robert, Jishnu Das, Shayne D Wierbowski, Jin Liang, Tina N Tran, Siqi
Liang, Juan F Beltran, Christen A Rivera-Erick, Kaixiong Ye, Ting-Yi Wang,
Li Yao, Matthew Mort, Peter D Stenson, David N Cooper, Xiaomu Wei, Alon
Keinan, John C Schimenti, Andrew G Clark, and Haiyuan Yu (2019). “Extensive
disruption of protein interactions by genetic variants across the allele frequency
spectrum in human populations.” In: Nature Communications 10.1, p. 4141. doi:
10.1038/s41467-019-11959-3.
Freedman, M. H. and M. Sela (1966). “Recovery of antigenic activity upon reoxi-
dation of completely reduced polyalanyl rabbit immunoglobulin G”. In: J. Biol.
Chem. 241.10, pp. 2383–2396.
Geist, Johanna L, Chop Yan Lee, Joelle Morgan Strom, José de Jesús Naveja, and Katja Luck
(2024). “Generation of a high confidence set of domain-domain interface types to
guide protein complex structure predictions by AlphaFold.” In: Bioinformatics.
doi: 10.1093/bioinformatics/btae482.
Gilmore, T D (2006). “Introduction to NF-kappaB: players, pathways, perspectives.”
In: Oncogene 25.51, pp. 6680–6684. doi: 10.1038/sj.onc.1209954.
Glover, J N Mark, R Scott Williams, and Megan S Lee (2004). “Interactions between BRCT repeats and
phosphoproteins: tangled up in two.” In: Trends in Biochemical Sciences 29.11, pp. 579–585.
doi: 10.1016/j.tibs.2004.09.010.
gnomAD (2024). Genome Aggregation Database (gnomAD). Accessed: 2024-09-05.
Goh, Kwang-Il, Michael E Cusick, David Valle, Barton Childs, Marc Vidal, and
Albert-László Barabási (2007). “The human disease network.” In: Proceedings
of the National Academy of Sciences of the United States of America 104.21,
pp. 8685–8690. issn: 0027-8424. doi: 10.1073/pnas.0701361104.
Gouw, Marc, Hugo Sámano-Sánchez, Kim Van Roey, Francesca Diella, Toby J Gib-
son, and Holger Dinkel (2017). “Exploring short linear motifs using the ELM
database and tools.” In: Current Protocols in Bioinformatics 58, pp. 8.22.1–
8.22.35. doi: 10.1002/cpbi.26.
Gouw, Marc, Sushama Michael, Hugo Sámano-Sánchez, Manjeet Kumar, András Zeke, Benjamin Lang,
Benoit Bely, Lucía B Chemes, Norman E Davey, Ziqi Deng, Francesca Diella,
Clara-Marie Gürth, Ann-Kathrin Huber, Stefan Kleinsorg, Lara S Schlegel,
Nicolás Palopoli, Kim V Roey, Brigitte Altenberg, Attila Reményi, Holger
Dinkel, and Toby J Gibson (2018). “The eukaryotic linear motif resource - 2018
update.” In: Nucleic Acids Research 46.D1, pp. D428–D434. issn: 0305-1048.
doi: 10.1093/nar/gkx1077.
Goyet, Elise, Nathalie Bouquier, Vincent Ollendorff, and Julie Perroy (2016). “Fast
and high resolution single-cell BRET imaging”. In: Scientific Reports 6, Article
28231. doi: 10.1038/srep28231.
Grozinger, C M and S L Schreiber (2000). “Regulation of histone deacetylase 4
and 5 and transcriptional activity by 14-3-3-dependent cellular localization.” In:
Proceedings of the National Academy of Sciences of the United States of America
97.14, pp. 7835–7840. doi: 10.1073/pnas.140199597.
Grünberg, Raik, Julia V Burnier, Tony Ferrar, Violeta Beltran-Sastre, François
Stricher, Almer M van der Sloot, Raquel Garcia-Olivas, Arrate Mallabiabarrena,
Xavier Sanjuan, Timo Zimmermann, and Luis Serrano (2013). “Engineering of
weak helper interactions for high-efficiency FRET probes”. In: Nature Methods
10.10, pp. 1021–1027. doi: 10.1038/nmeth.2625.
Gupta, Vandana A and Alan H Beggs (2014). “Kelch proteins: emerging roles in
skeletal muscle development and diseases.” In: Skeletal muscle [electronic re-
source] 4, p. 11. doi: 10.1186/2044-5040-4-11.
Hall, Mary P, James Unch, Brock F Binkowski, Michael P Valley, Braeden L Butler, Monika
G Wood, Paul Otto, Kristopher Zimmerman, Gediminas Vidugiris, Thomas
Machleidt, Matthew B Robers, Hélène A Benink, Christopher T Eggers, Michael
R Slater, Poncho L Meisenheimer, Dieter H Klaubert, Frank Fan, Lance P En-
cell, and Keith V Wood (2012). “Engineered luciferase reporter from a deep sea
shrimp utilizing a novel imidazopyrazinone substrate.” In: ACS Chemical Biology
7.11, pp. 1848–1857. doi: 10.1021/cb3002478.
Hamosh, Ada, Alan F Scott, Joanna S Amberger, Carol A Bocchini, and Victor A McKusick
(2005). “Online Mendelian Inheritance in Man (OMIM), a knowledgebase
of human genes and genetic disorders.” In: Nucleic Acids Research 33.Database
issue, pp. D514–7. doi: 10.1093/nar/gki033.
Han, J.-D. J., N. Bertin, T. Hao, D. S. Goldberg, G. F. Berriz, L. V. Zhang, D.
Dupuy, A. J. M. Walhout, M. E. Cusick, F. P. Roth, and M. Vidal (2004).
“Evidence for dynamically organized modularity in the yeast protein–protein interaction
network”. In: Nature 430.6995, pp. 88–93. doi: 10.1038/nature02654.
Harris, B Z and W A Lim (2001). “Mechanism and role of PDZ domains in signaling
complex assembly.” In: Journal of Cell Science 114.Pt 18, pp. 3219–3231. doi:
10.1242/jcs.114.18.3219.
Hayden, Matthew S and Sankar Ghosh (2008). “Shared principles in NF-kappaB
signaling.” In: Cell 132.3, pp. 344–362. doi: 10.1016/j.cell.2008.01.020.
Holmstrom, Erik D and David J Nesbitt (2016). “Biophysical Insights from
Temperature-Dependent Single-Molecule Förster Resonance Energy Transfer.”
In: Annual review of physical chemistry 67, pp. 441–465. doi: 10.1146/annurev-
physchem-040215-112544.
Hsu, Lih-Ching (2007). “Identification and functional characterization of a PP1-
binding site in BRCA1.” In: Biochemical and Biophysical Research Communica-
tions 360.2, pp. 507–512. doi: 10.1016/j.bbrc.2007.06.090.
Huttlin, Edward L., Raphael J. Bruckner, Joao A. Paulo, Joe R. Cannon, Lily Ting,
Kurt Baltier, Greg Colby, Fana Gebreab, Melanie P. Gygi, Hannah Parzen, John
Szpyt, Stanley Tam, Gabriela Zarraga, Laura Pontano-Vaites, Sharan Swarup,
Anne E. White, Devin K. Schweppe, Ramin Rad, Brian K. Erickson, Robert A.
Obar, K. G. Guruharsha, Kejie Li, Spyros Artavanis-Tsakonas, Steven P. Gygi,
and J. Wade Harper (2017). “Architecture of the human interactome defines
protein communities and disease networks”. In: Nature 545.7655, pp. 505–509.
issn: 0028-0836. doi: 10.1038/nature22366.
Huttlin, Edward L., Raphael J. Bruckner, Jose Navarrete-Perea, Joe R. Cannon,
Kurt Baltier, Fana Gebreab, Melanie P. Gygi, Alexandra Thornock, Gabriela Zarraga,
Stanley Tam, et al. (2021). “Dual proteome-scale networks reveal cell-specific
remodeling of the human interactome”. In: Cell 184.11, 3022–3040.e28. doi: 10.
1016/j.cell.2021.04.011.
Iakoucheva, Lilia M, Celeste J Brown, J David Lawson, Zoran Obradović, and A
Keith Dunker (2002). “Intrinsic disorder in cell-signaling and cancer-associated
proteins.” In: Journal of Molecular Biology 323.3, pp. 573–584. issn: 0022-2836.
doi: 10.1016/s0022-2836(02)00969-5.
Idrees, Sobia and Keshav Raj Paudel (2024). “Proteome-wide assessment of hu-
man interactome as a source of capturing domain-motif and domain-domain
interactions.” In: Journal of cell communication and signaling 18.1, e12014. doi:
10.1002/ccs3.12014.
Ingham, Robert J, Karen Colwill, Caley Howard, Sabine Dettwiler, Caesar S H Lim, Joanna
Yu, Kadija Hersi, Judith Raaijmakers, Gerald Gish, Geraldine Mbamalu, Lorne
Taylor, Benny Yeung, Galina Vassilovski, Manish Amin, Fu Chen, Liudmila
Matskova, Gösta Winberg, Ingemar Ernberg, Rune Linding, Paul O’donnell,
Andrei Starostine, Walter Keller, Pavel Metalnikov, Chris Stark, and Tony Paw-
son (2005). “WW domains provide a platform for the assembly of multipro-
tein networks.” In: Molecular and Cellular Biology 25.16, pp. 7092–7106. doi:
10.1128/MCB.25.16.7092-7106.2005.
Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov,
Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek,
Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andrew J
Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub
Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy,
Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer,
Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W Senior, Koray
Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis (2021). “Highly accurate
protein structure prediction with AlphaFold.” In: Nature 596.7873, pp. 583–589.
issn: 0028-0836. doi: 10.1038/s41586-021-03819-2.
Karczewski, Konrad J, Laurent C Francioli, Grace Tiao, Beryl B Cummings, Jessica Alföldi,
Qingbo Wang, Ryan L Collins, Kristen M Laricchia, Andrea Ganna, Daniel P
Birnbaum, Laura D Gauthier, Harrison Brand, Matthew Solomonson, Nicholas
A Watts, Daniel Rhodes, Moriel Singer-Berk, Eleina M England, Eleanor G
Seaby, Jack A Kosmicki, Raymond K Walters, Katherine Tashman, Yossi Far-
joun, Eric Banks, Timothy Poterba, Arcturus Wang, Cotton Seed, Nicola Whif-
fin, Jessica X Chong, Kaitlin E Samocha, Emma Pierce-Hoffman, Zachary Zap-
pala, Anne H O’Donnell-Luria, Eric Vallabh Minikel, Ben Weisburd, Monkol
Lek, James S Ware, Christopher Vittal, Irina M Armean, Louis Bergelson, Kris-
tian Cibulskis, Kristen M Connolly, Miguel Covarrubias, Stacey Donnelly, Steven
Ferriera, Stacey Gabriel, Jeff Gentry, Namrata Gupta, Thibault Jeandet, Diane
Kaplan, Christopher Llanwarne, Ruchi Munshi, Sam Novod, Nikelle Petrillo,
David Roazen, Valentin Ruano-Rubio, Andrea Saltzman, Molly Schleicher, Jose
Soto, Kathleen Tibbetts, Charlotte Tolonen, Gordon Wade, Michael E Talkowski,
Genome Aggregation Database Consortium, Benjamin M Neale, Mark J Daly,
and Daniel G MacArthur (2020). “The mutational constraint spectrum quanti-
fied from variation in 141,456 humans.” In: Nature 581.7809, pp. 434–443. issn:
0028-0836. doi: 10.1038/s41586-020-2308-7.
Kim, Jiho and Regis Grailhe (2016). “Nanoluciferase signal brightness using furi-
mazine substrates opens bioluminescence resonance energy transfer to widefield
microscopy”. In: Cytometry Part A. doi: 10.1002/cyto.a.22870.
Kim, Jiho and Regis Grailhe (2024). “Nanoluciferase signal brightness using furimazine substrates opens biolu-
minescence resonance energy transfer to widefield microscopy”. In: Brief Report.
Free Access.
Kim, Mi-Sung, M Waseem Akhtar, Megumi Adachi, Melissa Mahgoub, Rhonda
Bassel-Duby, Ege T Kavalali, Eric N Olson, and Lisa M Monteggia (2012).
“An essential role for histone deacetylase 4 in synaptic plasticity and mem-
ory formation.” In: The Journal of Neuroscience 32.32, pp. 10879–10886. doi:
10.1523/JNEUROSCI.2089-12.2012.
Klug, Aaron (2010). “The discovery of zinc fingers and their applications in gene
regulation and genome manipulation.” In: Annual Review of Biochemistry 79,
pp. 213–231. doi: 10.1146/annurev-biochem-010909-095056.
Kobayashi, Hiroyuki, Louis-Philippe Picard, Anne-Marie Schönegge, and Michel
Bouvier (2019). “Bioluminescence resonance energy transfer-based imaging of
protein-protein interactions in living cells.” In: Nature Protocols 14.4, pp. 1084–
1107. doi: 10.1038/s41596-019-0129-7.
Koipally, J and K Georgopoulos (2000). “Ikaros interactions with CtBP reveal a repression
mechanism that is independent of histone deacetylase activity.” In: The
Journal of Biological Chemistry 275.26, pp. 19594–19602. doi: 10.1074/jbc.
M000254200.
Koonin, E V (1996). “Pseudouridine synthases: four families of enzymes containing
a putative uridine-binding motif also conserved in dUTPases and dCTP deami-
nases.” In: Nucleic Acids Research 24.12, pp. 2411–2415. doi: 10.1093/nar/24.
12.2411.
Kornau, H C, L T Schenker, M B Kennedy, and P H Seeburg (1995). “Domain inter-
action between NMDA receptor subunits and the postsynaptic density protein
PSD-95.” In: Science 269.5231, pp. 1737–1740. doi: 10.1126/science.7569905.
Kumar, Manjeet, Sushama Michael, Jesús Alvarado-Valverde, András Zeke, Tamas
Lazar, Juliana Glavina, Eszter Nagy-Kanta, Juan Mac Donagh, Zsofia E Kalman,
Stefano Pascarelli, Nicolas Palopoli, László Dobson, Carmen Florencia Suarez,
Kim Van Roey, Izabella Krystkowiak, Juan Esteban Griffin, Anurag Nagpal,
Rajesh Bhardwaj, Francesca Diella, Bálint Mészáros, Kellie Dean, Norman E
Davey, Rita Pancsa, Lucía B Chemes, and Toby J Gibson (2024). “ELM-the Eukaryotic
Linear Motif resource-2024 update.” In: Nucleic Acids Research 52.D1,
pp. D442–D455. doi: 10.1093/nar/gkad1058.
Lacoste, Jessica, Marzieh Haghighi, Shahan Haider, Zhen-Yuan Lin, Dmitri Segal,
Chloe Reno, Wesley Wei Qian, Xueting Xiong, Hamdah Shafqat-Abbasi, Pearl V
Ryder, Rebecca Senft, Beth A Cimini, Frederick P Roth, Michael Calderwood,
David Hill, Marc Vidal, S Stephen Yi, Nidhi Sahni, Jian Peng, Anne-Claude Gin-
gras, Shantanu Singh, Anne E Carpenter, and Mikko Taipale (2023). “Pervasive
mislocalization of pathogenic coding variants underlying human disorders.” In:
BioRxiv. doi: 10.1101/2023.09.05.556368.
Landrum, Melissa J, Jennifer M Lee, Mark Benson, Garth Brown, Chen Chao,
Shanmuga Chitipiralla, Baoshan Gu, Jennifer Hart, Douglas Hoffman, Jeffrey
Hoover, Wonhee Jang, Kenneth Katz, Michael Ovetsky, George Riley, Amanjeev
Sethi, Ray Tully, Ricardo Villamarin-Salomon, Wendy Rubinstein, and Donna R
Maglott (2016). “ClinVar: public archive of interpretations of clinically relevant
variants.” In: Nucleic Acids Research 44.D1, pp. D862–8. doi: 10.1093/nar/
gkv1222.
Lee, Ho-Jin and Jie J Zheng (2010). “PDZ domains and their binding partners: structure, specificity, and
modification.” In: Cell Communication and Signaling 8, p. 8. doi: 10.1186/1478-811X-8-8.
Lee, Chop Yan, Dalmira Hubrich, Julia K Varga, Christian Schäfer, Mareen Welzel, Eric Schumbera,
Milena Djokic, Joelle M Strom, Jonas Schönfeld, Johanna L Geist, Feyza Polat, Toby J Gibson,
Claudia Isabelle Keller Valsecchi, Manjeet Kumar, Ora Schueler-Furman,
and Katja Luck (2024). “Systematic discovery of protein interaction interfaces
using AlphaFold and experimental validation.” In: Molecular Systems Biology
20.2, pp. 75–97. issn: 1744-4292. doi: 10.1038/s44320-023-00005-6.
Lee, Samuel M, James A Olzmann, Lih-Shen Chin, and Lian Li (2011). “Mutations associated with
Charcot-Marie-Tooth disease cause SIMPLE protein mislocalization and degra-
dation by the proteasome and aggresome-autophagy pathways.” In: Journal of
Cell Science 124.Pt 19, pp. 3319–3331. doi: 10.1242/jcs.087114.
Lek, Monkol, Konrad J Karczewski, Eric V Minikel, Kaitlin E Samocha, Eric Banks,
Timothy Fennell, Anne H O’Donnell-Luria, James S Ware, Andrew J Hill, Beryl
B Cummings, Taru Tukiainen, Daniel P Birnbaum, Jack A Kosmicki, Laramie
E Duncan, Karol Estrada, Fengmei Zhao, James Zou, Emma Pierce-Hoffman,
Joanne Berghout, David N Cooper, Nicole Deflaux, Mark DePristo, Ron Do,
Jason Flannick, Menachem Fromer, Laura Gauthier, Jackie Goldstein, Namrata
Gupta, Daniel Howrigan, Adam Kiezun, Mitja I Kurki, Ami Levy Moonshine,
Pradeep Natarajan, Lorena Orozco, Gina M Peloso, Ryan Poplin, Manuel A Ri-
vas, Valentin Ruano-Rubio, Samuel A Rose, Douglas M Ruderfer, Khalid Shakir,
Peter D Stenson, Christine Stevens, Brett P Thomas, Grace Tiao, Maria T Tusie-
Luna, Ben Weisburd, Hong-Hee Won, Dongmei Yu, David M Altshuler, Diego
Ardissino, Michael Boehnke, John Danesh, Stacey Donnelly, Roberto Elosua,
Jose C Florez, Stacey B Gabriel, Gad Getz, Stephen J Glatt, Christina M Hult-
man, Sekar Kathiresan, Markku Laakso, Steven McCarroll, Mark I McCarthy,
Dermot McGovern, Ruth McPherson, Benjamin M Neale, Aarno Palotie, Shaun
M Purcell, Danish Saleheen, Jeremiah M Scharf, Pamela Sklar, Patrick F Sul-
livan, Jaakko Tuomilehto, Ming T Tsuang, Hugh C Watkins, James G Wil-
son, Mark J Daly, Daniel G MacArthur, and Exome Aggregation Consortium
(2016). “Analysis of protein-coding genetic variation in 60,706 humans.” In: Na-
ture 536.7616, pp. 285–291. issn: 0028-0836. doi: 10.1038/nature19057.
Letunic, Ivica, Supriya Khedkar, and Peer Bork (2021). “SMART: recent up-
dates, new developments and status in 2020.” In: Nucleic Acids Research 49.D1,
pp. D458–D460. doi: 10.1093/nar/gkaa937.
Li, Qinghe, Fei Wang, Qiao Wang, Na Zhang, Jumei Zheng, Maiqing Zheng, Ranran Liu,
Huanxian Cui, Jie Wen, and Guiping Zhao (2020). “SPOP promotes ubiquiti-
nation and degradation of MyD88 to suppress the innate immune response.” In:
PLoS Pathogens 16.5, e1008188. doi: 10.1371/journal.ppat.1008188.
Lievens, Sam, Frank Peelman, Karolien De Bosscher, Irma Lemmens, and Jan Tavernier (2011). “MAPPIT:
a protein interaction toolbox built on insights in cytokine receptor signaling”.
In: Cytokine Growth Factor Reviews 22.5-6, pp. 321–329. doi: 10.1016/j.
cytogfr.2011.11.001.
Lin, Chengqi, Edwin R Smith, Hidehisa Takahashi, Ka Chun Lai, Skylar Martin-Brown, Laurence
Florens, Michael P Washburn, Joan W Conaway, Ronald C Conaway,
and Ali Shilatifard (2010). “AFF4, a component of the ELL/P-TEFb elongation
complex and a shared subunit of MLL chimeras, can link transcription elonga-
tion to leukemia.” In: Molecular Cell 37.3, pp. 429–437. issn: 1097-4164. doi:
10.1016/j.molcel.2010.01.026.
Livesey, Benjamin J and Joseph A Marsh (2022). “Interpreting protein variant ef-
fects with computational predictors and deep mutational scanning.” In: Disease
Models & Mechanisms 15.6. doi: 10.1242/dmm.049510.
Luck, Katja, Sebastian Charbonnier, and Gilles Travé (2012). “The emerging con-
tribution of sequence context to the specificity of protein interactions mediated
by PDZ domains”. In: FEBS Letters 586.17, pp. 2648–2661. doi: 10.1016/j.
febslet.2012.03.056.
Luck, Katja, Dae-Kyum Kim, Luke Lambourne, Kerstin Spirohn, Bridget E Begg,
Wenting Bian, Ruth Brignall, Tiziana Cafarelli, Francisco J Campos-Laborie,
Benoit Charloteaux, Dongsic Choi, Atina G Coté, Meaghan Daley, Steven Deim-
ling, Alice Desbuleux, Amélie Dricot, Marinella Gebbia, Madeleine F Hardy,
Nishka Kishore, Jennifer J Knapp, István A Kovács, Irma Lemmens, Miles W
Mee, Joseph C Mellor, Carl Pollis, Carles Pons, Aaron D Richardson, Sadie
Schlabach, Bridget Teeking, Anupama Yadav, Mariana Babor, Dawit Balcha,
Omer Basha, Christian Bowman-Colin, Suet-Feung Chin, Soon Gang Choi, Clau-
dia Colabella, Georges Coppin, Cassandra D’Amata, David De Ridder, Steffi
De Rouck, Miquel Duran-Frigola, Hanane Ennajdaoui, Florian Goebels, Liana
Goehring, Anjali Gopal, Ghazal Haddad, Elodie Hatchi, Mohamed Helmy, Yves
Jacob, Yoseph Kassa, Serena Landini, Roujia Li, Natascha van Lieshout, An-
drew MacWilliams, Dylan Markey, Joseph N Paulson, Sudharshan Rangarajan,
John Rasla, Ashyad Rayhan, Thomas Rolland, Adriana San-Miguel, Yun Shen,
Dayag Sheykhkarimli, Gloria M Sheynkman, Eyal Simonovsky, Murat Taşan,
Alexander Tejeda, Vincent Tropepe, Jean-Claude Twizere, Yang Wang, Robert
J Weatheritt, Jochen Weile, Yu Xia, Xinping Yang, Esti Yeger-Lotem, Quan
Zhong, Patrick Aloy, Gary D Bader, Javier De Las Rivas, Suzanne Gaudet,
Tong Hao, Janusz Rak, Jan Tavernier, David E Hill, Marc Vidal, Frederick P
Roth, and Michael A Calderwood (2020). “A reference map of the human binary
protein interactome.” In: Nature 580.7803, pp. 402–408. issn: 0028-0836. doi:
10.1038/s41586-020-2188-x.
Ludes-Meyers, John H, Hyunsuk Kil, Andrzej K Bednarek, Jeff Drake, Mark T Bedford, and
C Marcelo Aldaz (2004). “WWOX binds the specific proline-rich ligand PPXY:
identification of candidate interacting proteins.” In: Oncogene 23.29, pp. 5049–
5055. doi: 10.1038/sj.onc.1207680.
Luo, Zhuojuan, Chengqi Lin, Erin Guest, Alexander S Garrett, Nima Mohaghegh, Selene Swanson,
Stacy Marshall, Laurence Florens, Michael P Washburn, and Ali Shilatifard
(2012). “The super elongation complex family of RNA polymerase II elongation
factors: gene target specificity and transcriptional output.” In: Molecular and
Cellular Biology 32.13, pp. 2608–2617. doi: 10.1128/MCB.00182-12.
Luo, X., Q. He, Y. Huang, and M. S. Sheikh (2005). “Cloning and characteriza-
tion of a p53 and DNA damage down-regulated gene PIQ that codes for a novel
calmodulin-binding IQ motif protein and is up-regulated in gastrointestinal can-
cers”. In: Cancer Research 65, pp. 10725–10733.
Martino, Elisa, Sara Chiarugi, Francesco Margheriti, and Gianpiero Garau (2021).
“Mapping, structure and modulation of PPI”. In: Frontiers in Chemistry 9,
p. 718405. doi: 10.3389/fchem.2021.718405.
Melhuish, T A and D Wotton (2000). “The interaction of the carboxyl terminus-binding protein
with the Smad corepressor TGIF is disrupted by a holoprosencephaly mutation
in TGIF.” In: The Journal of Biological Chemistry 275.50, pp. 39762–39766.
doi: 10.1074/jbc.C000416200.
Mészáros, Bálint, István Simon, and Zsuzsanna Dosztányi (2009). “Prediction of
protein binding regions in disordered proteins.” In: PLoS Computational Biology
5.5, e1000376. doi: 10.1371/journal.pcbi.1000376.
Meyer, Katrina, Marieluise Kirchner, Bora Uyar, Jing-Yuan Cheng, Giulia Russo, Luis R
Hernandez-Miranda, Anna Szymborska, Henrik Zauber, Ina-Maria Rudolph,
Thomas E Willnow, Altuna Akalin, Volker Haucke, Holger Gerhardt, Carmen
Birchmeier, Ralf Kühn, Michael Krauss, Sebastian Diecke, Juan M Pascual, and
Matthias Selbach (2018). “Mutations in disordered regions can cause disease
by creating dileucine motifs.” In: Cell 175.1, 239–253.e17. issn: 00928674. doi:
10.1016/j.cell.2018.08.019.
Mihalič, Filip, Leandro Simonetti, Girolamo Giudice, Marie Rubin Sander, Richard
Lindqvist, Marie Berit Akpiroro Peters, Caroline Benz, Eszter Kassa, Dilip
Badgujar, Raviteja Inturi, Muhammad Ali, Izabella Krystkowiak, Ahmed Sayadi,
Eva Andersson, Hanna Aronsson, Ola Söderberg, Doreen Dobritzsch, Evangelia
Petsalaki, Anna K Överby, Per Jemth, Norman E Davey, and Ylva Ivarsson
(2023). “Large-scale phage-based screening reveals extensive pan-viral mimicry
of host short linear motifs.” In: Nature Communications 14.1, p. 2409. doi:
10.1038/s41467-023-38015-5.
Mosca, Roberto, Arnaud Céol, Amelie Stein, Roger Olivella, and Patrick Aloy
(2014). “3did: a catalog of domain-based interactions of known three-dimensional
structure”. In: Nucleic Acids Research 42.Database issue, pp. D374–D379. doi:
10.1093/nar/gkt887. eprint: 2013Sep29.
Nesta, Alex V, Denisse Tafur, and Christine R Beck (2021). “Hotspots of human
mutation.” In: Trends in Genetics 37.8, pp. 717–729. issn: 01689525. doi: 10.
1016/j.tig.2020.10.003.
Nooren, Irene M. A. and Janet M. Thornton (2003). “Diversity of protein–protein interactions”. In:
The EMBO Journal 22.14, pp. 3486–3492. doi: 10.1093/emboj/cdg359.
Northrop, J. H. (1930). “CRYSTALLINE PEPSIN: I. ISOLATION AND TESTS
OF PURITY”. In: The Journal of General Physiology 13.6, pp. 739–766. doi:
10.1085/jgp.13.6.739.
Oldfield, Christopher J and A Keith Dunker (2014). “Intrinsically disordered proteins
and intrinsically disordered protein regions.” In: Annual Review of Biochemistry
83, pp. 553–584. doi: 10.1146/annurev-biochem-072711-164947.
Oliver, Peter L, Emmanuelle Bitoun, Joanne Clark, Emma L Jones, and Kay E Davies (2004).
“Mediation of Af4 protein function in the cerebellum by Siah proteins.” In: Pro-
ceedings of the National Academy of Sciences of the United States of America
101.41, pp. 14901–14906. doi: 10.1073/pnas.0406196101.
Oxley, Camilla L, Nicholas J Anthis, Edward D Lowe, Ioannis Vakonakis, Iain D Campbell,
and Kate L Wegener (2008). “An integrin phosphorylation switch: the effect of
beta3 integrin tail phosphorylation on Dok1 and talin binding.” In: The Journal
of Biological Chemistry 283.9, pp. 5420–5426. doi: 10.1074/jbc.M709435200.
Paysan-Lafosse, Typhaine, Matthias Blum, Sara Chuguransky, Tiago Grego, Beatriz
Lázaro Pinto, Gustavo A Salazar, Maxwell L Bileschi, Peer Bork, Alan Bridge,
Lucy Colwell, Julian Gough, Daniel H Haft, Ivica Letunić, Aron Marchler-Bauer,
Huaiyu Mi, Darren A Natale, Christine A Orengo, Arun P Pandurangan, Cather-
ine Rivoire, Christian J A Sigrist, Ian Sillitoe, Narmada Thanki, Paul D Thomas,
Silvio C E Tosatto, Cathy H Wu, and Alex Bateman (2023). “InterPro in 2022.”
In: Nucleic Acids Research 51.D1, pp. D418–D427. doi: 10.1093/nar/gkac993.
Peng, Zhenling, Marcin J Mizianty, Bin Xue, Lukasz Kurgan, and Vladimir N Uver-
sky (2012). “More than just tails: intrinsic disorder in histone proteins.” In:
Molecular Biosystems 8.7, pp. 1886–1901. doi: 10.1039/c2mb25102g.
Pennington, K L, T Y Chan, M P Torres, and J L Andersen (2018). “The dy-
namic and stress-adaptive signaling hub of 14-3-3: emerging mechanisms of reg-
ulation and context-dependent protein-protein interactions.” In: Oncogene 37.42,
pp. 5587–5604. doi: 10.1038/s41388-018-0348-3.
Petsalaki, Evangelia, Alexander Stark, Eduardo García-Urdiales, and Robert B Russell (2009).
“Accurate prediction of peptide binding sites on protein surfaces.” In: PLoS Com-
putational Biology 5.3, e1000335. doi: 10.1371/journal.pcbi.1000335.
Pfleger, Kevin D G, Ruth M Seeber, and Karin A Eidne (2006). “Bioluminescence resonance en-
ergy transfer (BRET) for the real-time detection of protein-protein interactions.”
In: Nature Protocols 1.1, pp. 337–345. doi: 10.1038/nprot.2006.52.
Pierce, Michael M., C. S. Raman, and Barry T. Nall (1999). “Isothermal Titration
Calorimetry of Protein–Protein Interactions”. In: Methods 19.2, pp. 213–221. doi:
10.1016/S1046-2023(99)00009-0.
Puntervoll, Pål, Rune Linding, Christine Gemünd, Sophie Chabanis-Davidson, Morten
Mattingsdal, Scott Cameron, David M A Martin, Gabriele Ausiello, Barbara
Brannetti, Anna Costantini, Fabrizio Ferrè, Vincenza Maselli, Allegra Via, Gi-
anni Cesareni, Francesca Diella, Giulio Superti-Furga, Lucjan Wyrwicz, Chenna
Ramu, Caroline McGuigan, Rambabu Gudavalli, Ivica Letunic, Peer Bork,
Leszek Rychlewski, Bernhard Küster, Manuela Helmer-Citterich, William N
Hunter, Rein Aasland, and Toby J Gibson (2003). “ELM server: A new resource
for investigating short functional sites in modular eukaryotic proteins.” In: Nu-
cleic Acids Research 31.13, pp. 3625–3630. doi: 10.1093/nar/gkg545.
Ramirez-Martinez, Andres, Bercin Kutluk Cenik, Svetlana Bezprozvannaya, Beibei
Chen, Rhonda Bassel-Duby, Ning Liu, and Eric N Olson (2017). “KLHL41 sta-
bilizes skeletal muscle sarcomeres by nonproteolytic ubiquitination.” In: eLife 6.
doi: 10.7554/eLife.26439.
Rodríguez, J A and B R Henderson (2000). “Identification of a functional nuclear
export sequence in BRCA1.” In: The Journal of Biological Chemistry 275.49,
pp. 38589–38596. doi: 10.1074/jbc.M003851200.
Rolland, Thomas, Murat Taşan, Benoit Charloteaux, Samuel J Pevzner, Quan
Zhong, Nidhi Sahni, Song Yi, Irma Lemmens, Celia Fontanillo, Roberto Mosca,
Atanas Kamburov, Susan D Ghiassian, Xinping Yang, Lila Ghamsari, Dawit
Balcha, Bridget E Begg, Pascal Braun, Marc Brehme, Martin P Broly, Anne-
Ruxandra Carvunis, Dan Convery-Zupan, Roser Corominas, Jasmin Coulombe-
Huntington, Elizabeth Dann, Matija Dreze, Amélie Dricot, Changyu Fan, Eric
Franzosa, Fana Gebreab, Bryan J Gutierrez, Madeleine F Hardy, Mike Jin, Shuli
Kang, Ruth Kiros, Guan Ning Lin, Katja Luck, Andrew MacWilliams, Jörg
Menche, Ryan R Murray, Alexandre Palagi, Matthew M Poulin, Xavier Ram-
bout, John Rasla, Patrick Reichert, Viviana Romero, Elien Ruyssinck, Julie M
Sahalie, Annemarie Scholz, Akash A Shah, Amitabh Sharma, Yun Shen, Kerstin
Spirohn, Stanley Tam, Alexander O Tejeda, Shelly A Wanamaker, Jean-Claude
Twizere, Kerwin Vega, Jennifer Walsh, Michael E Cusick, Yu Xia, Albert-László
Barabási, Lilia M Iakoucheva, Patrick Aloy, Javier De Las Rivas, Jan Tavernier,
Michael A Calderwood, David E Hill, Tong Hao, Frederick P Roth, and Marc
Vidal (2014). “A proteome-scale map of the human interactome network.” In:
Cell 159.5, pp. 1212–1226. doi: 10.1016/j.cell.2014.10.050.
Sahni, Nidhi, Song Yi, Mikko Taipale, Juan I Fuxman Bass, Jasmin Coulombe-
Huntington, Fan Yang, Jian Peng, Jochen Weile, Georgios I Karras, Yang Wang,
István A Kovács, Atanas Kamburov, Irina Krykbaeva, Mandy H Lam, George
Tucker, Vikram Khurana, Amitabh Sharma, Yang-Yu Liu, Nozomu Yachie, Quan
Zhong, Yun Shen, Alexandre Palagi, Adriana San-Miguel, Changyu Fan, Dawit
Balcha, Amelie Dricot, Daniel M Jordan, Jennifer M Walsh, Akash A Shah, Xin-
ping Yang, Ani K Stoyanova, Alex Leighton, Michael A Calderwood, Yves Jacob,
Michael E Cusick, Kourosh Salehi-Ashtiani, Luke J Whitesell, Shamil Sunyaev,
Bonnie Berger, Albert-László Barabási, Benoit Charloteaux, David E Hill, Tong
Hao, Frederick P Roth, Yu Xia, Albertha J M Walhout, Susan Lindquist, and
Marc Vidal (2015). “Widespread macromolecular interaction perturbations in
human genetic disorders.” In: Cell 161.3, pp. 647–660. doi: 10.1016/j.cell.
2015.04.013.
Sahni, Nidhi, Song Yi, Quan Zhong, Noor Jailkhani, Benoit Charloteaux, Michael E
Cusick, and Marc Vidal (2013). “Edgotype: a fundamental link between genotype
and phenotype.” In: Current Opinion in Genetics & Development 23.6, pp. 649–
657. doi: 10.1016/j.gde.2013.11.002.
Santelli, Eugenio, Marilisa Leone, Chenlong Li, Toru Fukushima, Nicholas E Preece, Arthur J
Olson, Kathryn R Ely, John C Reed, Maurizio Pellecchia, Robert C Liddington,
and Shu-ichi Matsuzawa (2005). “Structural analysis of Siah1-Siah-interacting
protein interactions and insights into the assembly of an E3 ligase multiprotein
complex.” In: The Journal of Biological Chemistry 280.40, pp. 34278–34287. doi:
10.1074/jbc.M506707200.
Schreiber, G, G Haran, and H-X Zhou (2009). “Fundamental aspects of protein-
protein association kinetics.” In: Chemical Reviews 109.3, pp. 839–860. doi: 10.
1021/cr800373w.
Schultz, J., F. Milpetz, P. Bork, and C.P. Ponting (1998). “SMART, a simple mod-
ular architecture research tool: identification of signaling domains”. In: Proceed-
ings of the National Academy of Sciences U.S.A. 95.11, pp. 5857–5864. doi:
10.1073/pnas.95.11.5857.
Sekar, Rajesh Babu and Ammasi Periasamy (2003). “Fluorescence resonance energy
transfer (FRET) microscopy imaging of live cell protein localizations”. In: Journal
of Cell Biology 160.5, pp. 629–633. doi: 10.1083/jcb.200210140.
Shaner, Nathan C., Gerard G. Lambert, Andrew Chammas, Yuhui Ni, Paula J. Cranfill, Michelle
A. Baird, Brittney R. Sell, John R. Allen, Richard N. Day, Maria Israelsson,
Michael W. Davidson, and Jiwu Wang (2013). “A bright monomeric green flu-
orescent protein derived from Branchiostoma lanceolatum”. In: Nature Methods
10, pp. 407–409. doi: 10.1038/nmeth.2413.
Starita, Lea M, Muhtadi M Islam, Tapahsama Banerjee, Aleksandra I Adamovich,
Justin Gullingsrud, Stanley Fields, Jay Shendure, and Jeffrey D Parvin (2018).
“A Multiplex Homology-Directed DNA Repair Assay Reveals the Impact of More
Than 1,000 BRCA1 Missense Substitution Variants on Protein Function.” In:
American Journal of Human Genetics 103.4, pp. 498–508. issn: 00029297. doi:
10.1016/j.ajhg.2018.07.016.
Stogios, Peter J and Gilbert G Privé (2004). “The BACK domain in BTB-kelch
proteins.” In: Trends in Biochemical Sciences 29.12, pp. 634–637. doi: 10.1016/
j.tibs.2004.10.003.
Sunyaev, S R, F Eisenhaber, I V Rodchenkov, B Eisenhaber, V G Tumanyan, and
E N Kuznetsov (1999). “PSIC: profile extraction from sequence alignments with
position-specific counts of independent observations.” In: Protein Engineering
12.5, pp. 387–394. doi: 10.1093/protein/12.5.387.
Tadokoro, Seiji, Sanford J Shattil, Koji Eto, Vera Tai, Robert C Liddington, Jose M
de Pereda, Mark H Ginsberg, and David A Calderwood (2003). “Talin binding
to integrin beta tails: a final common step in integrin activation.” In: Science
302.5642, pp. 103–106. issn: 1095-9203. doi: 10.1126/science.1086652.
Taniguchi, Koji and Michael Karin (2018). “NF-κB, inflammation, immunity and can-
cer: coming of age.” In: Nature Reviews. Immunology 18.5, pp. 309–324. doi:
10.1038/nri.2017.142.
Tompa, Peter (2002). “Intrinsically unstructured proteins.” In: Trends in Biochem-
ical Sciences 27.10, pp. 527–533. doi: 10.1016/s0968-0004(02)02169-2.
Tompa, Peter (2011). “Unstructural biology coming of age.” In: Current Opinion in Structural
Biology 21.3, pp. 419–425. doi: 10.1016/j.sbi.2011.03.012.
Tompa, Peter (2012). “Intrinsically disordered proteins: a 10-year recap”. In: Trends in Bio-
chemical Sciences 37.12, pp. 509–516. doi:
10.1016/j.tibs.2012.08.009.
Tompa, Peter, Norman E Davey, Toby J Gibson, and M Madan Babu (2014). “A mil-
lion peptide motifs for the molecular biologist.” In: Molecular Cell 55.2, pp. 161–
169. doi: 10.1016/j.molcel.2014.05.032.
Trepte, Philipp, Christopher Secker, Soon Gang Choi, Julien Olivet, Eduardo Silva Ramos,
Patricia Cassonnet, Sabrina Golusik, Martina Zenkner, Stephanie Beetz, Marcel
Sperling, Yang Wang, Tong Hao, Kerstin Spirohn, Jean-Claude Twizere, Michael
A. Calderwood, David E. Hill, Yves Jacob, Marc Vidal, and Erich E. Wanker
(2021). “A quantitative mapping approach to identify direct interactions within
complexomes”. In: BioRxiv. doi: 10.1101/2021.08.25.457734.
Trepte, Kruse, Kostova, Hoffmann, Buntru, Tempelmeier, Secker, Diez, Schulz,
Klockmeier, Zenkner, Golusik, Rau, Schnoegl, Garner, and Erich Wanker (2018).
“LuTHy: a double-readout bioluminescence-based two-hybrid technology for
quantitative mapping of protein-protein interactions in mammalian cells.” In:
Molecular Systems Biology 14.7, e8071. doi: 10.15252/msb.20178071.
Uversky, Vladimir N (2014). Intrinsically Disordered Proteins. Switzerland: Springer Interna-
tional Publishing, pp. XV, 61. isbn: 978-3-319-08920-1.
Uversky, Vladimir N, Christopher J Oldfield, and A Keith Dunker (2005). “Showing your ID:
intrinsic disorder as an ID for recognition, regulation and cell signaling.” In:
Journal of Molecular Recognition 18.5, pp. 343–384. doi: 10.1002/jmr.747.
Uyar, Bora, Robert J Weatheritt, Holger Dinkel, Norman E Davey, and Toby J Gib-
son (2014). “Proteome-wide analysis of human disease mutations in short linear
motifs: neglected players in cancer?” In: Molecular Biosystems 10.10, pp. 2626–
2642. doi: 10.1039/c4mb00290c.
Valente, Carmen, Alberto Luini, and Daniela Corda (2013). “Components of the
CtBP1/BARS-dependent fission machinery.” In: Histochemistry and Cell Biology
140.4, pp. 407–421. doi: 10.1007/s00418-013-1138-1.
Van Roey, Kim, Toby J Gibson, and Norman E Davey (2012). “Motif switches:
decision-making in cell regulation.” In: Current Opinion in Structural Biology
22.3, pp. 378–385. doi: 10.1016/j.sbi.2012.03.004.
Van Roey, Kim, Bora Uyar, Robert J Weatheritt, Holger Dinkel, Markus Seiler,
Aidan Budd, Toby J Gibson, and Norman E Davey (2014). “Short linear motifs:
ubiquitous and functionally diverse protein interaction modules directing cell reg-
ulation.” In: Chemical Reviews 114.13, pp. 6733–6778. doi: 10.1021/cr400585q.
Velthuis, Aartjan J W te, Philippe A Sakalis, Donald A Fowler, and Christoph P
Bagowski (2011). “Genome-wide analysis of PDZ domain binding reveals inherent
functional overlap within the PDZ interaction network.” In: Plos One 6.1, e16047.
doi: 10.1371/journal.pone.0016047.
Vidal, Marc, Michael E Cusick, and Albert-László Barabási (2011). “Interactome
networks and human disease.” In: Cell 144.6, pp. 986–998. issn: 1097-4172. doi:
10.1016/j.cell.2011.02.016.
Visscher, Peter M, Matthew A Brown, Mark I McCarthy, and Jian Yang (2012).
“Five years of GWAS discovery.” In: American Journal of Human Genetics 90.1,
pp. 7–24. doi: 10.1016/j.ajhg.2011.11.029.
Vogel, Christine, Carlo Berzuini, Matthew Bashton, Julian Gough, and Sarah A.
Teichmann (Year). “Supra-domains: Evolutionary units larger than single protein
domains”. In: Journal Name Volume.Issue, pages. doi: 10.XXXX/XXXXXX.
Vogel, Steven S, Christopher Thaler, and Srinagesh V Koushik (2006). “Fanciful
FRET”. In: Science’s STKE 2006.331, re2. doi: 10.1126/stke.3312006re2.
Wakeling, Emma, Meriel McEntagart, Michael Bruccoleri, Charles Shaw-Smith,
Karen L Stals, Matthew Wakeling, Angela Barnicoat, Clare Beesley, DDD
Study, Andrea K Hanson-Kahn, Mary Kukolich, David A Stevenson, Philippe M
Campeau, Sian Ellard, Sarah H Elsea, Xiang-Jiao Yang, and Richard C Caswell
(2021). “Missense substitutions at a conserved 14-3-3 binding site in HDAC4
cause a novel intellectual disability syndrome.” In: HGG advances 2.1, p. 100015.
doi: 10.1016/j.xhgg.2020.100015.
Wang, Conan K, Lifeng Pan, Jia Chen, and Mingjie Zhang (2010). “Extensions of PDZ domains as im-
portant structural and functional elements.” In: Protein & cell 1.8, pp. 737–751.
doi: 10.1007/s13238-010-0099-6.
Wang, Jiyao, Farideh Chitsaz, Myra K Derbyshire, Noreen R Gonzales, Marc Gwadz,
Shennan Lu, Gabriele H Marchler, James S Song, Narmada Thanki, Roxanne A
Yamashita, Mingzhang Yang, Dachuan Zhang, Chanjuan Zheng, Christopher J
Lanczycki, and Aron Marchler-Bauer (2023). “The conserved domain database
in 2023.” In: Nucleic Acids Research 51.D1, pp. D384–D388. doi: 10.1093/nar/
gkac1096.
Wang, A H, M J Kruhlak, J Wu, N R Bertos, M Vezmar, B I Posner, D P Bazett-Jones,
and X J Yang (2000). “Regulation of histone deacetylase 4 by binding of 14-
3-3 proteins.” In: Molecular and Cellular Biology 20.18, pp. 6904–6912. doi:
10.1128/MCB.20.18.6904-6912.2000.
Weatheritt, Robert J and Toby J Gibson (2012). “Linear motifs: lost in
(pre)translation.” In: Trends in Biochemical Sciences 37.8, pp. 333–341. doi:
10.1016/j.tibs.2012.05.001.
Wegener, Kate L, Anthony W Partridge, Jaewon Han, Andrew R Pickford, Robert C Lid-
dington, Mark H Ginsberg, and Iain D Campbell (2007). “Structural basis of
integrin activation by talin.” In: Cell 128.1, pp. 171–182. issn: 0092-8674. doi:
10.1016/j.cell.2006.10.048.
Wierbowski, Shayne D., Robert Fragoza, Siqi Liang, and Haiyuan Yu (2018). “Ex-
tracting complementary insights from molecular phenotypes for prioritization
of disease-associated mutations”. In: Current Opinion in Systems Biology 11,
pp. 107–116. issn: 24523100. doi: 10.1016/j.coisb.2018.09.006.
Williams, R S, R Green, and J N Glover (2001). “Crystal structure of the BRCT
repeat region from the breast cancer-associated protein BRCA1.” In: Nature
Structural Biology 8.10, pp. 838–842. issn: 1072-8368. doi: 10.1038/nsb1001-
838.
Wilson, Carter J, Wing-Yiu Choy, and Mikko Karttunen (2022). “Alphafold2: A role
for disordered protein/region prediction?” In: International Journal of Molecular
Sciences 23.9, p. 4591. doi: 10.3390/ijms23094591.
Wright, Peter E and H Jane Dyson (1999). “Intrinsically unstructured proteins: re-assessing the protein
structure-function paradigm.” In: Journal of Molecular Biology 293.2, pp. 321–
331. doi: 10.1006/jmbi.1999.3110.
Wright, Peter E and H Jane Dyson (2015). “Intrinsically disordered proteins in cellular signalling and regulation.” In:
Nature Reviews. Molecular Cell Biology 16.1, pp. 18–29. doi: 10.1038/nrm3920.
Xu, Y, D W Piston, and C H Johnson (1999). “A bioluminescence resonance energy
transfer (BRET) system: application to interacting circadian clock proteins.” In:
Proceedings of the National Academy of Sciences of the United States of America
96.1, pp. 151–156. doi: 10.1073/pnas.96.1.151.
Yuen, Michaela and Coen A C Ottenheijm (2020). “Nebulin: big protein with big
responsibilities.” In: Journal of muscle research and cell motility 41.1, pp. 103–
124. doi: 10.1007/s10974-019-09565-3.
Zhang, Yingnan, Brent A Appleton, Christian Wiesmann, Ted Lau, Mike Costa,
Rami N Hannoush, and Sachdev S Sidhu (2009). “Inhibition of Wnt signaling by
Dishevelled PDZ peptides.” In: Nature Chemical Biology 5.4, pp. 217–219. doi:
10.1038/nchembio.152.
Zhong, Quan, Nicolas Simonis, Qian-Ru Li, Benoit Charloteaux, Fabien Heuze, Niels
Klitgord, Stanley Tam, Haiyuan Yu, Kavitha Venkatesan, Danny Mou, Venus
Swearingen, Muhammed A Yildirim, Han Yan, Amélie Dricot, David Szeto,
Chenwei Lin, Tong Hao, Changyu Fan, Stuart Milstein, Denis Dupuy, Robert
Brasseur, David E Hill, Michael E Cusick, and Marc Vidal (2009). “Edgetic per-
turbation models of human inherited disorders.” In: Molecular Systems Biology
5, p. 321. doi: 10.1038/msb.2009.80.
Zhou, Huan-Xiang (2012). “Intrinsic disorder: signaling via highly specific but short-
lived association.” In: Trends in Biochemical Sciences 37.2, pp. 43–48. doi: 10.
1016/j.tibs.2011.11.002.


Dalmira Hubrich
+49 157 34517760 / dalmiramer@gmail.com / Mainz, Germany / LinkedIn / GitHub
Profile
As a Systems Biologist with a focus on protein networks and interfaces, I gained a solid background in systematic
experimental biology during my PhD, where I also began learning computational skills, including Python and SQL.
While my primary expertise lies in experimental techniques, I am now expanding into bioinformatics and
computational biology, aiming to work more extensively with biological data. I also have a growing interest in artificial
intelligence and its applications in biological research, with a focus on enhancing my computational skills.
Professional Experience
Researcher December 2020-present
Institute of Molecular Biology, Germany
● Established and adapted several techniques (e.g., cloning, site-directed mutagenesis, BRET assay,
bioluminescent imaging) in the lab.
● Curated, extracted, and visualized diverse biological datasets required for my study.
● Provided experimental data and visualization support to PhD students and colleagues.
● Delivered results and contributed to collaborations across multiple projects.
Junior Data Scientist (Part-time) April 2023 - August 2023
Be Factory UG
● Curated, cleaned, and processed large datasets.
● Developed a tool for feature extraction and trained a machine learning model to evaluate products based on
score results.
● Built an entity-relationship model in SQL to manage data more effectively.
● Automated data management processes to streamline workflows.
Education
● Doctor of Philosophy in Life Sciences|Johannes Gutenberg University, Germany|December 2020 - present
● Master in Protein Engineering and Biochemistry| Okinawa Institute of Sci & Tech| August 2017-2019
● Bachelor of Biological Sciences|Nazarbayev University| September 2011-August 2016
Skills
● Proficient in conducting systematic experimental assays to detect protein-protein interactions (PPIs),
including BRET and ITC.
● Experienced with high-content screening equipment, such as Opera Phenix, for live-cell imaging and
working with software like Harmony for comprehensive image analysis.
● Trained in Python OOP and relevant packages, including pandas, scipy, and numpy for data management
and analysis; matplotlib and seaborn for data plotting and visualization; scikit-learn for machine learning
models.
● Proven track record with over four years of experience successfully overseeing various projects and
collaborating with interdisciplinary teams.
● Competent in presenting complex concepts to diverse audiences.
● Organized, adaptable, and always eager to learn and deliver results.
● Experienced with software tools like Git, Bash, Visual Studio Code, PyCharm, SciWheel, Microsoft 365,
Miro, Notion, and Adobe Illustrator.
● Languages: English (fluent), Russian & Kazakh (native speaker), German (A2, ongoing)