Machine Translation (2021) 35:167–203
https://doi.org/10.1007/s10590-021-09266-0
An in‑depth analysis of the individual impact of controlled 
language rules on machine translation output: 
a mixed‑methods approach
Shaimaa Marzouk1 
Received: 15 November 2019 / Accepted: 4 May 2021 / Published online: 11 June 2021 
© The Author(s) 2021
Abstract
Examining the general impact of Controlled Language (CL) rules in the context of 
Machine Translation (MT) has been an area of research for many years. The pre-
sent study focuses on the following question: how do CL rules impact MT output 
individually? By analysing a German corpus-based test suite of technical texts that 
have been translated into English by different MT systems, this study endeavours 
to answer this question at different levels: the general impact of CL rules (rule- and 
system-independent), their impact at rule level (system-independent) as well as at 
rule and system level. The results of five MT systems are analysed and contrasted: 
a rule-based system, a statistical system, two differently constructed hybrid systems, 
and a neural system. For this, a mixed-methods triangulation approach that includes 
error annotation, human evaluation, and automatic evaluation was applied. The data 
was analysed both qualitatively and quantitatively in terms of CL influence on the 
following parameters: number and type of MT errors, style and content quality, and 
scores of two automatic evaluation metrics. In line with many studies, the results 
show a general positive impact of the applied CL rules on the MT output. How-
ever, at rule level, only four rules proved to have positive effects on the aforemen-
tioned parameters; three rules had negative effects on the parameters; and two rules 
did not show any significant impact. At rule and system level, the rules affected the 
MT systems differently, as expected. Rules that had a positive impact on earlier MT 
approaches did not show the same impact on the neural MT approach. Furthermore, 
neural MT delivered distinctly better results than earlier MT approaches, namely 
the highest error-free, style and content quality rates both before and after applying 
the rules, which indicates that neural MT offers a promising solution that no longer 
requires CL rules for improving the MT output.
Keywords Controlled language · Machine translation · Machine translation 
evaluation · Translation quality
*  Shaimaa Marzouk 
 s.marzouk@uni-mainz.de
Extended author information available on the last page of the article
1 Introduction
Applying Controlled Language (CL) is a common pre-editing technique in the 
technical domain. As early as 1974, the Caterpillar Fundamental English CL was 
specifically developed to improve the comprehensibility and translatability of 
technical documentation (Caterpillar 1974). A CL refers to “an explicitly defined 
restriction of a natural language that specifies constraints on lexicon, grammar, 
and style” (Huijsen 1998). Through the different restrictions imposed by the CL, 
applying CL rules allows for the reduction of sentence length, the avoidance of 
complex sentence structures as well as the elimination of ambiguous vocabulary 
and constructions. Several studies found that the application of CL has a positive 
impact on different aspects of MT. As one of the earlier CLs, Caterpillar Techni-
cal English proved to have a significant positive effect on MT productivity (Kam-
prath et al. 1998). Reuther (2003) further concluded that implementing CL rules 
can improve readability and translatability of machine-translated texts. Another 
study found that the level of controlling the source text has a substantial impact 
on the accuracy of MT output (Nyberg and Mitamura 1996). In addition, more 
recent studies examined the impact of CL on post-editing effort. O’Brien (2006) 
found that CL reduces post-editing time. Another positive impact was linked to 
post-editing productivity (Aikawa et al. 2007). Bernth and Gdaniec (2001) introduced 
26 rules for English as a source language that address different text characteristics 
and aim to increase machine translatability. They tested these rules with 
various commercially available MT systems and claimed that they were generalizable 
to different MT systems and language pairs.
Most CL studies have investigated the impact of CL on MT from a holistic 
perspective, i.e. the impact of comprehensive CL rule sets (e.g. Holmback et al. 
1996; Spyridakis et al. 1997; Kamprath et al. 1998; Bernth 1999; Nyberg et al. 
2003; Fiederer and O’Brien 2009). The results of this research provide an overall 
picture of the effect of CL, in which a positive effect of some rules may over-
shadow a negative effect of other rules, which leads to a biased end result. There 
have been a limited number of studies that focused on analysing the influence of 
individual CL rules. The results of these studies (O’Brien 2006; Roturier 2006; 
Roturier et al. 2012) showed that CL rules affected the MT in various ways and to 
different degrees. All of these studies were conducted on CL rules of the English 
language. Roturier et al. (2012) analyzed the impact of the CL rules on MT qual-
ity using automatic translation metrics. The experiments were conducted using a 
phrase-based system (Moses: Koehn et al. (2007)) for the target languages French 
and German. Roturier (2006) also focused on the same target languages, but he 
used a rule-based MT system (Systran) and was interested in analysing the impact 
of CL rules on comprehensibility. O’Brien (2006) examined the impact of CL on 
post-editing effort for German as a target language using a rule-based MT sys-
tem (IBM WebSphere). In a recent study, Marzouk and Hansen-Schirra (2019) 
analysed the impact of a number of CL rules on different MT approaches for the 
language pair German-to-English. Comparing the results of rule-based (RBMT), 
statistical (SMT), hybrid (HMT), and neural (NMT) systems, they found that 
the earlier approaches (RBMT, SMT, and HMT) benefited from the CL rules, 
while the NMT system delivered mostly error-free output both before and after 
the application of the rules, but did show a decrease in quality after applying 
the rules. This paper reports on the same study as Marzouk and Hansen-Schirra 
(2019), shedding more light on the CL rules analysed and providing detailed 
insight into their individual impact on the different approaches.
Given that the MT quality differs depending on the language pair, translation 
direction, domain, and the applied MT system, the impact of each CL rule on the 
MT output should vary in accordance with these variables. Against this background, 
the present study focuses on the technical domain and on one language pair, Ger-
man-to-English. The technical domain is the most common field of application of 
CL. Analysing the language pair German-to-English enables exploring CL rules of 
the German language, a language that has rarely been examined in CL research. In 
doing so, the study keeps the variables language pair, translation direction, and 
domain constant in order to analyse and contrast the individual impact of nine CL 
rules on the MT output of four MT approaches (RBMT, SMT, HMT, and NMT). 
To the best of my knowledge, the NMT approach has not yet been investigated in the 
context of CL in previous studies. The analysis was conducted at 
different levels: rule- and system-independent (general impact), at rule level (sys-
tem-independent), at system level as well as both at rule and system level. The anal-
ysis at system level is presented in Marzouk and Hansen-Schirra (2019), while this 
paper covers the other analysis levels.
Identifying the individual impact of CL rules under the different systems allows 
for an effective implementation of those that have a positive impact. This in turn 
would help limit potential drawbacks of CL application (Lehrndorfer and Reuther 
2008; Drewer and Ziegler 2014) such as the additional time both authors and trans-
lators need in order to consider the restrictions imposed by the rules, interruption of 
the writing flow, difficulty of implementing linguistically complex rules when the 
authors are domain experts with a limited linguistic background, and restriction of 
the authors’ creativity.
The next section provides a description of the dataset. Section  3 outlines the 
applied methodology. Section 4 presents the results. The findings are summarised in 
the conclusion. Finally, the study limitations as well as ideas for future research are 
provided.1
2 Dataset
A test suite was created that consists of 216 source sentences (24 sentences per CL 
rule), extracted from a corpus of ten German user manuals for appliances, software, 
and machines. These sentences violated one of the nine analysed CL rules (before 
CL version). The CL rules were applied to each sentence (after CL version). Both 
1 This paper presents some of the results of a PhD thesis; the dissertation will be published soon 
(Marzouk in press).
versions were translated into English by five MT systems. Accordingly, the dataset 
consisted of 2160 MT sentences (216 source sentences * 2 versions * 5 systems). 
The entire dataset was analysed applying error annotation. 1100 MT sentences of 
the 2160 were evaluated by humans. Reasoning and selection criteria of the human-
evaluated sentences are detailed under Human Evaluation in the Methodology 
section.
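The dataset sizes follow directly from the design of the test suite. As a quick check, the arithmetic can be sketched as follows (the variable names are illustrative and not taken from the study):

```python
# Design of the test suite as described in this section.
N_RULES = 9               # analysed CL rules
SENTENCES_PER_RULE = 24   # source sentences violating each rule
N_VERSIONS = 2            # "before CL" and "after CL"
N_SYSTEMS = 5             # MT systems examined

source_sentences = N_RULES * SENTENCES_PER_RULE
mt_sentences = source_sentences * N_VERSIONS * N_SYSTEMS
mt_per_rule = SENTENCES_PER_RULE * N_VERSIONS * N_SYSTEMS

print(source_sentences, mt_sentences, mt_per_rule)  # 216 2160 240
```

Each rule therefore contributes 240 MT sentences, 120 per scenario.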
The source sentences were extracted using the CLAT CL checker. CLAT 
(Rösener 2010) is one of the best-known CL checkers in Germany and was developed 
by the Society for the Promotion of Applied Information Sciences (IAI) at Saarland 
University. Thanks to a research cooperation with the IAI,2 a license for CLAT was 
provided for research purposes in the present study. One consideration while creating 
the test suite was to have all user manuals represented as evenly as possible.
The rules investigated were taken from the tekom e. V. Guidelines for Technical 
Writing in the German Language “Leitlinie—Regelbasiertes Schreiben—Deutsch 
für die Technische Kommunikation” (tekom 2013). The tekom guidelines are widely 
implemented in Germany, both in research and industry. Thanks to close collabo-
ration between academia, industry, service providers and software companies, the 
tekom rules provide a comprehensive set of rules across all language and documen-
tation levels (tekom 2013). The nine rules analysed were selected based on three cri-
teria: (i) rules that can be applied to just one sentence, as the analysis is conducted 
at sentence level; (ii) rules that can be applied in all respective sentences according 
to one fixed pattern (see Table 1 “How the rules were applied”) in order to limit the 
number of independent variables; and (iii) rules for which the so-called CL position 
can be defined. Since different factors jointly influence the entire MT output at the 
same time (e.g. the MT system approach, training data together with the applica-
tion or non-application of the CL rule etc.), it was necessary for the analysis to only 
focus on the word or word group directly related to the CL rule, referred to as the CL 
position. The CL position was defined as the part of the source sentence that has to 
be modified in order to apply the CL rule and its equivalence in the target sentence. 
A rule like “formulate short sentences” refers to the whole sentence. Therefore, it is 
not possible to define a specific CL position that can be analysed and compared in 
the error annotation or by human evaluators. Based on these criteria, the nine rules 
depicted in Table 1 were analysed.
The five MT systems examined were: the hybrid MT system “Bing” by Micro-
soft,3 the neural MT system “Google Translate” (Wu et al. 2016),4 the rule-based 
MT system “Lucy LT KWIK Translator” (cf. Alonso Martín and Serra 2014),5 
the statistical MT system “SDL Free Translation”,6 and another hybrid MT sys-
tem “Systran”.7 Since hybrid systems are structured differently, the study uses two 
2 http://www.iai-sb.de/de/produkte/clat.
3 https://www.bing.com/translator/.
4 https://translate.google.de/.
5 http://www.lucysoftware.com/english/machine-translation/lucy-lt-kwik-translator-/.
6 https://www.freetranslation.com/de/.
7 http://www.systranet.com/translate.
Table 1  Analysed CL rules and their application pattern

| Rule | Analysed rule | How the rule was applied |
| --- | --- | --- |
| Rule 1 (anz) | Using straight quotes for interface texts | By entering the interface text between straight quotes, see Example 1 |
| Rule 2 (fvg) | Avoiding light-verb construction (Funktionsverbgefüge) | By using the meaning-bearing verb instead of the light verb construction, see Example 2 |
| Rule 3 (kos) | Formulating conditions as ‘if’ sentences | By starting the conditional sentence by “if” instead of the verb, see Example 3 |
| Rule 4 (nsp) | Using unambiguous pronominal references | By replacing the pronoun by its pronominal reference, see Example 4 |
| Rule 5 (pak) | Avoiding participial constructions | By generating a subordinate clause based on the participial construction, see Example 5a and 5b |
| Rule 6 (pas) | Avoiding passives | By using the active voice, see Example 6 |
| Rule 7 (per) | Avoiding the construction sein + zu + infinitive | By using the imperative instead of the construction sein + zu + infinitive, see Example 7 |
| Rule 8 (prä) | Avoiding superfluous prefixes | By eliminating the superfluous prefix, see Example 8 |
| Rule 9 (wte) | Avoiding omitting parts of the words | Missing parts of words were completed, see Example 9 |
hybrid systems: Bing is a statistical MT system with language-specific rule compo-
nents, while Systran was originally a rule-based system and was later further devel-
oped into a hybrid system (cf. Werthmann and Witt 2014, p. 84). Accordingly, both 
yielded different outputs. The selection criteria of the systems were as follows: (i) to 
be an online freely available system, (ii) to offer the language pair German-to-Eng-
lish, and (iii) to cover different MT approaches. The systems examined are therefore 
generic black-box systems. A black-box system is “a system which has been trained 
and tuned a priori and for which we cannot access the model parameters or training 
data for fine-tuning or improvements” (Mehta et  al. 2020, p. 1). Accordingly, the 
systems were not trained in advance with specific relevant corpora. Such training 
would have an impact on the results (i.e. better results in the controlled scenario if 
the corpora were controlled and vice versa) (cf. Reuther 2003). In addition, a reason-
able comparison of the results of the different systems would not be feasible, as the 
corpus-based systems would have—depending on the degree of control—an advan-
tage or disadvantage over the rule-based system. In order to overcome the expected 
difficulty of machine translating company-specific and specialist terms, such terms 
were replaced with common terms. A detailed description of the corpus-based test 
suite and its preparation steps is provided in Marzouk and Hansen-Schirra (2019). 
The dataset was machine-translated at the end of 2016.
3 Methodology
A triphasic mixed-methods triangulation approach was applied that incorporates 
three evaluation methods: error annotation, human evaluation, and automatic evalu-
ation. These methods were carried out in the order shown below.
3.1  Error annotation
The goal of the error annotation is to identify the MT errors before and after applying 
the CL rules and compare them in terms of their number and type. The annotation 
was conducted by a qualified, experienced German-English translator and checked by 
two professional German-English translators. Due to the large number of MT sentences 
(2160 sentences), each evaluator checked a different half of the set. Each evaluator 
had to indicate whether they agreed with the annotation or not. If not, they had to 
reannotate the translation. The percentage of reannotated sentences was 27% for the 
first evaluator and 31% for the second evaluator. In case of reannotation, the other 
evaluator checked both annotations and chose one.
Furthermore, based on the existence or non-existence of errors within the CL 
position, the data was divided into four groups, referred to as annotation groups. 
These are:
– FF (False–False): the translation contains an error both before and after CL;
– FR (False–Right): the translation contains an error only before CL;
– RF (Right–False): the translation contains an error only after CL;
– RR (Right–Right): the translation contains no error before or after CL.
Table 2 shows the error classification (cf. Vilar et al. 2006) applied.
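The assignment to annotation groups is a simple mapping from the error status of the CL position in the two scenarios. A minimal sketch, assuming each sentence pair is reduced to two Boolean flags (the function name and the sample data are hypothetical):

```python
from collections import Counter

def annotation_group(error_before: bool, error_after: bool) -> str:
    """Map the error status of the CL position before/after CL to FF, FR, RF or RR."""
    first = "F" if error_before else "R"
    second = "F" if error_after else "R"
    return first + second

# Hypothetical error flags for five sentence pairs: (before CL, after CL).
pairs = [(True, True), (True, False), (False, True), (False, False), (False, False)]
counts = Counter(annotation_group(before, after) for before, after in pairs)
shares = {group: 100 * n / len(pairs) for group, n in counts.items()}

print(shares)  # percentage of sentence pairs per annotation group
```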
The error taxonomy of Vilar et al. (2006) was used as a basis of the error annotation 
due to its explicitness, integrity and appropriate degree of granularity. However, 
other, more extensive taxonomies, such as the Multidimensional Quality Metrics 
(MQM) framework (Lommel 2018), could also be used for the analysis. This would be 
particularly useful when examining fine-grained or more specific types of errors.
For the comparison of the number of errors before vs. after CL, the Wilcoxon 
significance test was used, since the analysed variables were ordinal. For the comparison 
of the error types before vs. after CL, the McNemar significance test was 
used, which is designed for related dichotomous variables. Differences were considered 
significant at p < 0.05.
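To illustrate how the McNemar test treats related dichotomous variables, the following sketch computes a two-sided exact p-value from the two discordant counts only. The source does not state whether the exact or the chi-square variant was used, so the exact variant is an assumption here, and the counts are hypothetical:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: sentence pairs in which the error type occurs only before CL
    c: sentence pairs in which the error type occurs only after CL
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    k = min(b, c)
    # Under H0 each discordant pair falls on either side with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

ALPHA = 0.05                          # significance level used in the study
p_value = mcnemar_exact(b=14, c=3)    # hypothetical counts
significant = p_value < ALPHA
```

Pairs in which the error type is present (or absent) in both scenarios do not enter the statistic, which is why only `b` and `c` are needed.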
3.2 Human evaluation
The goal of the human evaluation is to compare the content and style quality of the 
MT within the CL position (not the quality of the entire sentence). Following the 
quality definition of Hutchins and Somers (1992):
• The content quality is the extent to which the translation reflects the information 
in the source text accurately; and the extent to which the translation is easy to 
understand. (ibid.)
• The style quality is the extent to which the translation sounds natural and idi-
omatic in Standard Written English, is appropriate for the intention of its con-
tent (ibid., Fiederer and O’Brien 2009) as well as presented clearly in terms of 
orthography. The definition covers orthography as an instrument for presenting 
the content in an adequate way that serves its intention.
Based on these definitions, the content quality covers the criteria accuracy and 
clarity; the style quality encompasses the criteria idiomaticity, appropriateness to the 
content intention as well as correctness and clarity of the orthographic presentation.
As the human evaluation aimed to compare the content and style quality of the 
MT within the CL position (not the quality of the entire sentence), it was necessary 
to initially correct all errors outside the CL position by applying the fewest possible 
edits to the MT output. This preliminary step was essential in order to keep the eval-
uators focused on the CL position. Otherwise, the participants would have evaluated 
the entire MT, commenting on all errors, although not all errors are related to the 
CL rule. As a result of this step, both versions of the MT (before and after applying 
the CL rule) were identical except for the CL position (see examples provided in the 
Results section).
Such corrections were not possible for all annotated sentences without affect-
ing the CL position. Therefore, the study defined certain criteria that an MT has 
to fulfil in order to be included for human evaluation: For the MT sentences that 
contained errors within the CL position, only MT sentences with a maximum of two 
wrong words were included in the human evaluation. For the MT sentences that did 
not contain errors within the CL position, only MT sentences with a maximum of 
three wrong words were included in the human evaluation. The goal of setting these 
criteria was to avoid making too many corrections in the MT that may impact the 
evaluation of the CL position. For example, in the rule “Avoiding the construction 
sein + zu + infinitive”, the sentence “Das Kaufdatum ist durch eine Kaufquittung zu 
belegen” was translated by Bing as “By a purchase receipt to prove [THE]8 date of 
purchase”. In addition to the wrong translation of the CL position “ist zu belegen”, 
the MT included other errors outside the CL position. Obviously, correcting these 
errors would have substantially changed the MT. The total number of excluded MT 
sentences according to these criteria was 595 (Excluded.1).
To assure the idiomaticity of the MT sentences after correcting the errors out-
side the CL position, two professional translators checked the MT output for stylistic 
acceptability. The focus of this acceptability check was to ensure that the MT out-
side the CL position remained stylistically acceptable after applying the fewest pos-
sible edits. If an MT sentence was evaluated as unacceptable by both translators, the 
MT sentence was excluded from the human evaluation. However, exclusion cases 
due to stylistic non-acceptance were rare; only 15 sentences (Excluded.2).
After undertaking these steps, the total number of MT sentences excluded from 
the human evaluation was 610 (Excluded.1 + Excluded.2), see Table 3. 
The remaining 1550 (out of 2160) MT sentences included a total of 545 MT sentences 
that were identical across different systems, i.e. the source sentences were 
identically translated by different systems. It would have been inefficient to ask 
the participants to evaluate 545 identical sentences. Therefore, for each source 
sentence, only one instance of the identical MT sentences was human-evaluated; 
95 instances out of 545 are included in the 1100 human-evaluated MT sentences. 
The human evaluation scores of these 95 instances were then applied to the other 
repeated instances (450 out of 545). Thus, the results reported below are based on 
a total of 1550 MT sentences (1100 + 450).
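The bookkeeping behind these figures can be retraced in a few lines. The sketch below also shows one possible way of copying a score from the rated instance to identical translations produced by other systems; the `propagate` helper, its key layout and the sample data are hypothetical and not taken from the study:

```python
TOTAL_MT = 2160
EXCLUDED_1 = 595       # too many wrong words to correct
EXCLUDED_2 = 15        # stylistically unacceptable after correction
IDENTICAL = 545        # MT sentences identical across systems
RATED_IDENTICAL = 95   # one rated instance per group of identical sentences

remaining = TOTAL_MT - (EXCLUDED_1 + EXCLUDED_2)
propagated = IDENTICAL - RATED_IDENTICAL
human_evaluated = remaining - propagated

def propagate(scores: dict, instances: list) -> dict:
    """Copy the score of a rated translation to every system with the same output.

    scores: maps (source_id, version, mt_text) to a rated quality score
    instances: (system, source_id, version, mt_text) tuples
    """
    return {(system, sid, ver): scores[(sid, ver, text)]
            for system, sid, ver, text in instances}

# Hypothetical example: two systems produced the same translation.
scores = {(17, "after", "Press the button."): 5}
instances = [("Bing", 17, "after", "Press the button."),
             ("Google", 17, "after", "Press the button.")]
shared = propagate(scores, instances)

print(remaining, propagated, human_evaluated)  # 1550 450 1100
```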
In the next step, the researcher verified whether the annotation groups within the 
1550 MT sentences were comparable in the error annotation and human evaluation 
(Table 4). For example, in the error annotation 44.5% of MT sentences were error-
free both before and after the CL application (group RR); in the human evaluation 
the analysed percentage of the group RR was comparable (44.2%).
The distribution of the 1550 MT sentences of the human evaluation across the 
MT systems is shown in Table 5:
The human evaluation (Fig. 1) consisted of:
– Evaluating the style and content quality of the MT (see (*) in Fig.  1) on two 
5-point Likert scales ((1) in Fig. 1);
– Selecting the relevant quality criteria that justify the assigned quality scores: 
accuracy and clarity under the content quality; idiomaticity, appropriateness to 
8 [THE] was not included in the MT.
the content intention as well as correctness and clarity of the orthographic pres-
entation under the style quality ((2) in Fig. 1);
– Providing the word or part of the translation relevant to each chosen criterion ((3) 
in Fig. 1);
– If many modifications were necessary, the participant had to enter an alternative 
translation for the whole sentence ((4) in Fig. 1).
Regarding the participants, different studies recommend recruiting more than 
3–4 participants (Fiederer and O’Brien 2009). In this study, five participants ini-
tially carried out the tests and the number of participants was successively increased 
until the accumulated average of the quality values stabilised. After the eighth par-
ticipant was added, the accumulated quality averages remained largely unchanged. 
Accordingly, the number of participants was not increased further. The participants 
are native English speakers and hold a bachelor’s degree in translation. In addition, 
all participants were students in the last or penultimate semester of a master’s degree 
program in translation. Each participant had to evaluate the entire set of 1100 MT 
sentences. Participation was remunerated.
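The stopping rule for recruiting participants can be sketched as follows. The source only states that recruitment ended once the accumulated average "remained largely unchanged"; the tolerance value, the function name and the sample scores below are assumptions made for illustration:

```python
def participants_needed(scores_by_participant, start=5, tolerance=0.05):
    """Return the number of participants at which the accumulated mean stabilises.

    scores_by_participant: one mean quality score per participant, in recruitment order
    start: size of the initial group
    tolerance: largest change in the accumulated mean still treated as stable (assumed)
    """
    def mean(values):
        return sum(values) / len(values)

    previous = mean(scores_by_participant[:start])
    for n in range(start + 1, len(scores_by_participant) + 1):
        current = mean(scores_by_participant[:n])
        if abs(current - previous) <= tolerance:
            return n
        previous = current
    return None  # the mean never stabilised with the available participants

# Hypothetical mean quality scores of nine successively recruited participants.
sample = [4.0, 3.0, 4.0, 3.0, 4.0, 2.0, 4.4, 3.5, 3.5]
stable_at = participants_needed(sample)
```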
Regarding the test procedure, the 1100 MT sentences were randomised and split 
into 44 tests. Each participant had the opportunity to choose whether to rate one, 
two or three tests per day, depending on his or her availability. The basic require-
ment was to evaluate at least one test daily, thus avoiding interruptions that could 
possibly have a negative effect on the intra-rater agreement. In addition, the partici-
pants were asked to take a break between the tests. The 44 tests were sent in a differ-
ent randomised order to the participants, e.g. the 1st participant received test 40, test 
8, test 5 consecutively. A decreasing motivation over a 3–4-week evaluation period 
is unavoidable. Therefore, this randomisation ensured that no particular sentences 
were evaluated by all participants at the end of the evaluation. The tester received 
the completed tests every day and checked them for completeness (i.e. all sentences 
were rated and commented if necessary). In case of any missing data, the participant 
was asked to complete them, then he or she received the new tests for the next day.
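The randomisation described above has two parts: splitting the shuffled sentences into 44 tests and giving each participant the tests in a different order. A minimal sketch, assuming integer sentence identifiers and fixed seeds for reproducibility (both are assumptions, as are the function names):

```python
import random

def build_tests(sentence_ids, n_tests=44, seed=0):
    """Shuffle the sentences and split them into equally sized tests.

    Assumes the number of sentences is divisible by n_tests (1100 / 44 = 25).
    """
    rng = random.Random(seed)
    ids = list(sentence_ids)
    rng.shuffle(ids)
    size = len(ids) // n_tests
    return [ids[i * size:(i + 1) * size] for i in range(n_tests)]

def order_for_participant(n_tests, participant, seed=0):
    """Give each participant an independently shuffled order of test numbers."""
    rng = random.Random(seed * 1000 + participant)
    order = list(range(1, n_tests + 1))
    rng.shuffle(order)
    return order

tests = build_tests(range(1100))
orders = {p: order_for_participant(44, p) for p in range(1, 9)}
```

Because every participant works through the tests in a different sequence, any drop in attention towards the end of the evaluation period is spread across different sentences.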
For the comparison of the style and content quality before vs. after CL, the Wil-
coxon test was used, as not all quality variables were normally distributed. In order 
to measure the correlation between the error types and the quality, the Spearman 
correlation test was used because one of the analysed variables was ordinal.
3.3 Automatic evaluation
The alternative translation obtained from the human evaluation acted as a reference 
translation for the automatic evaluation metrics (AEMs) in order to compare their 
scores before and after applying each CL rule. Two reference translations per sen-
tence were randomly selected for the comparison. The study applied the TERbase 
and hLEPOR evaluation metrics. The former is a basic edit distance metric that cal-
culates the minimum number of edits needed to change the evaluated MT so that it 
exactly matches the reference translation and works without stemming, synonymy 
lookup and paraphrase support (Snover et al. 2006, Gonzàlez and Giménez 2014). 
It was necessary to consider the use of synonyms as an edit, as the participants quite 
often recommended the use of a certain synonym while evaluating the translation 
accuracy. TERbase works with negative values; its score ranges between − 1 (worst 
value) and 0 (best value). At the same time, hLEPOR was applied as one of the 
advanced metrics that has proven to have a state-of-the-art correlation with human 
evaluation compared with metrics like BLEU (Papineni et al. 2002), TER (Snover 
et al. 2006), and METEOR (Banerjee and Lavie 2005) amongst others (Han et al. 
2013). The calculation model of hLEPOR is based on three factors: an enhanced 
length penalty, an N-gram position difference penalty and the harmonic mean of 
precision and recall (Han et al. 2013). hLEPOR works with positive values; its score 
ranges between 0 (worst value) and 1 (best value). The impact of the CL application 
was measured on the basis of the difference of the “mean after CL” minus the “mean 
before CL”. Therefore, a positive difference indicates an improvement in the AEM 
score, and conversely, a negative difference indicates a deterioration in the AEM 
score.
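The idea of an edit-distance metric and of the "after CL minus before CL" difference can be illustrated with a simplified sketch. It counts word insertions, deletions and substitutions only and omits the block shifts that TER also allows, so it is not the TERbase implementation used in the study; the negation and the clipping at −1 are assumptions chosen to match the score range described above, and the reference sentence is hypothetical:

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Minimum number of word insertions, deletions and substitutions."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1            # exact match, no synonym lookup
            current.append(min(previous[j] + 1,          # delete h
                               current[j - 1] + 1,       # insert r
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]

def negated_edit_rate(hypothesis: str, reference: str) -> float:
    """Edit rate mapped to the range -1 (worst) to 0 (best); clipping is assumed."""
    ref_len = max(1, len(reference.split()))
    return 0.0 - min(1.0, word_edit_distance(hypothesis, reference) / ref_len)

reference = "Then select the option Install software automatically"
before = negated_edit_rate(
    "Then select the option software automatically install", reference)
after = negated_edit_rate(
    "Then select the option Install software automatically", reference)
difference = after - before   # positive: the score improved after CL
```

Since words are compared by exact match, a synonym counts as an edit, which corresponds to the behaviour described for TERbase.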
Finally, using the Spearman correlation test, the study investigated how the difference 
in the AEM scores (after CL minus before CL) of TERbase and hLEPOR 
correlates with the difference in the overall quality.9 The Spearman correlation test 
was used because not all variables were normally distributed.
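For readers unfamiliar with the procedure, Spearman's coefficient is the Pearson correlation of the rank-transformed values, with tied values receiving the mean of their ranks. The sketch below uses only the standard library; the per-sentence differences are hypothetical, and the overall quality is taken as the mean of style and content quality as defined in footnote 9:

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the two rank vectors (undefined if one is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-sentence differences (after CL minus before CL).
sq_diff = [0.5, 0.0, -0.5, 1.0]
cq_diff = [1.0, 0.0, -0.5, 0.5]
overall_diff = [(s + c) / 2 for s, c in zip(sq_diff, cq_diff)]
aem_diff = [0.10, 0.02, -0.05, 0.20]
rho = spearman_rho(overall_diff, aem_diff)
```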
4  Results
4.1 The general impact of CL application
The results of the error annotation showed that the number of errors decreased 
significantly across the rules by 23.5% (z (N = 1080) = −5.589, p < 0.001). Based on 
the human evaluation, the style quality (SQ) increased by 1.7% (z (N = 775) = −2.062, 
p = 0.039) and the content quality (CQ) improved even more, by 2.9% (z (N = 775) = 
−4.566, p < 0.001), after applying the rules.10 With regard to the automatic 
evaluation, both AEM scores rose slightly after the application of the rules. 
Furthermore, the Spearman correlation test showed a significant, strong positive 
correlation between the difference in the overall quality and the differences in the scores 
of TERbase (ρ (N = 775) = 0.520, p < 0.001) and hLEPOR (ρ (N = 775) = 0.519, 
p < 0.001), which indicates that an increase in the overall quality (i.e. mean of SQ 
and CQ) was accompanied by an improvement in the AEM scores.
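The reported p-values can be related to the z statistics with a short calculation. The sketch assumes that the z values stem from the usual normal approximation of the Wilcoxon test and that the p-values are two-sided; neither detail is stated explicitly in the text:

```python
from math import erfc, sqrt

def two_sided_p(z: float) -> float:
    """Two-sided p-value of a standard normal test statistic."""
    return erfc(abs(z) / sqrt(2))

# z statistics reported for the general impact of the CL rules.
reported = {"errors": -5.589, "style quality": -2.062, "content quality": -4.566}
p_values = {name: two_sided_p(z) for name, z in reported.items()}

for name, p in p_values.items():
    print(f"{name}: p = {p:.3g}")
```

Under these assumptions, the style quality value rounds to the reported p = 0.039, and the other two fall below 0.001.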
Accordingly, the general impact is consistent with the results of previous stud-
ies that found that CL application improves MT output (cf. Nyberg and Mitamura 
1996; Bernth 1999; Bernth and Gdaniec 2001, p 208; Drewer and Ziegler 2014, p 
9 The overall quality is the mean of the quality of style and quality of content, as analysing the correla-
tion here requires no distinction between the quality parameters.
10 As mentioned in Sect. 3.2, the results reported on the human evaluation are based on the total number 
of 1550 MT sentences, which is shown here as N = 775 referring to the comparison of 775 MT sentences 
of the “before CL scenario” with the 775 MT sentences of the “after CL scenario”.
196). Nonetheless, the different changes in style and content quality after imple-
menting the rules pose the question: which cases specifically displayed a marked 
increase in content quality over style? This can be answered at rule level.
4.2  The impact of individual CL rules
The analysis of the annotation groups (FF, FR, RF, RR) revealed that, based on 
the existence and non-existence of MT errors, the CL impact cannot be effec-
tively considered positive. The only positive impact can be observed in the FR 
group (False before CL – Right after CL). This group ranges merely between 8% 
(rule “pas—Avoiding passives”) and 31% (rule “anz—Using straight quotes for 
interface texts”); see Fig. 2.
In the RF annotation group (Right before CL – False after CL), the CL impact 
is clearly negative. The most dominant annotation groups in all rules were RR 
and FF. Since the translations were error-free (RR group) or faulty (FF group) 
both before and after the CL application, a positive impact of a certain rule can 
only be justified, if the quality values of these two groups increased after rule 
application. A quality increase in the RR group would mean that the quality of a 
correct MT after CL is higher (e.g. stylistically better) than that of a correct MT 
before CL. Similarly, a quality increase in the FF group would imply that compar-
ing two wrong translations before and after CL, the quality of the wrong MT after 
CL is higher (e.g. includes a less severe error type).
In order to explore quality changes in each annotation group, a triangulation of 
the results of the error annotation and human evaluation was performed. Table 6 
summarises how the style and content quality changed after the application of 
each rule at annotation group level.
Only two CL rules proved to have a positive impact on MT quality:
– The first one is “anz—Using straight quotes for interface texts”. This was 
the only rule in which the SQ and CQ of the FF and RR groups significantly 
increased after rule application. In addition, the highest percentage of the FR 
group (31%) and the lowest percentage of the RF group (2%) were represented 
in this CL rule. This shows a clear positive impact of using straight quotes for 
interface texts on MT quality.
– The second rule is “per—Avoiding the construction sein + zu + infinitive”. 
Avoiding sein + zu + infinitive (comparable to the structure to be + to + base 
infinitive) improved the SQ in the FF and RR group significantly, whereas 
the CQ did not show a significant change. Furthermore, both the SQ and CQ 
increased in the FR group significantly. Hence, using an imperative (after CL) 
instead of sein + zu + infinitive (before CL) had a positive impact on the MT 
output, particularly on the SQ.
Negative impacts of CL rules can be observed in the following three rules:
Example 1  Rule “anz—using straight quotes for interface texts”
Before CL Wählen Sie danach die Option Software automatisch installieren
Then select the option software automatically install
After CL Wählen Sie danach die Option "Software automatisch installieren"
Then select the option "Install software automatically"
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts
– In the rule “pak—Avoiding participial constructions”, the share of the RF group
(22%) was more than twice as high as that of FR (9%). Further examination of the
FF (42%) and RR (28%) groups shows that the SQ decreased significantly in FF and
both the SQ and CQ decreased significantly in RR, i.e. when comparing two correct
MTs before and after applying the rule, the SQ and CQ were higher in cases where
participial constructions were used (before CL). In the FR group, only the CQ
increased significantly while the SQ increase was marginal. The results, accordingly,
indicate the difficulty of the MT of participial constructions (before CL)
and show at the same time that substituting them with subordinate clauses (after
CL) was linked to quality deterioration, particularly regarding style.
– Likewise, in the case of the rule “pas—Avoiding passives”, the representation of 
the RF group (13%) was larger than that of FR (8%). However, unlike the rule 
“pak”, the RR group (49%) was much higher than FF (29%), which shows that the 
systems were able to translate nearly half of the sentences in both scenarios (pas-
sive and active) correctly. In the FF group, using active voice (after CL) resulted 
in a significantly lower SQ. Even in the FR group, the increases in the SQ and CQ 
were not significant. In the RR group, the quality values did not change signifi-
cantly after using the active voice when compared to the passive voice.
– For the rule “wte—Avoiding omitting parts of the words”, in the FR group, only 
the CQ increased significantly. The application of this rule had a noticeable nega-
tive impact on the SQ, which was also detected in the FR, FF, and RR groups. 
This revealed that repetition associated with the usage of complete words instead 
of their reduced forms (see Example 9) was stylistically unacceptable.
The rest of the rules (“fvg—Avoiding light-verb construction (Funktionsverbgefüge)”,
“kos—Formulating conditions as ‘if’ sentences”, “nsp—Using
unambiguous pronominal references”, and “prä—Avoiding superfluous prefixes”)
did not show a significant impact on MT quality at this analysis level. None of
the changes in the quality values of these four rules in the FF and RR groups was
significant.
Thus, in spite of the general positive impact of the CL application shown in
Sect. 4.1, only two of the nine rules demonstrated a positive impact on MT output
at annotation group level. This result prompted the researcher to examine
the data further from different perspectives in order to find out whether further
rules can be recommended for improving MT output.
The analysis of the quality changes based on the human and automatic evalu-
ations confirms to a large extent the results obtained at annotation group level 
(Table 6):
The rules “anz—Using straight quotes for interface texts” and “per—Avoiding 
the construction sein + zu + infinitive” showed again a significant positive impact 
on MT quality. The scores of TERbase and hLEPOR improved; at the same time, 
based on the human evaluation, a significant positive impact on both SQ and CQ 
was detected at rule level.
– Regarding “anz”, the positive impact on the style was due to the clear ortho-
graphic presentation of the interface texts (see Example 1). In addition, using 
straight quotes enhanced the appropriateness of the translation to the intention of 
its content.
– Concerning “per”, the evaluators found that using the imperative instead of the
construction sein + zu + infinitive was stylistically better, as it addressed the
reader directly and incited him or her to act (for example, “Before the
parameterization, configure the controller” instead of “Before the parameterization, the
controller is to be configured”). Regarding the CQ, both the accuracy and clarity
increased after rule application, with the effect on clarity being greater.
The effect of the rules “pak—Avoiding participial constructions”, “pas—Avoid-
ing passives”, and “wte—Avoiding omitting parts of the words” was negative:
– The rule “pak” was applied by generating a subordinate clause based on the par-
ticipial construction. The human evaluation revealed that the evaluators found 
the MT of the participial construction more idiomatic than that of the subordi-
nate clause (see Example 5b). Accordingly, the SQ decreased significantly while 
the CQ decrease was not significant. The automatic evaluation confirmed this 
result, showing a significant decrease in the quality scores of TERbase and hLE-
POR.
– In the case of “wte”, because of noun repetition (instead of using the reduced 
form, see Example 9), the evaluators judged the MT as unnatural. Thus, the SQ 
dropped significantly, while the CQ decreased but not significantly. In addition, 
the scores of both AEMs declined significantly.
– In “pas”, all quality parameters (SQ, CQ, and both AEMs scores) decreased
significantly after rule application. Avoiding the use of passive voice is a widely
recommended CL rule. Several studies argue that avoiding the passive voice
improves machine translatability, as it enables circumventing grammatical
parsing issues (Bernth and Gdaniec 2001; Reuther 2003; Fiederer and O’Brien
2009). In the human evaluation, however, the evaluators considered the active
voice (after CL) stylistically not ideal for the intention of the sentence.
Concerning the content quality, the accuracy was judged to be lower after rule
application (see Example 6).
Example 2  Rule “fvg—avoiding light-verb construction”
Before CL Die Reinigung der Küchenmöbel sollten Sie mit einem leicht feuchten Tuch vornehmen
The cleaning of the kitchen furniture you should start with a slightly damp cloth
After CL Sie sollten die Küchenmöbel mit einem leicht feuchten Tuch reinigen
You should clean the kitchen furniture with a slightly damp cloth
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts
Example 3  Rule “kos—formulating conditions as ‘if’ sentences”
Before CL Steht ein normierter Faktor zur Verfügung, kann dieser Faktor direkt in der Eingabemaske eingegeben werden
XXX Is a standardized factor available, this factor can be entered directly in the input mask
After CL Wenn ein normierter Faktor zur Verfügung steht, kann dieser Faktor direkt in der Eingabemaske eingegeben werden
If a standardized factor is available, this factor can be entered directly in the input mask
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts; XXX refers to an omission
Parallel to the triangulated results of the error annotation and human evaluation 
(Table 6), the AEMs scores (both TERbase and hLEPOR) as well as the human 
scores (of SQ and CQ) did not reflect a significant impact of the rules “nsp—Using 
unambiguous pronominal references” and “prä—Avoiding superfluous prefixes” on 
MT quality:
– Regarding “nsp”, the decision of using a pronoun or substituting it with its refer-
ence (i.e. applying or rejecting the rule) is usually made on a case-by-case basis 
depending on the sentence and the formulation of the sentences that precede and 
follow it (Bernth and Gdaniec 2001). That may be the reason why no significant 
impact could be detected. When identifying the reference was difficult, the usage 
of the pronominal reference was advantageous, which resulted in an increase in 
MT clarity. However, in some cases, the repetition of the pronominal reference 
was criticised.
– Regarding the rule “prä”, the RR annotation group was very dominant (67.5%). 
If the MT systems were able to translate a verb with and without a superflu-
ous prefix (e.g. anbieten and bieten) correctly, the translations were identical in 
both cases (correct translation: offer). Having a large number of correct identical 
translations led to achieving comparable quality before and after rule application.
For the last two rules “fvg—Avoiding light-verb construction (Funktionsverb-
gefüge)” and “kos—Formulating conditions as ‘if’ sentences”, while the analysis of 
the annotation groups did not reflect a substantial quality increase (only the quality 
in FR increased significantly, see Table 6), in the human and automatic evaluations, 
an improvement of MT quality was observed after rule application.
– With regard to the rule “fvg—Avoiding light-verb construction (Funktionsverb-
gefüge)”, a significant increase in MT quality (SQ, CQ, and both AEMs scores) 
was detected after rule application. Using the meaning-bearing verb instead of 
the light-verb construction enhanced the MT semantically and lexically, as not all 
German light verbs have a counterpart in English (see Example 2). This in turn 
made the translation more appropriate to its intention and easier to understand.
– Regarding “kos—Formulating conditions as ‘if’ sentences”, the automatic evalu-
ation (both TERbase and hLEPOR) showed only a slight quality increase after 
the rule application. However, based on the human evaluation, a significant 
improvement in the SQ and CQ was found. Leaving out the conditional conjunc-
tion ‘wenn’ (if), which is grammatically possible in German but not in English, 
caused grammatical parsing issues (see Example 3). The rule application enabled 
better parsing; the quality scores therefore increased with regard to accuracy, 
clarity, and idiomaticity.
As for the correlation between the difference in the overall quality and the differ-
ences in the AEMs scores, the Spearman correlation test showed a significant strong 
positive correlation for the rules “fvg”, “kos” and “pas” (ρ > 0.5) and a significant 
moderate positive correlation in the remaining rules (ρ > 0.3). Accordingly, the qual-
ity changes detected in both analyses (human and automatic evaluation) were in line 
with each other.
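To make the correlation step concrete, the following minimal sketch computes Spearman's ρ as the Pearson correlation of average ranks and labels its strength with the thresholds reported above (ρ > 0.5 strong, ρ > 0.3 moderate). The function names and the toy values are hypothetical and not taken from the study, and the significance test (p-value) is deliberately omitted.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return cov / (sxx * syy) ** 0.5

def strength(rho):
    """Labels follow the thresholds used in the text (positive correlations only)."""
    if rho > 0.5:
        return "strong"
    if rho > 0.3:
        return "moderate"
    return "weak or none"

# Hypothetical toy differences (after CL minus before CL), NOT study data.
diff_quality = [0.5, -0.2, 1.0, 0.0, 0.3]
diff_aem = [0.04, -0.01, 0.10, 0.02, 0.01]
rho = spearman_rho(diff_quality, diff_aem)
print(round(rho, 2), strength(rho))
```

In practice, a statistics library such as SciPy (`scipy.stats.spearmanr`) would also return the p-value needed to judge significance.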
4.3 The impact of individual CL rules at MT system level
So far, the results at rule level showed that four rules had a positive impact on the 
MT quality (“anz—Using straight quotes for interface texts”, “per—Avoiding the 
construction sein + zu + infinitive”, “fvg—Avoiding light-verb construction”, and 
“kos—Formulating conditions as ‘if’ sentences”) and three rules tended to have a 
negative impact on MT quality (“pak—Avoiding participial constructions”, “pas—
Avoiding passives”, and “wte—Avoiding omitting parts of the words”)—particu-
larly regarding style quality. For these seven rules, the following analysis at MT sys-
tem level explored which systems displayed the identified effect.
The impact of the two remaining rules, “nsp—Using unambiguous pronomi-
nal references” and “prä—Avoiding superfluous prefixes”, was not conclusive. For 
these two rules, the following analysis at MT system level closely examined whether 
a significant impact was traceable within a certain MT system.
Rule 1 “anz—Using straight quotes for interface texts” was the only rule associ-
ated with a reduction in the number of errors (Fig. 3) as well as an improvement in 
SQ and CQ (Fig. 4) in all MT systems. The decrease in the number of errors after 
rule application was not significant in the case of the NMT system (Google Trans-
late) and one hybrid system (Systran) for different reasons; in Google Translate, the 
number of errors was very small (5 errors before CL, 1 error after CL). The per-
centage of sentences translated correctly both before and after the usage of straight 
quotes was 83% in Google Translate, followed by only 17% in SDL. Accordingly, 
the SQ and CQ were the highest in Google Translate in both scenarios. In Systran, 
the number of errors was very high and barely changed (a decrease from 42 to 41 
errors).
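The two non-significant cases arise for opposite reasons, which a short calculation of the relative change in error counts makes visible. The sketch below uses only the counts reported above; the helper name is hypothetical, and the significance test applied in the study is not reproduced here.

```python
def relative_change(before: int, after: int) -> float:
    """Percentage change in the number of errors after CL (negative = fewer errors)."""
    return 100.0 * (after - before) / before

# Error counts for rule "anz" as reported in the text: (before CL, after CL).
counts = {"Google Translate": (5, 1), "Systran": (42, 41)}

for system, (before, after) in counts.items():
    change = relative_change(before, after)
    print(f"{system}: {before} -> {after} errors ({change:+.1f}%)")
```

In Google Translate the relative reduction is large but rests on very few errors, whereas in Systran the number of errors is high and hardly changes.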
As Example 1 shows, separating the interface text Software automatisch instal-
lieren using quotes enabled the systems to identify the text as a caption. This in 
turn improved the parsing of the source sentence. Accordingly, two error types 
(see install) were corrected after the rule application: capitalising the interface 
text (OR.02) and correcting the word order (GR.10). In Lucy, the correction 
of these two error types after rule application was strongly correlated with the 
quality increase. In Bing, only the correction of the word order error proved to 
strongly correlate with the quality improvement. In the other systems, no correla-
tions between any of the error types and the quality could be detected.
For the second rule “fvg—Avoiding light-verb construction (Funktionsverb-
gefüge)”, the general positive impact on MT output at system level was as follows 
(Figs. 5 and 6): For the RBMT system (Lucy) and one hybrid system (Systran), 
this rule was very advantageous in reducing the number of errors and increasing 
SQ significantly. In the SMT system (SDL), the number of errors decreased sig-
nificantly, but the increase in SQ and CQ was not significant. In the second hybrid 
system (Bing), the number of errors decreased and the SQ and CQ increased; 
however, these changes were not significant. The human evaluators found that 
using the meaning-bearing verb (after CL) instead of the light verb construc-
tion (before CL) makes the translation easier to understand and stylistically more 
attention-grabbing. The NMT system (Google Translate) showed distinct
results: the number of errors was minimal (3 errors before CL, 1 error after 
CL). Google Translate translated 88% of the sentences correctly both before and
after the rule application (followed by Bing with 46%) and displayed the highest
SQ and CQ among all systems both before and after rule implementation.
“Avoiding light-verb construction” is primarily related to sentence seman-
tics. Since not all German light verbs have a counterpart in English, using the 
meaning-bearing verb (reinigen in Example 2, after CL) instead of the light-verb 
construction (Reinigung vornehmen, before CL) was associated with a correc-
tion of a number of semantic errors, particularly collocation (SM.13) and lexical 
errors. Lexical errors occurred when the systems translated the light verb literally 
(e.g. translating zur Verfügung stellen as represent available instead of provide). 
In Lucy, the correction of the semantic errors correlated with an increase in SQ 
and CQ. In Bing and SDL, a correlation was observed between the lexical errors 
(LX.03 and LX.04) and the quality. No further correlations were detectable in the 
other systems.
The application of Rule 3 “kos—Formulating conditions as ‘if’ sentences” was 
associated with a reduction in the number of errors in all systems except for the 
NMT system (Google Translate: 1 error before CL, 2 errors after CL), Fig. 7. The 
reduction in MT errors was only significant in one hybrid system (Bing) and the 
SMT system (SDL). Consequently, it was only in these two systems that a significant 
improvement in quality was achieved—in Bing both for the SQ and CQ, and in SDL 
only for the CQ, Fig. 8. This positive effect on the quality correlated strongly with 
the reduction in the error types LX.03 “Omission” and GR.10 “Wrong word order” 
(as noted in Example 3). In contrast, both SQ and CQ decreased in the RBMT sys-
tem (Lucy) and the other hybrid system (Systran) after rule application. In Google 
Translate, the percentage of correct MT before and after the rule application (Group 
RR) was 92%, followed by 71% in Lucy. Google Translate once again showed the
highest SQ and CQ in both scenarios.
Example 4  Rule “nsp—using unambiguous pronominal references”
Before CL Fettreste müssen vollständig abgewaschen werden, da sich diese ansonsten in der Pfanne einbrennen können
Grease residue must be completely washed off as it can otherwise burn in the pan
After CL Fettreste müssen vollständig abgewaschen werden, da sich diese Reste ansonsten in der Pfanne einbrennen können
Grease residue must be completely washed off as this remains can otherwise burn in the pan
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts
“Formulating conditions as ‘if’ sentences” mainly showed a positive lexical effect 
on the MT. Considering Example 3, the German verb can be used as a conditional 
word without the need for the conditional conjunction wenn (if). This is not the case 
in English, which was why omitting the conditional conjunction (before CL) was 
associated with two error types: the conditional conjunction if was missing (LX.03), 
and the verb used to formulate the conditional clause was placed incorrectly 
(GR.10), see is in Example 3. Therefore, a reduction in the number of errors in both 
error types was observed after the rule application. In Bing and SDL, the correction 
of these errors strongly correlated with the increase in quality. In the other systems, 
no correlations with specific error type were observed. After rule application, the 
evaluators found the MT more accurate, understandable, and idiomatic.
Rule 4 “nsp—Using unambiguous pronominal references” had a different impact 
on MT quality from one system to the other (Fig. 9). Only the RBMT system (Lucy) 
and the NMT system (Google Translate) exhibited significant quality changes: 
Lucy showed a slight increase in the SQ and a significant increase in the CQ, while 
Google Translate showed exactly the opposite—namely a significant decrease in the 
SQ and a slight decrease in the CQ. This could be explained by the various changes 
in the number of errors (Fig. 10). Google Translate, as opposed to the other systems, 
was mostly able to translate the pronouns correctly (before CL). However, using a 
pronominal reference (after CL) was stylistically criticised in some cases. Neverthe-
less, 83% of the translations by Google Translate were error-free before and after the 
rule application, followed by 67% in Lucy. In addition, Google Translate achieved 
the highest quality scores in both scenarios.
“Using unambiguous pronominal references” had two different effects on the MT: 
in Lucy, SDL, and Systran, the rule application was associated with a reduction in 
the “Confusion of sense” semantic error (SM.11). This error was especially appar-
ent in the translation of demonstrative pronouns (diese and dies), as the MT systems 
found difficulties in identifying the reference and translating it correctly. Accord-
ingly, after rule application, the human evaluators found the translation clearer.
However, the application of this rule was also associated with an increase in 
the lexical “Consistency” error (LX.06) in Bing and SDL, see Example 4. In 
order to implement this rule, a noun in the main clause should not be substituted 
by a pronoun in the subordinate clause; instead, a pronominal reference should 
be used (see Reste in Example 4). In some cases, the MT systems translated the 
second instance of the noun differently (residue in the main clause and remains in 
the subordinate clause), which resulted in a consistency error and hence reduced 
accuracy. However, a consistency error could be avoided if the terms used are 
maintained and managed in the system.
Example 5a  Rule “pak—avoiding participial constructions”
Before CL Durch Eingabe der mit einem roten Sternchen gekennzeichneten Parameter erfolgt die minimale Konfigurierung
By entering the marked with a red asterisk parameter, the minimum configuration is performed
After CL Durch Eingabe der Parameter, die mit einem roten Sternchen gekennzeichnet sind, erfolgt die minimale Konfigurierung
By entering the parameters that are marked with a red asterisk, the minimum configuration is performed
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts
Example 5b  Rule “pak—avoiding participial constructions”
Before CL Speziell auf diese Lautsprecher abgestimmtes Zubehör erhalten Sie in unserem Webshop
Special accessories for these speakers are available in our webshop
After CL Zubehör, das speziell auf diese Lautsprecher abgestimmt ist, erhalten Sie in unserem Webshop
Accessories, specially designed for these loudspeakers, are available in our webshop
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts
Rule 5 “pak—Avoiding participial constructions” had—in general—a negative 
effect on the SQ in all MT systems. Regarding the CQ, it only increased in two 
MT systems (Fig. 11): marginally in the SMT system (SDL) and significantly in 
one hybrid system (Systran). The number of errors increased in all systems except 
in SDL, where a slight decrease was found (Fig. 12). The NMT system (Google 
Translate) had no difficulty in translating participial constructions (only 4 errors 
before CL as opposed to 8 errors after CL). Furthermore, 71% of the translations 
by Google Translate were error-free before and after rule application, followed by 
only 29% in Bing. Google Translate also achieved the highest quality values in both scenarios. 
In all other systems, the number of errors was much higher both before and after 
rule application. Systran showed a significant increase in the number of errors 
after rule application.
Two different MT error types were associated with this rule: German participial 
constructions, especially lengthy ones, usually complicate sentence structure and 
consequently parsing, which results in word order errors (GR.10), see the participial 
construction der mit einem roten Sternchen gekennzeichneten Parameter in Example 
5a (before CL). This error type especially occurred in SDL and Bing in the 
translation of participial constructions and decreased after rule implementation.
Example 6  Rule “pas—avoiding passives”
Before CL Durch diese Öffnung kann der Stecker mit dem Regler verbunden werden
Through this opening, the plug can be connected to the controller
After CL Durch diese Öffnung können Sie den Stecker mit dem Regler verbinden
Through this opening, the plug you can connect to the controller
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts
Example 7  Rule “per—avoiding the construction sein + zu + infinitive”
Before CL Wenn ein mehrstufiges Modul parametriert ist, so sind die externen Kontakte zu verriegeln
If a multi-stage module is parameterized, the external contacts are to lock
After CL Wenn ein mehrstufiges Modul parametriert ist, verriegeln Sie die externen Kontakte
If a multi-stage module is parameterized, you lock the external contacts
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts
However, after implementing the rule, all systems had difficulty with the comma 
placement in subordinate clauses, specifically in cases in which a distinction 
between the relative pronouns which and that needed to be made. Therefore, the
number of punctuation errors (OR.01) increased. Consider Example 5b: unlike in
German, no commas are needed in the English translation after the rule application.
Nevertheless, it is important to note that the use of which vs. that is generally
controversial and that contextual information is usually required to decide on
correct usage.
Regarding Rule 6 “pas—Avoiding passives”, all MT systems—except for 
Bing—demonstrated an increase in the total number of errors after rule applica-
tion (Fig. 13). The two hybrid MT systems delivered contradictory results: in Bing, 
the total number of errors decreased significantly, while it increased significantly 
in Systran. In the other three systems, the increase in the number of errors was not 
significant. Bing and Google Translate were able to translate 71% of the sentences 
correctly in both passive and active voice, followed by 58% in Lucy. Both SQ and 
CQ decreased in all MT systems, except in Bing, where the CQ increased slightly 
(Fig. 14). In Systran, the SQ and CQ decrease was significant. In Lucy, the decrease 
in SQ was significant. Both before and after rule application, the highest SQ and CQ 
were seen in Google Translate. The rule application was associated with an increase 
in various error types across all systems; at the same time, none of the error types 
exhibited a significant increase after the rule application.
Example 6 shows how the use of active voice (after CL) was in some cases asso-
ciated with a word order error (GR.10) (in you can connect), while the passive voice 
(before CL) was correctly translated.
Rule 7 “per—Avoiding the construction sein + zu + infinitive” demonstrated a 
general positive impact on the MT output (Figs. 15 and 16): in one hybrid system 
(Bing) and the SMT system (SDL), grammatical difficulties were observed when 
the construction sein + zu + infinitive (before CL) was used, although an 
equivalent construction exists in English (verb to be + to + infinitive). Accordingly, the 
rule application was associated with a significant reduction in two grammatical 
errors: incorrect verb tense (GR.08) and incorrect word order (GR.10) (see verb 
composition error in are to lock in Example 7, before CL). In Bing and SDL, 
correcting these error types correlated strongly with an increase in quality values.
Example 8  Rule “prä—avoiding superfluous prefixes”
Before CL Schicken Sie das Gerät originalverpackt an unsere Serviceadresse ein
Please send the appliance in its original packaging to our service address one
After CL Schicken Sie das Gerät originalverpackt an unsere Serviceadresse
Please send the appliance in its original packaging to our service address
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts
Example 9  Rule “wte—avoiding omitting parts of the words”
Before CL Sogar Soja- und laktosefreie Milch lassen sich mit dieser Maschine perfekt aufschäumen
Even Soya- and lactose-free milk can be perfectly frothed with this machine
After CL Sogar Sojamilch und laktosefreie Milch lassen sich mit dieser Maschine perfekt aufschäumen
Even soya milk and lactose-free milk can be perfectly frothed with this machine
The CL position is presented in bold black. Italic is used for correct parts of the translation; underlining 
for the wrong parts
Table 2  Error classification applied in the annotation
Error category   Error no.   Error type
Orthography      OR.01       Punctuation error
Orthography      OR.02       Capitalisation error
Lexis            LX.03       Omission
Lexis            LX.04       Addition
Lexis            LX.05       Untranslated
Lexis            LX.06       Consistency error (a word is repeated in the sentence and translated differently each time)
Grammar          GR.07       Wrong word class
Grammar          GR.08       Wrong verb tense/composition/person
Grammar          GR.09       Wrong agreement gender/number/person
Grammar          GR.10       Wrong word order
Semantics        SM.11       Confusion of sense (the output translation is possible, but not in the given context)
Semantics        SM.12       Wrong choice (the output translation is apparently wrong)
Semantics        SM.13       Collocation error
Table 3  Overview of the dataset
Number of annotated MT sentences 2160 (100%)
Number of human-evaluated MT sentences 1550 (72%)
Number of excluded MT sentences 610 (28%)
Table 4  Representation of the annotation groups in the error annotation and human evaluation
                    FR      FF      RF      RR      Total
Error annotation    19.8%   25.8%   9.8%    44.5%   100.0%
Human evaluation    20.4%   24.7%   10.7%   44.2%   100.0%
Error annotation N = 2160 MT sentences; Human evaluation N = 1550 MT sentences
Table 5  Percentages of MT sentences of each system in the human evaluation
Bing    Google   Lucy    SDL     Systran   Total
18%     25%      20%     20%     17%       100.0%
Human evaluation N = 1550 MT sentences
Fig. 1  Interface of the human evaluation
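The error codes cited throughout this section (e.g. GR.08, GR.10) follow the error classification in Table 2. For readers who process such annotations programmatically, the classification can be held in a simple lookup table. The sketch below is only an illustration: the dictionary name and helper function are hypothetical, while the codes, categories, and error types are those listed in Table 2.

```python
# Error classification as listed in Table 2: code -> (category, error type).
ERROR_TYPES = {
    "OR.01": ("Orthography", "Punctuation error"),
    "OR.02": ("Orthography", "Capitalisation error"),
    "LX.03": ("Lexis", "Omission"),
    "LX.04": ("Lexis", "Addition"),
    "LX.05": ("Lexis", "Untranslated"),
    "LX.06": ("Lexis", "Consistency error"),
    "GR.07": ("Grammar", "Wrong word class"),
    "GR.08": ("Grammar", "Wrong verb tense/composition/person"),
    "GR.09": ("Grammar", "Wrong agreement gender/number/person"),
    "GR.10": ("Grammar", "Wrong word order"),
    "SM.11": ("Semantics", "Confusion of sense"),
    "SM.12": ("Semantics", "Wrong choice"),
    "SM.13": ("Semantics", "Collocation error"),
}

def describe(code: str) -> str:
    """Return a readable label for an error code (hypothetical helper)."""
    category, error_type = ERROR_TYPES[code]
    return f"{code} ({category}): {error_type}"

# The two grammatical error types reduced by rule "per" in Bing and SDL.
for code in ("GR.08", "GR.10"):
    print(describe(code))
```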
The NMT system (Google Translate) was able to correctly translate 96% of 
the sentences both before and after rule application, followed by 42% in Systran. 
Therefore, the quality of MT output in both scenarios was the highest with a min-
imal increase in the SQ and a minimal decrease in the CQ.
However, applying the CL rule using the imperative (instead of the con-
struction sein + zu + infinitive) was associated with the “Addition” lexical error 
(LX.04) (after CL) in the RBMT system (Lucy) and the other hybrid system 
(Systran), as in some cases, the MT systems wrongly added the subject you (see 
Example 7, after CL). In Systran, the correction of this lexical error correlated 
strongly with the increase in the quality values.
Fig. 2  Comparison of the annotation groups at CL rule level. The percentages displayed on the top of the 
bar are calculated based on the entire dataset of all rules (N = 1080). The percentages on the bottom are 
calculated at rule level (N = 120)
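Fig. 2 reports each annotation group with two percentages that differ only in their base: N = 120 at rule level and N = 1080 for the entire dataset of all rules. The short sketch below illustrates the conversion. The count of 37 is a hypothetical value, chosen only because it rounds to the 31% reported for the FR group of rule “anz”; the actual count is not stated in this section.

```python
PAIRS_PER_RULE = 120   # N at rule level (Fig. 2)
PAIRS_TOTAL = 1080     # N for the entire dataset of all rules (Fig. 2)

def group_percentages(count: int):
    """Return (share at rule level, share of the entire dataset) in percent."""
    return 100.0 * count / PAIRS_PER_RULE, 100.0 * count / PAIRS_TOTAL

# Hypothetical count, NOT reported in the study.
rule_share, dataset_share = group_percentages(37)
print(f"rule level: {rule_share:.1f}%, entire dataset: {dataset_share:.1f}%")
```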
Rule 8 “prä—Avoiding superfluous prefixes” was—in general—associated with 
a small number of errors both before (ranging between 4 errors in Google Translate 
and 12 errors in SDL) and after rule application (ranging between 3 errors in Bing 
and Google and 11 errors in SDL), Fig. 17. As these ranges show, the number of 
errors decreased after rule application. This reduction was only significant in one 
hybrid system (Bing). The SQ slightly improved in all systems except for Google 
Translate (Fig. 18). Also, the CQ increased minimally in Bing, Lucy, and Systran. 
In Google Translate, 88% of the sentences were correctly translated both before and 
after rule application, followed by 71% in Bing. The quality values in Google Trans-
late showed a minimal decrease after rule application (SQ − 0.06, CQ − 0.03); at 
the same time, they were the highest among all MT systems both before and after 
the rule implementation.
In all systems, except for Google Translate, avoiding superfluous prefixes sup-
ported correct parsing of the verb. In particular, German “separable verbs” (similar 
to phrasal verbs in English) were often difficult to parse; depending on the sentence 
structure, prefixes should sometimes be placed at the end of the sentence—far from 
the rest of the verb. In such cases, the systems translated the prefix additionally, 
independently of the verb. Thus, avoiding superfluous prefixes resulted in correcting 
the “Addition” lexical error (LX.04) (see one in Example 8, before CL), which in 
turn improved the accuracy of the translation.
Finally, Rule 9 “wte—Avoiding omitting parts of the words” was associated with 
a marginal decrease in the number of errors in Google Translate, Lucy, and Systran 
(Fig. 19). Due to the differences in the orthographic rules in German and English 
Table 6  Differences in style and content quality after the application of each rule at annotation group level
SQ style quality, CQ content quality, Q overall quality (mean of SQ and CQ); Diff. SQ = SQ before − SQ after CL application, similarly Diff. CQ and Diff. Q. White cells: insignificant values p ≥ 0.05; shaded cells: significant values p < 0.05. M mean
Bold shows rules that have a significant positive impact; italics show a significant negative impact; for the rest of the rules, the results were insignificant
Fig. 3  Rule “anz—using straight quotes for interface texts”—number of MT errors before and after rule application (bar chart; x-axis: Bing, Google, Lucy, SDL, Systran; series: sum of errors before CL and sum of errors after CL; significant values marked)
Fig. 4  Rule “anz—using straight quotes for interface texts”—style and content quality before and after rule application
with regard to hyphen usage, applying this rule was associated with a reduction in 
orthographic errors. Example 9 shows how the rule implementation was associated 
with the correction of punctuation and capitalisation errors (in Soya-).
However, the number of errors marginally increased in Bing and remained 
unchanged in SDL. Despite the different impacts on the number of errors, the 
SQ and CQ decreased in all systems except Systran, in which the CQ slightly 
increased (Fig. 20). The SQ decrease was significant in three systems (Google 
Fig. 5  Rule “fvg—avoiding light-verb construction”—number of MT errors before and after rule application (bar chart; x-axis: Bing, Google, Lucy, SDL, Systran; series: sum of errors before CL and sum of errors after CL; significant values marked)
Fig. 6  Rule “fvg—avoiding light-verb construction”—style and content quality before and after rule application
Translate, Lucy, and SDL), as the human evaluators found the noun repetition 
unnatural (see milk in soya milk and lactose-free milk in Example 9, after CL). 
Again, Google Translate showed the lowest number of errors (75% of the transla-
tions were error-free both before and after CL, followed by 46% in SDL) and the 
highest quality values in both scenarios.
Fig. 7  Rule “kos—formulating conditions as ‘if’ sentences”—number of MT errors before and after rule application (bar chart; x-axis: Bing, Google, Lucy, SDL, Systran; series: sum of errors before CL and sum of errors after CL; significant values marked)
Fig. 8  Rule “kos—formulating conditions as ‘if’ sentences”—style and content quality before and after rule application
5 Conclusion
The study aims to analyse and contrast the impact of individual CL rules on MT output
at different levels. In accordance with many previous studies, the results showed 
that CL application had in general, at a rule- and system-independent level, a significant
positive impact on the MT output in terms of reducing the number of errors and 
Fig. 9  Rule “nsp—using unambiguous pronominal references”—style and content quality before and after rule application
Fig. 10  Rule “nsp—using unambiguous pronominal references”—number of MT errors before and after rule application (bar chart; x-axis: Bing, Google, Lucy, SDL, Systran; series: sum of errors before CL and sum of errors after CL; significant values marked)
increasing the style and content quality as well as the scores of two AEMs (TERbase 
and hLEPOR).
A closer analysis of the individual impact of the rules, at a system-independent 
level, revealed that only the rules “anz—Using straight quotes for interface texts”, 
“per—Avoiding the construction sein + zu + infinitive”, “kos—Formulating conditions
as ‘if’ sentences”, and “fvg—Avoiding light-verb construction” positively 
Fig. 11  Rule “pak—avoiding participial constructions”—style and content quality before and after rule application
Fig. 12  Rule “pak—avoiding participial constructions”—number of MT errors before and after rule application (bar chart; x-axis: Bing, Google, Lucy, SDL, Systran; series: sum of errors before CL and sum of errors after CL; significant values marked)
affected the MT output (fewer errors and better SQ, CQ, and AEMs scores). These 
rules enabled better parsing, which in turn contributed to getting a more accurate, 
comprehensible, stylistic, and attention-grabbing translation. By contrast, the 
rule “pas—Avoiding passives” showed a significant negative impact on the SQ, 
CQ, and the AEMs scores. The human evaluators assessed the MT of the active 
voice to be less accurate and stylistically less adequate. In the rule “pak—Avoiding 
participial constructions”, the AEMs scores and the SQ deteriorated significantly, 
as the MT of participial constructions was evaluated as more idiomatic. In the rule 
Fig. 13  Rule “pas—avoiding passives”—number of MT errors before and after rule application (bar chart; x-axis: Bing, Google, Lucy, SDL, Systran; series: sum of errors before CL and sum of errors after CL; significant values marked)
Fig. 14  Rule “pas—avoiding passives”—style and content quality before and after rule application
“wte—Avoiding omitting parts of the words”, only the SQ and both AEMs scores 
decreased, which showed that the MT sounded unnatural after rule application. For 
the rules “nsp—Using unambiguous pronominal references” and “prä—Avoiding 
superfluous prefixes”, no significant impact was found.
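The rule-level findings summarised above can be condensed into a small lookup table, for instance as a starting point for deciding which rules to enable in an authoring tool. The Python sketch below is only an illustration: the dictionary and function names are hypothetical, and the three labels collapse the finer distinctions reported here (for example, “wte” lowered only the SQ and the AEM scores) into a single category per rule.

```python
# System-independent impact of each analysed CL rule, as reported in the
# conclusion. The labels are a simplification chosen for this sketch.
RULE_IMPACT = {
    "anz": "positive",  # using straight quotes for interface texts
    "per": "positive",  # avoiding the construction sein + zu + infinitive
    "kos": "positive",  # formulating conditions as 'if' sentences
    "fvg": "positive",  # avoiding light-verb construction
    "pas": "negative",  # avoiding passives
    "pak": "negative",  # avoiding participial constructions
    "wte": "negative",  # avoiding omitting parts of the words
    "nsp": "none",      # using unambiguous pronominal references
    "prä": "none",      # avoiding superfluous prefixes
}


def rules_with_impact(impact: str) -> list[str]:
    """Return the rule abbreviations that carry the given impact label."""
    return sorted(rule for rule, label in RULE_IMPACT.items() if label == impact)


print(rules_with_impact("positive"))
```

Such a table reflects the system-independent level only; as the next paragraph shows, the picture differs per MT system.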
A more detailed examination of the impact of each rule at MT system level 
showed that when earlier MT approaches (RBMT, SMT, and hybrid systems) 
were applied, the impact of the individual rules varied to a large extent from 
one approach to the other. Since not all CL rules have a definite positive impact, 
Fig. 15  Rule “per—avoiding the construction sein + zu + infinitive”—number of MT errors before and after rule application (bar chart; x-axis: Bing, Google, Lucy, SDL, Systran; series: sum of errors before CL and sum of errors after CL; significant values marked)
Fig. 16  Rule “per—avoiding the construction sein + zu + infinitive”—style and content quality before and after rule application
identifying effective rules in each implementation context (language pair, translation
direction, domain, and MT approach) is necessary. Limiting the number 
of rules applied to those that are effective can be beneficial in avoiding drawbacks
commonly associated with CL application (e.g. slowing down the authoring
process and impacting it excessively). Comparing the earlier MT approaches 
to the recent NMT approach, the results revealed that while earlier MT systems 
Fig. 17  Rule “prä—avoiding superfluous prefixes”—number of MT errors before and after rule application (bar chart; x-axis: Bing, Google, Lucy, SDL, Systran; series: sum of errors before CL and sum of errors after CL; significant values marked)
Fig. 18  Rule “prä—avoiding superfluous prefixes”—style and content quality before and after rule application
benefited in many cases from the CL rules in avoiding different MT errors and 
improving their output quality, the NMT system was able to translate most of the 
sentences before and after the application of all rules error-free (between 71% in 
the rules “pas” and “pak” and 96% in the rule “per”). Moreover, the NMT system 
recorded the highest style and content quality in both scenarios under all rules.
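The share of sentences translated error-free in both scenarios can be derived directly from per-sentence error annotations. The following Python sketch shows one way to compute it, assuming that each sentence has an annotated error count before and after CL application; the function name and the counts are hypothetical and are not data from the study.

```python
def error_free_share(errors_before: list[int], errors_after: list[int]) -> float:
    """Percentage of sentences with zero annotated errors both before
    and after CL application."""
    if len(errors_before) != len(errors_after):
        raise ValueError("both lists must cover the same sentences")
    if not errors_before:
        raise ValueError("at least one sentence is required")
    pairs = list(zip(errors_before, errors_after))
    clean = sum(1 for before, after in pairs if before == 0 and after == 0)
    return 100 * clean / len(pairs)


# Hypothetical error counts for eight sentences translated by one system.
before_cl = [0, 0, 1, 0, 2, 0, 0, 0]
after_cl = [0, 0, 0, 0, 1, 0, 1, 0]
print(error_free_share(before_cl, after_cl))  # 62.5
```

Note that a sentence counts only if it is error-free in both versions, which is a stricter criterion than being error-free in either scenario alone.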
Fig. 19  Rule “wte—avoiding omitting parts of the words”—number of MT errors before and after rule application (bar chart; x-axis: Bing, Google, Lucy, SDL, Systran; series: sum of errors before CL and sum of errors after CL)
Fig. 20  Rule “wte—avoiding omitting parts of the words”—style and content quality before and after rule application
6 Limitations and future work
This study has explored the impact of a limited number of CL rules (nine rules) 
on the machine translatability of five different MT architectures, including NMT, 
which—to the best of my knowledge—has not yet been examined. For the analysed 
rules, applying CL with the aim of enhancing the MT output is no longer necessary
when neural MT technology is used. However, the analysis within the scope of 
this study was carried out at sentence level, so it examined only CL rules applied 
within the sentence. An analysis at an extended level (e.g. cross-sentential or document
level) is of great interest in order to capture the impact of context-relevant CL 
rules (i.e. rules that affect several sentences) on the MT output. Contextual MT or 
MT at document level is one of the known challenging goals of MT (Zhang and 
Zong 2020). Recent NMT studies show that advances have already been made in 
the field of context-aware MT and MT at document level, even in the translation of 
literature, which is considered to be a particularly challenging domain for MT (cf. 
Toral and Way 2018; Matusov 2019). Furthermore, several studies have developed 
context-sensitive NMT models as well as strategies for NMT at document level, 
with which classic MT difficulties such as deixis, ellipses, co-reference resolution, 
coherence and lexical cohesion could be overcome (Müller et al. 2018; Stojanovski 
and Fraser 2018; Voita et al. 2018; Stojanovski and Fraser 2019; Voita et al. 2019). 
Based on the results of the present study as well as the rapid development progress 
of NMT, it can be expected that an application of CL for the purpose of machine 
translatability will be pushed into the background in the near future. To what extent 
CL can in the meantime support contextual MT or help to overcome other current 
NMT weaknesses is a question that needs to be answered empirically through the 
investigation of further CL rules across various NMT systems.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, 
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as 
you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons
licence, and indicate if changes were made. The images or other third party material in this article 
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the 
material. If material is not included in the article’s Creative Commons licence and your intended use is 
not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission 
directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Aikawa T, Schwartz L, King R, Corston-Oliver M, Lozano M (2007) Impact of controlled language on 
translation quality and post-editing in a statistical machine translation environment. In: Proceedings 
of the eleventh machine translation Summit 10–14 September, Copenhagen, Denmark, pp 1–7
Alonso Martin JA, Serra AC (2014) Integration of a machine translation system into the editorial process 
flow of a daily newspaper. Procesamiento Del Lenguaje Natural 53:193–196
Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation
with human judgments. In: ACL 2005, Proceedings of the workshop on intrinsic and extrinsic 
evaluation measures for MT and/or summarization at the 43rd Annual meeting of the association for 
computational linguistics, Ann Arbor, Michigan, pp 65–72
Bernth A (1999) Controlling input and output of MT for greater user acceptance. In: Proceedings of 
the 21st conference of translating and the computer sponsored by ASLIB, 10–11 November 1999, 
London
Bernth A, Gdaniec C (2001) MTranslatability. In: Machine translation, December 2001, vol. 16, no. 3. 
Kluwer Academic Publishers, Dordrecht, pp 175–218
Caterpillar Corporation (1974) Dictionary for Caterpillar Fundamental English. Caterpillar Corporation, 
Peoria
Drewer P, Ziegler W (2014) Technische Dokumentation. Eine Einführung in die übersetzungsgerechte 
Texterstellung und in das Content Management, 2nd edn. Vogel, Würzburg
Fiederer R, O’Brien S (2009) Quality and machine translation: a realistic objective? J Spec Transl 11:52–74
Gesellschaft für Technische Kommunikation – tekom e. V. (2013) Leitlinie “Regelbasiertes Schreiben, 
Deutsch für die Technische Kommunikation”. 2. Erweiterte Auflage. Stuttgart.
Gonzàlez M, Giménez J (2014) An open toolkit for automatic machine translation (meta-) evaluation. 
Technical manual v3.0. February 2014. Technical Report LSI-14-2-T. Departamento de Lenguajes y 
Sistemas Informáticos, Universitat Politècnica de Catalunya
Han ALF, Wong DF, Chao LS, He L, Lu Y, Xing J, Zeng X (2013) Language-independent model for 
machine translation evaluation with reinforced factors. In: Proceedings of the machine translation
summit XIV (MT SUMMIT 2013), International Association for Machine Translation, Nice, 
France, pp 215–222
Holmback H, Shubert S, Spyridakis JH (1996) Issues in conducting empirical evaluations of controlled
languages. In: Adriaens G, Havenith R (eds) Proceedings of the 1st international workshop
on controlled language applications (CLAW 1996), Leuven, Belgium, pp 166–177
Huijsen WO (1998) Controlled language: an introduction. In: Proceedings of the second controlled 
language application workshop (CLAW 1998), Pittsburgh, Pennsylvania, pp 1–15
Hutchins J, Somers HL (1992) An introduction to machine translation. Academic Press Limited, 
Cambridge
Kamprath C, Adolphson E, Mitamura T, Nyberg E (1998) Controlled language for multilingual document
production: experience with Caterpillar Technical English. In: Mitamura et al (eds) Proceedings
of the second international workshop on controlled language applications (CLAW 1998), 
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, pp 51–61
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan C, Shen W, Moran C, 
Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical
machine translation. In: Proceedings of the 45th annual meeting of the association for computational
linguistics companion volume proceedings of the demo and poster sessions, Prague, 
Czech Republic, pp 177–180
Lehrndorfer A, Reuther U (2008) Kontrollierte Sprache—standardisierte Sprache? In: Muthig J
(ed) Standardisierungsmethoden für die Technische Dokumentation. Schmidt-Römhild, 
Lübeck. (=tekom Hochschulschriften Nr.16), pp 97–121
Lommel A (2018) Metrics for Translation Quality Assessment: a case for standardising error typologies.
In: Doherty S, Castilho S, Moorkens J, Gaspari F (eds) Human and machine translation 
quality and evaluation—from principles to practice. Springer, Berlin, pp 109–127
Marzouk S, Hansen-Schirra S (2019) Evaluation of the impact of controlled language on neural 
machine translation compared to other MT architectures. Mach Transl 33(1–2):179–203
Marzouk S (in press) Sprachkontrolle im Spiegel der Maschinellen Übersetzung—Untersuchung zur 
Wechselwirkung ausgewählter Regeln der Kontrollierten Sprache mit verschiedenen Ansätzen 
der Maschinellen Übersetzung. Doctoral dissertation, Johannes Gutenberg University, 
Germersheim
Matusov E (2019) The challenges of using neural machine translation for literature. In: Proceedings 
of the workshop on qualities of literary machine translation, 17th MT Summit, Dublin, Ireland, 
pp 10–19
Mehta S, Azarnoush B, Chen B, Saluja A, Misra V, Bihani B, Kumar R (2020) Simplify-then-translate: 
automatic preprocessing for black-box translation. In: Proceedings of the 34th AAAI conference on 
artificial intelligence, New York, 8pp
Müller M, Rios A, Voita E, Sennrich R (2018) A large-scale test set for the evaluation of context-aware 
pronoun translation in neural machine translation. In: Proceedings of the third conference on 
machine translation: research papers, Brussels, Belgium, pp 61–72
Nyberg E, Mitamura T (1996) Controlled language and knowledge-based machine translation: principles 
and practice. In: Proceedings of the first controlled language application workshop (CLAW 1996), 
Leuven, Belgium, pp 74–83
Nyberg E, Mitamura T, Huijsen WO (2003) Controlled language for authoring and translation. In: Somers
H (ed) Computers and translation: a handbook, Benjamins Translation Library, vol 35. John Benjamins
Publishing Company, Amsterdam, Philadelphia, pp 71–110
O’Brien S (2006) Machine translatability and post-editing effort: an empirical study using translog and 
choice network analysis. PhD dissertation. Dublin City University, Ireland
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine 
translation. In: ACL-2002: 40th annual meeting of the association for computational linguistics, 
Philadelphia, PA, pp 311–318
Reuther U (2003) Two in one—can it work? Readability and translatability by means of Controlled Language.
In: Proceedings of the joint conference combining the 8th international workshop of the 
European Association for Machine Translation and the 4th controlled language applications workshop
(CLAW 2003), Dublin, Ireland, pp 124–132
Rösener C (2010) Computational linguistics in the translator’s workflow—combining authoring tools and 
translation memory systems. In: Proceedings of the NAACL HLT 2010 workshop on computational 
linguistics and writing: writing processes and authoring aids, Los Angeles, California, pp 1–6
Roturier J (2006) An investigation into the impact of controlled English rules on the comprehensibility, 
usefulness, and acceptability of machine-translated technical documentation for French and German 
users. PhD dissertation, Dublin City University, Ireland
Roturier J, Mitchell L, Grabowski R, Siegel M (2012) Using automatic machine translation metrics to 
analyze the impact of source reformulations. In: Proceedings of the tenth biennial conference of the 
association for machine translation in the Americas (AMTA-2012), San Diego, CA, 10pp
Snover M, Dorr BJ, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted
human annotation. In: AMTA 2006, Proceedings of the 7th conference of the association for 
machine translation in the Americas, Cambridge, MA, pp 223–231
Spyridakis JH, Holmback H, Shubert SK (1997) Measuring the translatability of simplified English in 
procedural documents. IEEE Trans Prof Commun 40(1):4–12
Stojanovski D, Fraser A (2018) Coreference and coherence in neural machine translation: a study using 
oracle experiments. In: Proceedings of the third conference on machine translation (WMT), volume 
1: research papers, Brussels, Belgium, pp 49–60
Stojanovski D, Fraser A (2019) Improving anaphora resolution in neural machine translation using curriculum
learning. In: Proceedings of the machine translation summit 2019, Dublin, Ireland, pp 140–150
Toral A, Way A (2018) What level of quality can neural machine translation attain on literary text? In: 
Moorkens J, Castilho S, Gaspari F, Doherty S (eds) Translation quality assessment: from principles 
to practice. Springer, Cham, pp 263–287
Vilar D, Xu J, D’Haro LF, Ney H (2006) Error analysis of machine translation output. In: LREC-2006: fifth 
international conference on language resources and evaluation, Proceedings, Genoa, Italy, pp 697–702
Voita E, Sennrich R, Titov I (2019) Context-aware monolingual repair for neural machine translation. 
In: Proceedings of the 2019 conference on empirical methods in natural language 
processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP),
Hong Kong, China, pp 877–886
Voita E, Serdyukov P, Sennrich R, Titov I (2018) Context-aware neural machine translation learns anaphora
resolution. In: Proceedings of the 56th annual meeting of the association for computational linguistics
(volume 1: long papers), Melbourne, Australia, pp 1264–1274
Werthmann A, Witt A (2014) Maschinelle Übersetzung—Gegenwart und Perspektiven. In: Stickel G (ed) 
Translation and interpretation in Europe. Contributions to the annual conference 2013 of EFNIL in 
Vilnius. Lang, Frankfurt am Main, pp 79–103
Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, 
Klingner J, Shah A, Johnson M, Liu X, Kaiser L, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, 
Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes 
M, Dean J (2016) Google’s neural machine translation system: bridging the gap between human and 
machine translation. CoRR abs/1609.08144
Zhang J, Zong C (2020) Neural machine translation: challenges, progress and future. arXiv:2004.05809v1
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published 
maps and institutional affiliations.
Authors and Affiliations
Shaimaa Marzouk1 
1 Johannes Gutenberg University Mainz, Mainz, Germany