Janet Harkness, Brita Dorer, and Peter Ph. Mohler, 2016


This chapter on translation assessment will consider different forms of qualitative and quantitative assessment related to translation and present the current state of research and relevant literature as available. It is useful to distinguish between procedures that assess the quality of translations as translations and those that assess how translated questions perform on questionnaire instruments. Survey instrument assessment must address both translation and performance quality (Harkness, Pennell, & Schoua-Glusberg, 2004).

Evaluations of the translations focus on issues such as whether the substantive content of a source question is captured in the translation, whether there are changes in pragmatic meaning (what respondents perceive as the meaning), and whether technical aspects are translated and presented appropriately (e.g., linguistic and survey appropriateness of response scales). Approaches combining translation, review, and adjudication, as part of the TRAPD model of translation, are seen to be the most useful ways to evaluate and improve translation quality and implicitly underscore the relationship between design and translation.

Assessments of performance can focus on how well translated questions work for the target population, how they perform in comparison to the source questionnaire, and on how data collected with a translated instrument compares with data collected with the source questionnaire. In the first case, assessment may indicate whether the level of diction is appropriate for the sample population, in the second, whether design issues favor one population over another, and in the third, whether response patterns for what is nominally “the same question” differ (or do not differ) in unexpected ways across instruments and populations.

Translation quality and performance quality are obviously linked, but good translation does not suffice to ensure that questions will function as desired in performance. Thus, well-translated questions may work better for an educated population than for a less well-educated population of the same linguistic group, either because the vocabulary is too difficult for the less well-educated or because the questions are less salient or meaningful for this group. Problems of question design, such as asking questions not salient to the target population, should be addressed at the questionnaire design level; they are difficult to resolve in terms of translation. As the testing literature points out, question formats also affect responses if the chosen format is culturally biased and more readily processed by respondents in one culture than in another (Geisinger, 1994; Solano-Flores & Nelson-Barber, 2001; Tanzer, 2005).

Assessment and evaluation of translation and performance quality assume that criteria of evaluation are available with which to assess the quality of given translation products and benchmarks and that standards exist against which translation products can be "measured". In the survey research field there is only limited consensus on what these criteria and benchmarks might be and what translations that meet these criteria might then look like.

However, items are measurement instruments in comparative survey research. It follows that the measurement properties of items must ultimately be comparable, within well-defined limits, across countries, cultures, or regions. A number of statistical methods allow the researcher to test for statistical comparability (also known as equivalence), ranging from Cronbach’s alpha to structural equation models (see Statistical Analysis; Braun & Johnson, 2010; van de Vijver & Leung, 1997). Within the Total Survey Error framework, other quality issues must also be dealt with (see below).
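As a rough illustration of the simplest of these statistics, the sketch below computes Cronbach’s alpha separately for a source-language and a target-language sample of the same multi-item scale. The response data and the names `cronbach_alpha`, `source_version`, and `target_version` are hypothetical and introduced only for this example. A clearly lower alpha in one version is only a prompt to inspect the translated items, not proof of a translation problem.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_vars = [pvariance([r[i] for r in rows]) for i in range(k)]
    # Variance of the summed scale score across respondents.
    total_var = pvariance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses to a three-item scale (not real survey data).
source_version = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
target_version = [[1, 3, 2], [2, 2, 4], [3, 5, 2], [4, 3, 5], [5, 4, 3]]

alphas = {
    "source": cronbach_alpha(source_version),
    "target": cronbach_alpha(target_version),
}
for version, alpha in alphas.items():
    print(f"{version}: alpha = {alpha:.2f}")
```

Alpha summarizes internal consistency within each version only; it does not by itself establish equivalence across versions, which is why the more demanding models mentioned above are used as well.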

The guidelines below include several different qualitative and quantitative approaches for translation assessment, identifying criteria of obvious relevance for survey translations and specifying which may or may not be of relevance in a given context. It is unlikely that any one project would employ all the techniques discussed; it is most appropriate for the topic and target population to guide researchers in choosing the most efficient methods of assessment.


Goal:  To assess whether the translation of the survey instrument in the target language accurately reflects all aspects of the source language instrument. The material will be divided into subsections as follows:

Assessment and survey translation quality

 Assessment and evaluation assume that criteria of evaluation are available with which to assess the quality of given translation products and benchmarks and that standards exist against which translation products can be "measured". In the survey research field there is only limited consensus on what these criteria and benchmarks might be and what translations that meet these criteria might then look like.

This section will deal with these issues. It will identify criteria of obvious relevance for survey translations and will identify others which may or may not be of relevance in a given context.


1. Assessment as part of team translation.


Qualitative assessment of initial translations as they are being developed is an integral and essential component of team translation procedures (see Translation: Overview).

Procedural steps

(See Translation: Overview.)

Lessons learned

1.1    The TRAPD model is one effective method of detecting translation errors. See Willis et al. (2010) for a discussion of the kinds of mistakes discovered at different stages of translation review in projects based on the TRAPD model.

2. Translation assessment using external translation assessors and verification procedures in a quality control framework paradigm.


Various models use external reviewers and external verification procedures in survey translation efforts. Some projects currently rely on external review teams to provide most of their assessment; others combine internal assessment procedures with outside quality monitoring.

The word “verification” in this context refers to a combination of checking the linguistic correctness of the target version and checking the “equivalence” of that target version against the source version. Here, “equivalence” refers to linguistic equivalence, including equivalence in quality and quantity of information contained in a stimulus or test item, as well as equivalence in register or legibility for a given target audience (Dept, Ferrari, & Wäyrynen, 2010). See Johnson (1998a) for more information.

The role of verifiers is to: (a) ensure linguistic correctness and cross-country equivalence of the different language versions of the source instrument; (b) check compliance with the translation annotations provided in the source questionnaire; (c) achieve the best possible balance between faithfulness and fluency; and (d) document all changes for all collaborating countries and any overall project or study coordinators. Verifiers should ideally have prior experience in verifying (or producing) questionnaire translations for other cross-cultural social surveys.

Procedural steps

2.1    An external translation verification firm (e.g., cApStAn) uses a monitoring tool - such as the Translation and Verification Follow-up Form (TVFF) used in the European Social Survey (ESS) - to assess translation and adaptation decisions and to ensure appropriate documentation (see Appendix A; see also Translation: Overview Appendix A for a discussion of the TVFF independent of its utility in assessment).

2.2    The verifier uses the TVFF (or a similar tool) to label each “intervention” (i.e., recommendation for change or other notation) as necessary for each survey item in question.

2.2.1   Examples of intervention categories are “minor linguistic defect”, “inconsistency in translation of repeated term”, “untranslated text”, “added information”, “annotation not reflected”, etc. See Appendix B for the complete list of intervention categories used in verification of translations of Round 6 of the ESS. See also the complete ESS Round 7 Translation Guidelines (European Social Survey, 2014).

2.3    The verifiers may prioritize their interventions using the TVFF (or a similar tool):

2.3.1   Interventions are categorized as “key” (an intervention that could potentially have an impact on how the questionnaire item works) or “minor” (a less serious intervention that could improve the translation).

2.3.2   This categorization can help translation adjudicators and other team members to identify which errors are more/less serious.

2.4    Alternatively, the verifiers may be asked to require follow-up by the national teams on all interventions, as is the case in ESS Round 7. The rationale is that no intervention should go without follow-up by the national teams; otherwise, important corrections might not be made when a national team does not see the need for them (European Social Survey, 2014).

2.5    The TVFF (or other documentation form used) is returned to the national team. Each notation by the verifier should be reviewed and any comments/changes/rejections of suggested changes should be marked accordingly. It may be advisable to require the national teams to get back to the verifiers in order to either confirm acceptance of the verification intervention or, in case these interventions are not incorporated, to justify this decision.
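The workflow in steps 2.2 to 2.5 can be sketched as a small data structure. This is not the actual TVFF layout (see Appendix A for that); the field names, the `Intervention` class, and the item identifiers below are hypothetical, chosen only to show how “key” and “minor” interventions and the national team’s responses might be tracked so that nothing is left without follow-up.

```python
from dataclasses import dataclass


@dataclass
class Intervention:
    item_id: str                  # questionnaire item the note refers to (hypothetical IDs)
    category: str                 # e.g. "untranslated text", "added information"
    severity: str                 # "key" or "minor", as in step 2.3.1
    verifier_comment: str
    team_decision: str = ""       # "accepted", "rejected", or "" while pending
    team_justification: str = ""  # expected when a suggestion is rejected


def needs_follow_up(iv):
    """True if the national team has not yet responded adequately."""
    if iv.team_decision == "":
        return True
    # A rejection without a stated reason still needs follow-up.
    return iv.team_decision == "rejected" and not iv.team_justification.strip()


def follow_up_queue(interventions):
    """Open interventions, with 'key' ones listed before 'minor' ones."""
    pending = [iv for iv in interventions if needs_follow_up(iv)]
    return sorted(pending, key=lambda iv: (iv.severity != "key", iv.item_id))


# Hypothetical verification log for four items.
log = [
    Intervention("Q3", "minor linguistic defect", "minor", "Spelling error."),
    Intervention("Q1", "untranslated text", "key", "Show card left in English.",
                 team_decision="accepted"),
    Intervention("Q2", "added information", "key", "Example added to stem.",
                 team_decision="rejected"),
    Intervention("Q4", "annotation not reflected", "key", "Annotation ignored."),
]

queue_ids = [iv.item_id for iv in follow_up_queue(log)]
print(queue_ids)
```

Sorting “key” interventions first mirrors the prioritization in step 2.3; under the ESS Round 7 approach in step 2.4, every entry in the queue would have to be resolved regardless of severity.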

Lessons learned

2.1    The purpose of documenting adaptations and other issues in the TVFF is not only to record such issues but also to provide the external verifier with all the relevant background information s/he will need for the verification assignment, to avoid unnecessary comments and changes, and to be as time-efficient as possible.

2.2    The requirement that national teams record in the TVFF whether or not they incorporate each verification intervention provides better control of how verifiers’ suggestions are implemented. In addition, the different loops between the verifiers, national teams, and translation experts within the survey may trigger interesting discussions about translation and verification issues.

2.3    Recent use of the verification system by cApStAn in ESS translation assessments has found that verification:

2.3.1   Enhances understanding of translation issues for:

  • The ESS translation team for languages they do not understand;
  • National teams when choosing a translation by encouraging reflection on choices made;
  • Source question designers, enabling them to have a better understanding of different country contexts and issues in translation.

2.3.2   Enhances equivalence with source questionnaire and across all language versions, especially for problematic items.

2.3.3   Gives the ESS translation team a better idea of translation quality/efforts/problems in participating countries.

2.3.4   Prevents obvious mistakes, which otherwise would lead to non-equivalence between countries, from being fielded.

2.4    Systematic external verification streamlines overall translation quality.


3. Translation assessment using Survey Quality Predictor Software (SQP) coding


SQP can be used to prevent deviations between the source questionnaire and the translated versions by checking the formal characteristics of the items. SQP coding is meant to improve translations by making target country collaborators more aware of the choices that are made in creating a translation, and of the impact these choices have on the comparability and reliability of the question. The ESS has been using SQP coding as an additional step of translation assessment since Round 5 (European Social Survey, 2012).

Procedural steps

3.1    Provide each study country team with access to the SQP (Saris et al., 2011) coding system.

3.2    A team member from each study country uses the SQP program to provide codes for each item in the target country’s translated questionnaire.

3.2.1   SQP codes refer to formal characteristics of items including:

  • Characteristics of the survey question, including the domain in which the variable is operating (e.g., work, health, politics, etc.), the concept it is measuring (e.g., feeling, expectation, etc.), whether social desirability bias is present, the reference period of the question (past, present, future), etc.
  • The basic response or response scale choices (e.g., categories, yes/no scale, frequencies, level of extremeness, etc.).
  • The presence of optional components: instructions for interviewers, instructions for respondents, definitions, additional information, and motivation.
  • The presence of an introduction and its linguistic characteristics, such as the number of sentences, words, nouns, adjectives, subordinate clauses, etc.
  • Linguistic characteristics of the survey question.
  • Linguistic characteristics of the response scale.
  • The characteristics of the show card, if used.

3.3    SQP coding can also be used in the process of designing the source questionnaire.

3.4    The team dealing with SQP coding will then compare the SQP codes in the target language(s) and the source language.

3.4.1   Differences in SQP coding resulting from mistakes should be corrected.

3.4.2   No action is needed for true differences that are unavoidable (e.g., the number of words in the introduction).

3.4.3   True differences that may or may not be justified necessitate discussion between the central team and the national team, with possible change in translation necessary.

3.4.4   True differences that are not warranted (e.g., a different number of response categories between the source and target language versions) require an amendment to the translation as submitted.
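The comparison in step 3.4 can be sketched as follows. This does not use the SQP program or its real codebook: the characteristic names and the sets `UNAVOIDABLE` and `NOT_WARRANTED` are hypothetical stand-ins, and the sketch assumes coding mistakes (step 3.4.1) have already been corrected, so that every remaining difference is a true difference.

```python
# Hypothetical characteristic names; the real SQP codebook defines its own.
UNAVOIDABLE = {"intro_word_count"}         # step 3.4.2: no action needed
NOT_WARRANTED = {"n_response_categories"}  # step 3.4.4: translation must change


def compare_codes(source, target):
    """Map each differing characteristic to (source code, target code, action)."""
    report = {}
    for characteristic in sorted(set(source) | set(target)):
        s = source.get(characteristic)
        t = target.get(characteristic)
        if s == t:
            continue
        if characteristic in UNAVOIDABLE:
            action = "no action"
        elif characteristic in NOT_WARRANTED:
            action = "amend translation"
        else:
            # Step 3.4.3: may or may not be justified.
            action = "discuss with central team"
        report[characteristic] = (s, t, action)
    return report


# Hypothetical codes for one item in the source and one target version.
source_codes = {"domain": "politics", "n_response_categories": 5,
                "intro_word_count": 12, "reference_period": "present"}
target_codes = {"domain": "politics", "n_response_categories": 4,
                "intro_word_count": 15, "reference_period": "past"}

report = compare_codes(source_codes, target_codes)
for characteristic, (s, t, action) in report.items():
    print(f"{characteristic}: source={s}, target={t} -> {action}")
```

In practice, which differences count as unavoidable or unwarranted is a judgment made by the central and national teams, not a fixed lookup as in this sketch.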

Lessons learned

3.1    In Round 5 of the ESS, SQP coding produced valuable information that made it possible to detect deviations in translations that, had they gone undetected, would have affected the quality of the items as well as the design of experiments (European Social Survey, 2012).

3.2    See ESS Round 6 SQP Guidelines (European Social Survey, 2012, November 6a) and Codebook (European Social Survey, 2012, November 6b) for further detail.


4. Translation assessment using focus groups and cognitive interviews with the target population.

Various pretesting methods using both focus groups and cognitive interviews can be used to gain insight into the appropriateness of language used in survey translations.

Procedural steps

4.1    Focus groups can be used to gain target population feedback on item formulation and how questions are perceived (Schoua-Glusberg, 1988). They are generally not suitable for assessment of entire (lengthy) questionnaires. To optimize their efficiency, materials pertinent to many items can be prepared (fill-in-the-blank, multiple choice, etc.) and participants asked to explain terms and rate questions on clarity. At the same time, oral and aural tasks are more suitable than written tasks when target population literacy levels are low or when oral/aural mode effects are of interest.

4.2    Cognitive interviews allow for problematic issues to be probed in depth, and can identify terms not well understood across all sub-groups of the target population.

4.3    Protocols should be developed and documented for all types of pretests, with particular care toward designs to investigate potentially concerning survey items (see Pretesting).

4.4    Interviewer and respondent debriefings can be used after all types of pretests, with full documentation of debriefing, to collect feedback and probe comprehension of items or formulations.

Lessons learned

4.1    Focus groups and cognitive interviews are useful for assessing questions in subsections of the target population. For example, focus groups conducted to validate the Spanish translation of the U.S. National Health and Sexual Behavior Study (NHSB) revealed that participants did not know terms related to sexual organs and sexual behaviors considered unproblematic up to that point (Schoua-Glusberg, 1988).

4.2    Interviewer and respondent debriefing sessions are valuable opportunities for uncovering problematic areas in translations. Debriefing sessions for the 1995 ISSP National Identity module in Germany revealed comprehension problems with terms covering ethnicity and confirmed cultural perception problems with questions about “taking pride” in being German (Harkness et al., 2004).

4.3    Tape recording of any pretesting allows for behavioral coding for particular questions of interest.

4.4    If computer-assisted pretesting is used, paradata, such as time stamps and keystroke data, can be used to identify items that are disrupting the flow of the interview, and may be due to translation issues (Kreuter, Couper, & Lyberg, 2010).
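One way such time-stamp paradata might be screened is sketched below. The comparison against source-language timings, the threshold of 1.5, and all names and durations are assumptions made for this illustration; they are not taken from Kreuter, Couper, and Lyberg (2010). Because languages differ in length and reading speed, a flagged item is only a candidate for closer review.

```python
from statistics import median


def slow_items(source_secs, target_secs, ratio=1.5):
    """Items whose median duration in the translated version is at least
    `ratio` times the median duration in the source version."""
    flagged = []
    # Only items timed in both versions can be compared.
    for item in sorted(source_secs.keys() & target_secs.keys()):
        src = median(source_secs[item])
        tgt = median(target_secs[item])
        if src > 0 and tgt / src >= ratio:
            flagged.append(item)
    return flagged


# Hypothetical per-respondent item durations in seconds from two pretests.
source_timings = {"Q1": [10, 12, 11], "Q2": [8, 9, 10]}
target_timings = {"Q1": [11, 12, 13], "Q2": [20, 18, 25]}

flagged = slow_items(source_timings, target_timings)
print(flagged)
```

Medians are used rather than means so that a single interrupted interview does not dominate the comparison.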


5. Translation assessment using quantitative analyses. 


Textual assessment of translation quality does not suffice to indicate whether questions will actually function as required across cultures; statistical, quantitative analyses are required to investigate the measurement characteristics of items and to assess whether translated instruments perform as expected. The central aim is to detect biases of different types that distort measurement systematically. Statistical tests can vary depending on the characteristics of an instrument, the sample sizes available, and the focus of assessment (for general discussion, see Geisinger (1994), Hambleton (1993), Hambleton, Merenda, & Spielberger (2005), Hambleton & Patsula (1998), van de Vijver (2003), van de Vijver & Hambleton (1996), and van de Vijver & Leung (1997)).

Procedural steps

5.1    Variance analysis and item response theory can be used to explore measurement invariance and reveal differential item functioning, identifying residual translation issues or ambiguities overlooked by reviewers (Allalouf, Hambleton, & Sireci, 1999; Budgell, Raju, & Quartetti, 1995; Hulin, 1987; Hulin, Drasgow, & Komocar, 1982).
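The idea behind differential item functioning can be shown with a much simpler stand-in than the item response theory or variance-analysis procedures cited above: respondents from the two language versions are matched on their total scale score, and the item is compared only within matched strata. The function name, record layout, and data below are hypothetical, and the index is a descriptive screening aid under these assumptions, not a significance test.

```python
from collections import defaultdict
from statistics import mean


def stratified_gap(records, item):
    """Average target-minus-source difference on `item` among respondents
    with the same total score, weighted by target-group stratum size."""
    strata = defaultdict(lambda: {"source": [], "target": []})
    for r in records:
        strata[r["total"]][r["group"]].append(r[item])
    weighted_sum, n_target = 0.0, 0
    for cell in strata.values():
        # Only strata observed in both groups allow a matched comparison.
        if cell["source"] and cell["target"]:
            weight = len(cell["target"])
            weighted_sum += weight * (mean(cell["target"]) - mean(cell["source"]))
            n_target += weight
    if n_target == 0:
        raise ValueError("no total-score stratum contains both groups")
    return weighted_sum / n_target


# Hypothetical records: language version, total scale score, score on item Q5.
records = [
    {"group": "source", "total": 10, "Q5": 3},
    {"group": "source", "total": 10, "Q5": 3},
    {"group": "target", "total": 10, "Q5": 2},
    {"group": "target", "total": 10, "Q5": 2},
    {"group": "source", "total": 15, "Q5": 4},
    {"group": "target", "total": 15, "Q5": 3},
]

gap = stratified_gap(records, "Q5")
print(f"matched target-minus-source gap on Q5: {gap:.2f}")
```

A gap near zero means that respondents at the same overall level answer the item similarly in both versions; a sizeable gap singles out the item for review of its translation.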

5.2    Factor analysis (adapted for comparative analyses: exploratory factor analysis or confirmatory factor analysis) and multidimensional scaling can be used to undertake dimensionality analyses (Fontaine, 2003; Reise, Widaman, & Pugh, 1993; van de Vijver & Leung, 1997). See the Statistical Analysis chapter for more information.

5.3    For the evaluation of individual items, item bias can be estimated using multitrait-multimethod (MTMM) procedures, as described in Saris (2003) and Scherpenzeel and Saris (1997).

Lessons learned

5.1    Some procedures, like SQP as used in the ESS (Saris et al., 2011), rely on intensive analyses of a collection of questions (like a corpus in linguistics). However, the questions accepted as input to the corpus were not systematically evaluated using standard quality inspections, such as checking for double-barreled questions, double negation, or response scales that do not fit the question. Thus, the scores obtained might be biased, and researchers should use such systems with caution.

5.2    Where scores are relevant (e.g., in credentialing tests), a design is needed to link scores on the source and target versions (Geisinger, 1994).

5.3    The emphasis placed on quantitatively assessing translated instruments and the strategies employed differ across disciplines.

5.3.1   Instruments that are copyrighted and distributed commercially (as in health, psychology, and education) are also often evaluated extensively in pretests and after fielding.

5.3.2   Some quantitative evaluation strategies call for a large number of items (e.g., item response theory) and are thus unsuitable for studies that tap a given construct or dimension with only one or two questions.

5.3.3   Small pretest sample sizes may rule out strategies such as multidimensional scaling and factor analysis.

5.3.4   Some assessment techniques are relatively unfamiliar in the social sciences (e.g., multitrait-multimethod (MTMM) analysis).

5.4    Post hoc analyses that examine translations on the basis of unexpected response distributions across languages are usually intended to help guide interpretation of results, not translation refinement. Caution is required in using such procedures for assessment because bias may also be present when differences in univariate statistics are not.

5.5    For multi-wave studies, document any post hoc analyses for consideration when carrying out future translations.
