Peter Granda and Emily Blasczyk, 2016
 

Introduction

Harmonization refers to all efforts that standardize inputs and outputs in multinational, multicultural, or multiregional surveys, which we refer to as “3MC” surveys.

Harmonization is a generic term for procedures used predominantly in official statistics that aim at achieving, or at least improving, the comparability of different surveys and measures collected. The term is closely related to that of standardization (see Sample Design and Questionnaire Design).  Harmonizing procedures may be applied in any part of the survey lifecycle, such as study design, choice of indicators, question wording, translation, adaptation, questionnaire design, sampling, data collection, data coding, data editing, or documentation. The need to harmonize arises for all 3MC surveys. This is particularly true if the goal is to combine the data into a single integrated dataset.

There are two general approaches for harmonizing data: input harmonization and output harmonization:

  1. Input harmonization aims to achieve standardized measurement processes and methods in all national or regional populations included in the 3MC survey. Comparability can be realized through standardization of definitions, indicators, classifications, training, and technical requirements.
  2. Output harmonization begins with different national or regional measurements, possibly derived from non-standardized measurement processes. These measurements are “mapped” into a unified measurement scheme. Thus, only the statistical outputs are specified; the individual countries/regions may decide how to collect and process the data necessary to achieve the desired outputs (Granda, Wolf, & Hadorn, 2010; Lyberg & Stukel, 2010; Multinational Time Use Study [MTUS], 2014; Doiron, Raina, L’Heureux, & Fortier, 2012). It is also possible to incorporate output harmonization in the original planning to produce datasets for 3MC research, as the Luxembourg Income Study has done for many years with both individual- and household-level data collected from many countries since 1983 (http://www.lisdatacenter.org/).


Guidelines

Goal: To ensure that survey and statistical research teams follow accepted standards when creating harmonized data and documentation files, and use a harmonization strategy that best fits their basic source materials and the objectives they wish to achieve.

1. Decide what type of harmonization strategy to employ, taking into account that many harmonization efforts will require some combination of strategies.

Consider “input” harmonization when the survey process is centrally coordinated.

Rationale

“Input” harmonization, usually applied in a multi-national context, seeks to impose strict standards and protocols from the beginning for the whole survey process by which each national survey applies the same survey procedures and a common questionnaire (see Sample Design and Translation). Also known as “prospective”, this strategy is meant to assure a high degree of comparability. Some adaptations may occur for individual data collection sites, but the goal is to maintain comparability (Doiron et al., 2012).

Procedural steps

1.1    Provide detailed specifications, protocols, and procedures for all aspects of the survey process. The different specifications (Data Protocol, Sampling, Translation, etc.) of the European Social Survey (ESS) and the Demographic and Health Surveys (DHS) Toolkit are good examples (ESS, 2010; The Demographic and Health Surveys Toolkit, 2014).

1.2    Decide what items to standardize.

1.3    Consider whether variations may be necessary to account for site-specific interests. These can arise from either site-specific research foci or resource limitations (Doiron et al., 2012).

Consider “output” harmonization, also known as retrospective harmonization, when the survey collection process is largely determined at the level of individual countries or cultures and there is minimal or no agreement on standardization.

Rationale

This type of harmonization is implemented through two main strategies: “ex-ante” and “ex-post.” In practice, a study may utilize both strategies.

Ex-ante refers to measurements designed to be comparable and harmonized in data processing. When comparability has been considered during survey planning, the understanding of concepts, common goals, and specific targets can be established for the data collection process. The precise wording of the survey items may vary but the items seek to capture the same concept (see Questionnaire Design and Adaptation).

The second variant is an ex-post strategy, by which statistical or survey data are deemed inferentially equivalent and made comparable after the fact through a conversion process (Fortier et al., 2011). The items to be harmonized were not designed to be comparable, but are assessed and edited to achieve commonality. An ex-post strategy can be used in situations where existing repositories will be exploited for comparative research or where intensive early planning is not possible because of financial or policy constraints.

Procedural steps

1.4    Use an ex-ante strategy whenever possible. This enhances comparability since harmonization is addressed at the planning stage of each national data collection, as well as at the end of the process when creating harmonized data files.

1.5    Implement an appropriate planning process.

1.6    Use an ex-post strategy only if no consideration regarding harmonization has been given by data collectors at the start of data collection(s), but researchers later believe (e.g., because of common concepts or similar questions across surveys) that a harmonized data file can be produced through a conversion process to create comparable variables or statistics. The Integrated Public Use Microdata Series, International (IPUMS-I) and the Integrated Fertility Survey Series (IFSS) are two such examples (www.international.ipums.org; IFSS, 2014).

1.6.1   For any ex-post plan, ensure that data access, intellectual property, and any other ethical or legal issues are resolved for all intended source studies prior to beginning harmonization with the source in question (Fortier et al., 2011). Even if study investigators have their data publicly available, it is advisable to obtain permission from them if planning to harmonize their data with other datasets. An individual study’s data use agreement may not apply, and a formal request to the respective research ethics or data access committees may be necessary (Doiron et al., 2012).

1.7    Record all decisions about the “conversion” process systematically. One option is to use two separate databases to record all work: a production database which stores the original and harmonized materials, and a user’s database which provides the analysts access to the overall process.

1.8    Make provisions so that all data conversions can be traced back to the original data.

1.9    For any output harmonization technique, adopt a detailed “data processing plan” that includes descriptions of how the producer(s) of the harmonized data deal with the following:

1.9.1   Differences in study design, such as panel or cross-sectional design, and/or in mode of data collection.

1.9.2   Differences across studies with regard to what is measured (e.g., definitions of study population, concepts, variables).

1.9.3   Differences in how to measure (e.g., scale of measurement, wording and routing of questions, respondents asked).

1.9.4   Differences in how estimates are generated (imputation, weighting, or nonresponse adjustments).

1.9.5   Procedures used to create and define harmonized variables, including any harmonized weights calculated.
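Steps 1.7 and 1.8 can be illustrated with a minimal Python sketch, assuming a simple code-to-code recode. All names used here (ConversionRule, apply_rule, marstat, h_marital, study_a) are hypothetical and do not come from any of the projects cited above. The sketch shows only one way to keep each harmonized value together with its source value and the identifier of the rule that produced it, so that a conversion can be traced back to the original data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionRule:
    """One documented decision: how a source variable maps to a target variable.

    Hypothetical structure for illustration only.
    """
    rule_id: str
    source_study: str
    source_var: str
    target_var: str
    mapping: dict  # source code -> target code


def apply_rule(rule, records):
    """Return harmonized rows that keep the original value and the rule used."""
    rows = []
    for rec in records:
        source_value = rec[rule.source_var]
        rows.append({
            "study": rule.source_study,
            "case_id": rec["case_id"],
            # None marks a source code the rule does not cover.
            rule.target_var: rule.mapping.get(source_value),
            "source_var": rule.source_var,
            "source_value": source_value,
            "rule_id": rule.rule_id,
        })
    return rows


# Hypothetical rule and data: two source codes collapse into "married".
rule = ConversionRule(
    rule_id="R001",
    source_study="study_a",
    source_var="marstat",
    target_var="h_marital",
    mapping={1: "married", 2: "married", 3: "not_married"},
)
rows = apply_rule(rule, [{"case_id": 10, "marstat": 2},
                         {"case_id": 11, "marstat": 9}])
```

Because the source value travels with the harmonized value, a recode that later turns out to be wrong can be redone from the stored row without returning to the raw file. Whether this is stored in one database or in the two-database arrangement mentioned in step 1.7 is a separate design decision.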

Lessons learned

1.1    Input harmonization involves adherence to appropriately standardized methodologies throughout the survey lifecycle. For example, the ESS seeks to collect data every other year, uses face-to-face interviews, aims to collect high-precision data, applies detailed sampling and fieldwork protocols, uses standardized translation protocols in all participating countries, aims to achieve standardized response rates, adopts consistent coding procedures, and creates and distributes well-documented datasets in a timely fashion. All of these procedures require greater organizational capabilities and resources throughout the planning and data collection stages. The results are transparent and of high quality, and can produce more valuable public-use data files at the end.

1.2    Not all comparative research will be able to follow the same procedures, so it is important to decide which methods are best, given the actual resources, survey process structure, and the intended level of precision. In addition, the creation of such common standards and their implementation at the local level requires considerable expertise. This also may not be available in all 3MC contexts. The Generations and Gender Programme is a large, longitudinal 3MC survey that studies relationships between parents and children and also between partners. It is conducted using both a paper-and-pencil instrument (PAPI) and computer-assisted personal interviewing (CAPI), and seeks to follow consistent harmonization practices. While much harmonization work occurs centrally, individual country teams are urged to follow certain procedures to improve comparability. This method requires considerable coordination among components of the survey teams at all levels (Kveder & Galico, 2014).

1.3    Flexibility can be designed in. Research sites in different countries may not be able to follow the same procedures, so it is important to decide which methods can be adapted and to define procedures for adapting them, given the actual resources, survey process structure, and the intended level of precision. For example, the Malaria Indicator Survey is an optional component that can be conducted with or without biomeasure collection (www.dhsprogram.com). The creation of such common standards and their implementation at the local level requires considerable expertise. This also may not be available in all 3MC contexts.

1.4    In a working paper, Roland Gunther describes in detail the harmonization efforts surrounding the European Community Household Panel (ECHP) (Gunther, 2003). This survey was designed from the beginning to use input harmonization, with its design of uniform questionnaires as well as detailed definitions, rules, procedures, and models to make comparability across nations easier, and is exemplary of the use of input harmonization. After the first phase of the project, a few countries decided to cease collecting national samples for the ECHP, and instead to conduct their own national surveys, resulting in the need to do ex-post harmonization. Those doing the harmonization work learned that this kind of ex-post harmonization was resource-intensive and required staff experienced in both the original source and target formats of the ECHP framework. They also had to know in detail how their national questionnaires differed. Common problems included concepts heavily affected by national contexts, as well as differences in scales of measurement, variable coding schemes, and definitions of concepts. Solutions to such problems were often found through ad hoc decisions about recoding, combining, or collapsing variables, and almost never through estimation techniques.

1.5    These harmonization strategies are almost never applied exclusively to any single statistical or survey data collection. Depending on specific cultural and national characteristics, data producers should consider strategies that will enable them to collect their data in the most efficient manner. In some situations, they may want to combine strategies. For example, data producers may start with an input harmonization plan, but should be prepared to do some ex-post output harmonization to account for differences across cultures. The Demographic and Health Surveys program, for instance, has standardized instruments but also provides a Standard Recode Manual (DHS, 2015).

1.6    Health researchers, in particular, emphasize the importance of ex-post output harmonization. Because national governments and individual investigators generate a large volume of datasets that affect public health policies, there is a strong desire to pool cases cross-nationally to increase sample sizes. To ensure comparability, investigators involved in this process developed a very systematic approach to harmonization and encouraged its use throughout relevant research communities.

1.7   Output harmonization projects also generate copious amounts of metadata describing the source variables, target variables, and the entire harmonization process. This new metadata provides researchers with opportunities to analyze this information and create additional linkages. For example, individual variables can be grouped into substantive categories or concepts to enhance the analytical power of a new, harmonized dataset.

 

2. When deciding which variables to harmonize, create an initial plan and define clear objectives about what you want to achieve. The plan should include making all data conversions reversible.

Rationale

Creating a harmonization plan from the beginning of the project allows data producers to document all of their decisions at the time they are made. In case errors occur or are identified by users at a later time, all data conversions should be reversible.

Procedural steps

2.1    Before fieldwork, consult with experts or an advisory committee on a systematic design process, and with methodology groups to investigate comparability issues. If pre-fieldwork coordination is not possible, form an advisory committee of researchers knowledgeable about the subject matter at the beginning of the harmonization process and consult with them regularly.

2.2    Show the advisory group results of the harmonization process at different points in the process to allow for possible changes in rules used to create new variables.

2.3    Consider establishing a testing group of users knowledgeable about the subject matter separate from the harmonization process, who provide feedback on the analytic usefulness of the data before they are released publicly.

2.4    Implement a systematic conversion creation process with appropriate quality controls.

2.5    Identify and become familiar with software tools that facilitate a comparison of variables from different surveys, in order to determine if and how these could be harmonized. Such tools often work from a common database that stores all the information about each variable.

2.6    Establish partnerships with producers of harmonization tools. This may be more beneficial than creating new tools, which often requires costly programming efforts.

2.7    Where software tools are unavailable or impractical, use manual comparisons in making harmonization decisions and consult with substantive and methodological experts in doing so.

2.8    Identify and become familiar with interactive documentation tools that allow for proper and transparent documentation. For example, the DataSHaPER (http://www.p3g.org/biobank-toolkit/datashaper) and Opal (https://www.maelstrom-research.org/what-we-offer/open-source-software) are tools designed to harmonize epidemiological data.
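The kind of variable comparison described in steps 2.5 and 2.7 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the behavior of any tool named above: the function compare_code_lists and the two code lists are hypothetical, and matching labels after trimming whitespace and lowercasing is a deliberately crude stand-in for the expert judgment the guidelines call for.

```python
def compare_code_lists(codes_a, codes_b):
    """Compare the category labels of one variable as coded in two surveys.

    Hypothetical helper: labels are matched after trimming and lowercasing.
    """
    labels_a = {label.strip().lower() for label in codes_a.values()}
    labels_b = {label.strip().lower() for label in codes_b.values()}
    return {
        "shared": sorted(labels_a & labels_b),
        "only_a": sorted(labels_a - labels_b),
        "only_b": sorted(labels_b - labels_a),
    }


# Hypothetical code lists for an employment-status variable in two surveys.
survey_a = {1: "Employed", 2: "Unemployed", 3: "Retired"}
survey_b = {1: "employed", 2: "Unemployed ", 3: "Student"}

report = compare_code_lists(survey_a, survey_b)
```

Categories that appear in only one survey are the ones that need a documented decision, for example collapsing them into a residual category or flagging the variable as only partially harmonizable.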

Lessons learned

2.1    Realize that not all concepts measured in the survey process are equally amenable to harmonization efforts. For example, cross-national harmonization of the number of births and marriages is a far easier task than comparisons of divorce rates where local laws, customs, and data collection methods may differ substantially. Other concepts, such as international population migration, may not, due to a lack of precise definition and great variety in measurement criteria, lend themselves to harmonization at all, or only at the most basic level. Three characteristics that could influence harmonization potential are (i) the relative importance to the research intending to use the harmonized items, (ii) the individual the item targets (for example the participant, the participant’s family members), and (iii) the period of time to which the variable refers (Fortier et al., 2011).

2.2    Good decision-making about the harmonization process will benefit from the use of software tools, as well as input from a diverse group of survey researchers who can offer advice on various procedures and techniques to use when producing harmonized files. The ISSP Data Wizard (German Social Science Infrastructure Services, 2010) was used by the International Social Survey Programme (ISSP). It was one of the first tools developed to support procedures that were previously performed manually to harmonize data at the cross-national level. The tool offered rule-based checks, automation of partial steps, and the visualization of certain conditions, to make the harmonization process more efficient, easier, and less susceptible to mistakes.

2.3    The European Values Study (EVS) formed a number of work groups, both before and after fieldwork. The aim on the one hand was to set standards at an early stage, and on the other to consolidate and merge data which had been cleaned by participating national survey teams. This project produced an integrated source questionnaire and a set of equivalency tables to assist secondary researchers. The project web site makes all of this information easily accessible (EVS, 2014). These processes and products provide critical information to secondary users of these data.

2.4    The DataSchema and Harmonization Platform for Epidemiological Research (DataSHaPER) is one potential tool for output harmonization. Fortier et al. (2011) showed that, when the DataSHaPER was used across 53 studies, 64% of “essential” constructs from those selected could be harmonized completely or “partially”. This estimate used the most conservative criteria, and evaluation of harmonization potential would likely improve this statistic (Fortier et al., 2011). A newer version of this tool is Opal (https://www.maelstrom-research.org/what-we-offer/open-source-software).


3. Focus on both the variable and survey levels in the harmonization process.

Rationale

Harmonization efforts usually concentrate on comparing and integrating information involving specific variables across data files. However, it is equally important to consider the overall characteristics of the surveys that make them good candidates for harmonization, and to report the decisions involving this process to end users.

Procedural steps

3.1    Recognize the different aspects involved in converting source variables, which might include variable concepts or scales of measurement, into target variables.  The concept of citizenship, for example, presents significant challenges to researchers who want to investigate this topic (Minkel, 2004).

3.2    Describe similarities and differences between the source variables and the target variables, including discussion of universe statements, question wording, coding schemes, and missing data definitions. There may be an unavoidable loss of information resulting from harmonization, such as if a variable that was continuous is being harmonized with a categorical variable (Fortier et al., 2011).

3.3    Consider file-level attributes when creating the harmonized data file, including how survey weights, imputation procedures, variance estimation, and key substantive and demographic concepts will change in the process.

3.4    Pay particular attention to sampling designs and data collection methods in making assessments about the degree of comparability between different surveys. See Survey Quality for a discussion of how quality profiles can be developed and used to assess comparability in a 3MC survey.
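The loss of information noted in step 3.2 can be made concrete with a small Python sketch. It assumes a hypothetical case in which one survey records age in years and another records only age groups, so the continuous variable must be collapsed to the coarser scheme. The break points, labels, and function name are illustrative assumptions, not taken from any cited study.

```python
from bisect import bisect_right

# Hypothetical target scheme: lower bounds of each age group after the first.
AGE_BREAKS = [18, 30, 45, 65]
AGE_LABELS = ["<18", "18-29", "30-44", "45-64", "65+"]


def age_to_group(age):
    """Map age in years to the categorical target variable."""
    return AGE_LABELS[bisect_right(AGE_BREAKS, age)]


# Distinct source values collapse into one target category.
source_ages = [31, 44, 67]
harmonized = [age_to_group(a) for a in source_ages]
```

Ages 31 and 44 both become "30-44", and the conversion cannot be undone from the harmonized value alone. This is one reason to retain the source variable alongside the target variable and to document the break points used.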

Lessons learned

3.1    Data producers must recognize the degrees of individual item or variable persistency when creating questionnaires and collecting data. Item persistency over time is very important in generating harmonized data files. There are considerable differences, for example, between an “absolute” persistent variable, such as “country of birth,” and a less persistent variable, such as “country of citizenship.” The concept might mean different things in different countries, is subject to change, and could be reported validly for multiple countries by some respondents (Minkel, 2004).

3.2    Quota sampling destroys comparability (Häder & Gabler, 2003; Heeringa & O’Muircheartaigh, 2010). Harmonization will not make data from quota sampling comparable with data gathered via probability sampling. The ISSP is an example of a 3MC survey program that abolished quota sampling.

3.3    The European Social Survey (ESS) provides detailed insight into weighting issues and makes this information available. See the ESS data site for each survey round for the latest version.

3.4    The Collaborative Psychiatric Epidemiology Surveys (2014) created a harmonized data file from three comparable surveys on mental health. Data producers created a pooled weight for the harmonized file, based on race/ancestry groupings and on the geographic domains of the sampling frames of each individual survey. Understanding the specific characteristics of each input file was an essential part of creating a harmonized output file (Heeringa & Berglund, 2007). All of this information was provided to users in a comprehensive explanation of the original and harmonized weights.


4. Develop criteria for measuring the quality of the harmonization process. This includes testing it with users knowledgeable about the characteristics of the underlying surveys, the meaning of source variables, and the transformation of source variables into target variables.

Rationale

Researchers may analyze harmonized files in new and unexpected ways. It is crucial to provide them sufficient information about the concepts and definitions presented, and the assumptions underlying the decisions made in their construction.

Procedural steps

4.1    Devise procedures to judge the quality of the harmonized outputs based on such quality criteria as consistency, completeness, and comparability.

4.1.1   Consistency can be judged by comparing the results from multiple independent efforts of harmonizing a variable; completeness is assessed based on the degree to which the original information is preserved in the harmonized data; and comparability is the degree to which the harmonized outputs can accurately report important social or economic concepts over time or between countries or cultures.

4.1.2   The Statistical Office of the European Communities (EUROSTAT) proposed the following set of quality criteria when reporting statistics, which also apply to harmonization outputs (Database of Integrated Statistical Activities, 2014; Joint UNECE/EUROSTAT Work Session on Statistical Data Confidentiality, 2009):

  • Relevance of the statistical concepts.
  • Accuracy.
  • Topicality and timeliness of the dissemination of results.
  • Accessibility and clarity of the information.
  • Comparability of the statistical data.

4.1.3   Strictly speaking, these traits apply to official statistical data. However, many of them would apply equally to academically produced survey data, particularly those regarding the comparability of social, economic, and demographic concepts in a 3MC context, and the accuracy of estimates.

4.2    Be prepared to modify and update harmonized datasets after public release, based on comments from the research community, if errors are uncovered, or if certain variables need further explanation.

4.3    Prepare presentations at social science research conferences that describe the harmonization process to potential users.
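Two of the criteria in step 4.1.1 lend themselves to simple numeric summaries. The Python sketch below is one possible operationalization, offered as an assumption rather than an established standard: consistency is taken as the share of cases on which two independent harmonization efforts agree, and completeness as the ratio of distinct target categories to distinct source categories. The function names and example data are hypothetical.

```python
def consistency(coding_1, coding_2):
    """Share of cases where two independent harmonization efforts agree."""
    if not coding_1 or len(coding_1) != len(coding_2):
        raise ValueError("codings must be non-empty and of equal length")
    agree = sum(a == b for a, b in zip(coding_1, coding_2))
    return agree / len(coding_1)


def completeness(mapping):
    """Crude proxy: distinct target categories per distinct source category.

    A value of 1.0 means no source categories were collapsed.
    """
    if not mapping:
        raise ValueError("mapping must be non-empty")
    return len(set(mapping.values())) / len(mapping)


# Hypothetical results from two teams harmonizing the same four cases.
team_1 = ["low", "mid", "high", "mid"]
team_2 = ["low", "mid", "mid", "mid"]
agreement = consistency(team_1, team_2)

# Hypothetical recode that collapses five source codes into two targets.
recode = {1: "married", 2: "married", 3: "not_married",
          4: "not_married", 5: "not_married"}
preserved = completeness(recode)
```

Neither number addresses comparability, which depends on whether the harmonized variable measures the same concept across countries or over time and still requires substantive review.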

Lessons learned

4.1    The usefulness of well-harmonized data is clearly recognized by many international organizations. The United Nations Economic and Social Council recognized the importance of harmonizing environmental data collection activities in order to produce comparable indicators on the environment and its relationship to the economy. They determined to bring the System of Environmental-Economic Accounts (SEEA) to an international statistical standard.  The SEEA now provides the first international standard for environmental-economic accounting (United Nations Economic and Social Council, 2005; see also http://unstats.un.org/unsd/envaccounting/seea.asp).  


5. Provide the widest range possible of data and documentation products about the entire harmonization process.

Rationale

Regardless of whether utilizing input or output harmonization as a strategy, all aspects of the survey planning, collection, and dissemination process should be considered when producing harmonized data files or creating accompanying documentation. Users should have access not only to the harmonized end result, but also to detailed information about all steps taken by the producers, as well as source materials, in order for them to fully understand what decisions were made during the entire process.

Procedural steps

5.1    Define the elements of the harmonization process and start documenting it from the beginning in order to ensure that all decisions are captured even before a definite plan to produce a public-use data file exists.

5.2    To the greatest extent possible, document each target variable with information from all source variables, transformation algorithms, and any deviations from the intended harmonized approach.

5.3    If possible, provide users with access to the original data files used in producing the harmonized file. If direct access to original data is not permissible due to confidentiality concerns, implement procedures to assist users in proper check-backs or re-transformations.  Also consider implementing some form of restricted-use data agreement to allow access under controlled conditions.

5.4    Prioritize providing users with the code or syntax used in creating new variables for the harmonized file.

5.5    Provide users with as complete as possible documentation, including crosswalks, which describe all the relationships between variables in individual data files with their counterparts in the harmonized file. An interactive, web-based documentation tool is often the best way to present such documentation.

5.5.1   Include original questionnaires and information about the data collection process whenever possible.

5.6    Report on as many of the following elements of the data lifecycle as apply to the particular harmonization process:

   Study Design and Operational Structure:

5.6.1      Project planning.

   Sample Design, Questionnaire Design and Instrument Technical Design:

5.6.2      Sampling frame.

5.6.3      Sample size.

5.6.4      Sample design (See Instrument Technical Design, Questionnaire Design, and Sample Design).

5.6.5      Duration of the field period.

5.6.6      Instrument construction and design.

   Adaptation of Survey Instruments and Translation:

5.6.7      Translation and adaptation (See Translation).

   Data Collection:

5.6.8      Mode(s) of interview.

5.6.9      Respondent follow-up if panel survey.

5.6.10    Data collection methods (See Data Collection: Face-to-Face Surveys, Data Collection: Telephone Surveys, Data Collection: Self-Administered Surveys, and Survey Quality).

   Data Processing and Statistical Adjustment:

5.6.11    Editing.

5.6.12    Item nonresponse.

5.6.13    Unit nonresponse.

5.6.14    Any special treatment given to demographic and country-specific variables.

5.6.15    Sample weights.

5.6.16    Variance estimation.

5.6.17    Data production, including both planned and ad-hoc decisions implemented during variable conversion.

5.6.18    Documentation production.

   Data Dissemination:

5.6.19    Dissemination (See Data Dissemination).

This list is based on documentation provided in the Integrated Health Interview Series (IHIS). The IHIS is an effort to provide an assortment of variables from the core household and person level files from the National Center for Health Statistics’ seminal data collection effort on the health conditions for the US population from 1969 to the present. It provides extensive user notes and FAQ pages to describe how their harmonization project coped with several of these components (Integrated Health Interview Series, 2014).

5.7    Consider archiving the original and harmonized data with a trusted data archive to ensure continued availability of all data and documentation files and long-term preservation. See Data Dissemination for additional discussion regarding archiving.
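The crosswalk documentation described in steps 5.2 and 5.5 can be produced directly from the recorded harmonization decisions. The following Python sketch assumes each decision is stored as a dictionary; the field names, variable names, and the function crosswalk_csv are hypothetical and chosen only for illustration. It writes one row per source variable so that users can see which source variables feed each target variable, the transformation applied, and any deviation from the intended approach.

```python
import csv
import io

# Hypothetical column layout for a crosswalk file.
FIELDS = ["target_var", "study", "source_var", "transformation", "deviation"]


def crosswalk_csv(entries):
    """Render harmonization decisions as CSV text, sorted by target and study."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for entry in sorted(entries, key=lambda e: (e["target_var"], e["study"])):
        # Missing fields (such as no deviation) are written as empty cells.
        writer.writerow({field: entry.get(field, "") for field in FIELDS})
    return buffer.getvalue()


# Hypothetical decisions for one target variable drawn from two studies.
entries = [
    {"target_var": "h_marital", "study": "study_b", "source_var": "civil",
     "transformation": "recode 1 -> married; 2 -> not_married"},
    {"target_var": "h_marital", "study": "study_a", "source_var": "marstat",
     "transformation": "recode 1/2 -> married; 3 -> not_married",
     "deviation": "code 9 left unmapped"},
]
text = crosswalk_csv(entries)
lines = text.splitlines()
```

A flat file of this kind is a minimal form of crosswalk. The interactive, web-based documentation recommended in step 5.5 could be built on the same records.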

Lessons learned

5.1    The Eurobarometer Survey Series, in operation since 1973, now includes several dozen cross-sectional surveys, all of which have been harmonized into single cross-national files before being made available to researchers. These surveys are released initially with basic information about each study and the characteristics of all variables, and are then further processed by the social science data archives, led by German Social Sciences Infrastructure Services, to include variable frequencies, more complete documentation, and online analysis services for researchers (Eurobarometer Survey Series, 2014). Such partnerships between data producer and social science data archives encourage long-term preservation, enhance access, and make it possible to continually improve services to the research community.

5.2    Some harmonization projects have gone to great lengths to describe their procedures in specific detail.  For example, the Multinational Time Use Study (MTUS) has a User Guide and a comprehensive description of its coding procedures used in creating its harmonized data file (MTUS, 2014). Similarly, the Generations and Gender Programme (GGP) of the United Nations Economic Commission for Europe Population Activities Unit (UNECE-PAU) provides reports and guidelines about how the organization implements its harmonization decisions (Kveder & Galico, 2014). These projects provide transparency to both creators and users of these data and serve as an example for others to follow.