Mengyao Hu, 2016
The use of survey auxiliary data (including paradata) to investigate and reduce survey error has gained tremendous attention in survey research because of the wide range of information these data provide about the survey data collection process.
As discussed in Survey Quality, there are frequently methodological, organizational, and operational barriers to ensuring quality, especially in multinational, multicultural, or multiregional surveys, which we refer to as ‘3MC’ surveys. Various errors, such as nonresponse error, measurement error, coverage error, and sampling error, can threaten the final survey estimates. Different approaches can be taken to enhance survey data quality at each stage of the survey lifecycle. In addition to procedures to standardize and improve the survey data collection process, quantitative methods (e.g., benchmark data and simulation studies) and qualitative methods (e.g., cognitive interviews and focus groups) have also been used to investigate error sources (see Survey Quality for more information).
With advances in technology, auxiliary data such as paradata have become widely available to researchers, providing additional tools to evaluate and reduce survey error.
Auxiliary data and paradata
There is no universally accepted definition of ‘auxiliary data’; it is often defined as all data except the survey data itself. Sampling frame data, data resulting from linkage to secondary datasets, and paradata all fall into the category of auxiliary data.
Paradata has been widely discussed and applied during both data collection and analysis. Couper introduced the term ‘paradata’ into the survey research methodology field, and the definition of paradata has expanded considerably since then. As discussed in the 2011 International Nonresponse Workshop, two main types of paradata are available. One is process paradata, which are collected during data collection (e.g., timestamps and keystroke data); the other is observational information (e.g., the observed demographics of respondents and observed neighborhood conditions). notes that some paradata, like interviewers’ observations about respondents’ characteristics, can fall into both categories.
The types of available and commonly used auxiliary data, including paradata, vary with survey mode. See Appendix A for a detailed description of different types of paradata and other auxiliary data associated with survey modes. The discussion in this chapter includes paradata and other auxiliary data which have been used to investigate and reduce survey errors.
Aims of this chapter
Paradata and other auxiliary data have been collected and well-documented in many surveys. For example, the European Social Survey (ESS) closely monitors the survey process, collects various types of paradata using contact forms, and documents the paradata for each wave of the survey.
Increased access to paradata and other auxiliary data enables researchers to investigate survey error sources along many different dimensions. For example, recordings of the interactions between interviewers and respondents at the doorstep can help to reveal reasons for survey nonresponse; keystroke data, such as records showing that a recorded answer was changed, can help to flag potential measurement error; and contact history records for random walk sampling can be used to evaluate coverage error (see guidelines below).
In addition to the investigation of errors, paradata is often used in responsive designs. In this case, researchers continually monitor selected paradata to inform the error-cost tradeoff in real time, and use this as the basis for altering design features during the course of data collection or for subsequent waves. For example, to reduce survey error, researchers can implement interventions (e.g., providing additional interviewer training) based on paradata-derived indicators during real-time data collection.
Note that in addition to investigating survey error and monitoring data collection to reduce it, paradata and other auxiliary data can be used for substantive studies. For example, interviewer observational data on graffiti collected in the ESS can be used to study survey error sources (e.g., whether neighborhoods with more graffiti are less likely to respond to surveys) as well as to investigate substantive questions (e.g., whether the presence of graffiti predicts residents’ satisfaction and their plans to move). This chapter focuses only on the investigation and reduction of different survey error sources.
This chapter aims to provide an introduction to the use of paradata for studying and reducing various types of survey error. This chapter follows closely . For a more comprehensive discussion of the use of paradata and other auxiliary data to investigate and reduce survey error, see .
Goal: Consider different ways to use paradata to study and reduce nonresponse, measurement, coverage, and sampling error, which are discussed in turn below.
1. Use paradata and other auxiliary data to investigate nonresponse error.
The model of the biasing effect of nonresponse includes two components: the response rate and the differences between respondents and nonrespondents. The former is the proportion of eligible sample units who complete an interview. The latter is the magnitude of the differences between respondents and nonrespondents on measures of interest (e.g., the difference in means for a survey estimate). If there is no difference between respondents and nonrespondents, there is no nonresponse bias, regardless of the response rate. If nonrespondents differ from respondents, the lower the response rate, the larger the bias is likely to be. The challenge in studying nonresponse bias is that differences between respondents and nonrespondents are difficult to ascertain, because little data is typically available about nonrespondents. Paradata can sometimes shed light on these differences.
For example, if specific paradata are available for both respondents and nonrespondents and are informative about the survey outcomes (e.g., completed interviews, refusals), they may also inform researchers about the likely differences between respondents and nonrespondents. This can help researchers to evaluate nonresponse bias. Examples of such paradata include call history data and interviewer observations.
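The two components described above are often combined in a deterministic approximation of the bias of the respondent mean: bias ≈ (1 − RR) × (ȳ_r − ȳ_nr), where RR is the response rate and ȳ_r and ȳ_nr are the respondent and nonrespondent means. The Python sketch below (the function name is hypothetical) simply evaluates this expression; in practice ȳ_nr is unknown, which is why paradata available for nonrespondents are valuable.

```python
def nonresponse_bias(response_rate: float, mean_resp: float, mean_nonresp: float) -> float:
    """Deterministic approximation of the bias of the respondent mean.

    bias = (1 - response_rate) * (mean_resp - mean_nonresp)
    This equals mean_resp minus the full-sample mean, where the full-sample mean is
    response_rate * mean_resp + (1 - response_rate) * mean_nonresp.
    """
    if not 0.0 <= response_rate <= 1.0:
        raise ValueError("response_rate must be between 0 and 1")
    return (1.0 - response_rate) * (mean_resp - mean_nonresp)


# Hypothetical values: respondents and nonrespondents differ by 10 points.
for rr in (0.8, 0.6, 0.4):
    print(f"response rate {rr:.0%}: bias = {nonresponse_bias(rr, 50.0, 40.0):.1f}")

# With no respondent/nonrespondent difference, the bias is zero at any response rate.
print(nonresponse_bias(0.2, 50.0, 50.0))
```

With the same 10-point difference, the bias grows from 2.0 to 6.0 points as the response rate falls from 80% to 40%, while a zero difference gives zero bias even at a 20% response rate.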
1.1 Investigate the paradata/auxiliary data available to study nonresponse error. Availability is likely to vary with the mode of the survey interview.
1.1.1 Interviewer observations and data on interviewer-household/respondent interactions are only available in interviewer-administered surveys.
1.1.2 On the other hand, call history data can be collected in both interviewer-administered and self-administered surveys. For example, in Web surveys, call history data consist of records of the emails and invitations sent to sample units.
1.2 Understand the different types of paradata/auxiliary data that can be used to study nonresponse error. Several types of paradata/auxiliary data have been used to study the propensity to participate in surveys and nonresponse bias (for a detailed discussion, see ).
1.2.1 Call history data can inform researchers about:
- The date and time each call was made.
- The outcome of each call (noncontact, refusal, interview, ineligibility, etc.). For more information, refer to the disposition codes for call outcomes.
- The number of call attempts made.
- The pattern of the call attempts (e.g., time/day of calls).
- The time between call attempts.
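The call history elements listed above are typically stored as one record per call attempt. A minimal Python sketch, assuming a hypothetical record layout (case identifier, timestamp, outcome code), shows how per-case indicators such as the number of attempts, the final outcome, the hours of day called, and the time between attempts can be derived. The outcome labels are placeholders for a survey's own disposition codes.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CallRecord:
    case_id: str         # sample unit identifier (hypothetical field)
    timestamp: datetime  # date and time the call was made
    outcome: str         # disposition, e.g. "noncontact", "refusal", "interview"


def summarize_calls(records):
    """Derive per-case call history indicators from one-record-per-call data."""
    by_case = defaultdict(list)
    for rec in records:
        by_case[rec.case_id].append(rec)
    summary = {}
    for case_id, calls in by_case.items():
        calls.sort(key=lambda r: r.timestamp)  # chronological order
        gaps = [
            (later.timestamp - earlier.timestamp).total_seconds() / 3600
            for earlier, later in zip(calls, calls[1:])
        ]
        summary[case_id] = {
            "attempts": len(calls),
            "final_outcome": calls[-1].outcome,
            "call_hours": [c.timestamp.hour for c in calls],  # pattern of call times
            "hours_between_attempts": gaps,
        }
    return summary


records = [
    CallRecord("A1", datetime(2016, 3, 2, 19, 30), "interview"),
    CallRecord("A1", datetime(2016, 3, 1, 10, 0), "noncontact"),
    CallRecord("B7", datetime(2016, 3, 1, 14, 0), "refusal"),
]
summary = summarize_calls(records)
print(summary["A1"])
```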
1.2.2 Interviewer observations may include:
- Neighborhood safety.
- Whether the sample unit lives in a multi-unit structure.
- Whether the sample unit lives in a locked building.
- Whether there is an intercom system.
- The condition of the housing unit.
- The demographic characteristics of the household.
- Proxy survey variables.
1.2.3 Recordings of the interviewer-householder interaction may capture information about:
- The doorstep statements.
- Interviewers’ pitch, speech rate, and pauses.
1.2.4 GPS data, if available, can track respondents’ locations over time. Google Maps data can provide general information on the neighborhood and household. Because few studies have used Google Maps data to study nonresponse error, the validity of this method needs further investigation. Note that the availability of such data may depend on the coverage of Google Maps Street View; some areas, such as China and many countries in Africa, are not well covered. Google publishes a map of the areas covered by Street View. Google Maps may provide the following information:
- Whether the housing unit is in a multi-unit structure.
- The condition of the housing unit (if it is a single housing unit which can be observed on the map).
- Some demographic characteristics of the neighborhood, such as its racial or ethnic composition, may be inferred from the number and types of stores or restaurants in the area (e.g., several Chinese stores or restaurants may suggest that many Chinese residents live nearby).
- Whether there are many abandoned houses with broken windows in the area, which may indicate a lower level of safety.
1.3 Know how each type of paradata and other auxiliary data can inform researchers about nonresponse error, and select paradata/auxiliary data based on the purpose of the investigation.
1.3.1 Call history data can be used to study nonresponse error in several ways:
- To study the best time to call (‘best’ here referring to the call time that yields the highest cooperation rates). For example, previous literature on the best times to call found that for landline surveys, weekday nights and weekends were better than weekday mornings and afternoons. For cell phone interviews, weekday afternoons are also a good time to call. Cultural and regional differences need to be taken into account when evaluating the best time to call. A study conducted in the U.S. Virgin Islands, Guam, Puerto Rico, and the mainland U.S. found that the best time to make contact varied among these regions, likely due to cultural and working-time differences.
- To evaluate the relationship between cost and response rates. Additional call attempts are likely to lead to improved response rates, but also increase the field time and cost of survey operations.
- To monitor call records in real-time or on a daily basis. This can show the relationship between the average number of calls and expected response rates.
- To study nonresponse bias after data collection.
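As a simple illustration of the first use, call records can be grouped into time slots and the share of calls that yield an interview compared across slots. In the Python sketch below, the slot boundaries (weekend, weekday before 17:00, weekday from 17:00) and the metric (interviews per call attempt, a rough proxy for the cooperation rate) are assumptions; in 3MC surveys the slots should reflect local working hours and customs.

```python
from collections import Counter
from datetime import datetime


def time_slot(ts: datetime) -> str:
    """Assumed slot scheme: weekend, weekday daytime, weekday evening."""
    if ts.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return "weekend"
    return "weekday_evening" if ts.hour >= 17 else "weekday_day"


def interview_rate_by_slot(calls):
    """calls: iterable of (timestamp, outcome). Returns {slot: share of calls yielding an interview}."""
    attempts, interviews = Counter(), Counter()
    for ts, outcome in calls:
        slot = time_slot(ts)
        attempts[slot] += 1
        if outcome == "interview":
            interviews[slot] += 1
    return {slot: interviews[slot] / n for slot, n in attempts.items()}


calls = [
    (datetime(2016, 3, 1, 10, 0), "noncontact"),   # Tuesday morning
    (datetime(2016, 3, 1, 19, 0), "interview"),    # Tuesday evening
    (datetime(2016, 3, 2, 11, 0), "noncontact"),   # Wednesday morning
    (datetime(2016, 3, 2, 18, 30), "noncontact"),  # Wednesday evening
    (datetime(2016, 3, 5, 12, 0), "interview"),    # Saturday
]
rates = interview_rate_by_slot(calls)
print(rates)
```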
1.3.2 Interviewer observations and Google Maps data can be used to study:
- Survey participation, contactability, and cooperation. For example, interviewer observations on neighborhood safety can be used to examine survey participation: neighborhood safety is often associated with contactability and cooperation, and may reveal the reasons for low participation in certain areas. Researchers can send invitation letters to sample units in unsafe areas to increase their trust in the interviewers.
- Access impediments using observation data on housing units.
- Demographic characteristics. For example, found low response rates among individuals with high incomes.
1.3.3 Interviewer-householder/respondent interactions can be used to study:
- Survey cooperation. Interaction at the doorstep has been found to be highly correlated with survey participation. If certain interviewers lack persuasive skills, or make many mistakes during the doorstep introduction, special training can be provided.
- Topic salience of the survey. Refusals due to lack of interest in the survey topic can be used to study survey cooperation. Such refusals also indicate a potential risk of nonresponse bias, since in this case respondents are likely to differ from nonrespondents on topic-related measures.
1.4 Study the relationship between paradata and nonresponse error. For example, test whether paradata indicators predict survey participation and cooperation. Some paradata, such as the interviewer-respondent interaction data, can also be used to diagnose nonresponse bias (e.g., in the evaluation of topic salience).
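One way to carry out this step is to fit a response propensity model: a logistic regression of a response indicator on paradata indicators. The sketch below uses plain batch gradient descent in Python so that it needs no third-party packages; the indicator names (unsafe area, locked building) and the twelve toy cases are invented for illustration, and a real analysis would use a standard statistical package.

```python
import math


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient-descent logistic regression; w[0] is the intercept."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted minus observed
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict_propensity(w, xi):
    """Estimated probability of response for one case's paradata indicators."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))


# Toy paradata: [unsafe_area, locked_building] (hypothetical interviewer observations).
X = [[0, 0], [0, 0], [0, 0], [0, 0],
     [0, 1], [0, 1],
     [1, 0], [1, 0], [1, 0],
     [1, 1], [1, 1], [1, 1]]
y = [1, 1, 1, 0,   1, 0,   1, 0, 0,   0, 0, 1]  # 1 = responded

weights = fit_logistic(X, y)
print("coefficients (intercept, unsafe, locked):", [round(w, 2) for w in weights])
print("propensity, safe/unlocked:", round(predict_propensity(weights, [0, 0]), 2))
print("propensity, unsafe/locked:", round(predict_propensity(weights, [1, 1]), 2))
```

A negative coefficient for an indicator suggests that cases with that characteristic are less likely to respond; whether the indicator is also related to key survey variables must be checked separately before drawing conclusions about bias.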
Lessons learned
1.1 Using paradata on interviewer and respondent interactions, studied the relationship between survey participation and the interviewer’s speech behaviors. They found that an interviewer’s pitch can influence survey participation. In addition, when interviewers were moderately disfluent, respondents were more likely to agree to participate in surveys.
1.2 studied the relationship between paradata and cooperation and contactability of households in the National Health Interview Survey (NHIS) sample. They found that refusals due to health-related reasons were associated with important survey variables. This suggests that topic salience-related paradata can be used for studying nonresponse bias.
1.3 studied nonresponse across different countries using paradata including call record data from the European Social Survey (ESS), data on interviewer attitudes, and doorstep behavior. They found that countries differ in their contacting and cooperation processes, which can in part be explained by interviewer effects, such as contacting strategies and doorstep behavior.
2. Use paradata and other auxiliary data to reduce nonresponse in a responsive design framework for quality control purposes in data collection.
To increase response rates and/or achieve a more balanced sample, researchers often use paradata to monitor data collection in real time and intervene accordingly, frequently in subsequent waves (an approach known as paradata-driven responsive design). For example, researchers can use paradata and statistical algorithms to optimize calling strategies (e.g., finding the best time to call or determining how many call attempts to make) for subsequent waves of data collection.
listed the following four steps for responsive design using paradata-derived indicators:
2.1 Before starting the survey, pre-identify a set of design features that may affect nonresponse error, including both unit and item nonresponse.
2.2 Identify a set of indicators of nonresponse error and monitor them in the initial phases of data collection.
2.2.1 Review and identify key performance indicators (KPIs) derived from paradata, such as interviews per hour, daily completion rate, and average interview duration. See for a review of the summarized KPIs.
2.2.2 Select the appropriate indicators.
2.2.3 Monitor the indicators in the initial phase of data collection on a daily or weekly basis.
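The KPIs named in 2.2.1 can be computed from daily interviewer work records. A minimal Python sketch, assuming a hypothetical layout of one row per interviewer per day (hours worked, cases attempted, completed interviews, total minutes of completed interviews), aggregates them into daily values. The definition of ‘completion rate’ used here (completes per case attempted) is an assumption; surveys define it in different ways.

```python
from collections import defaultdict
from datetime import date


def daily_kpis(rows):
    """Aggregate interviewer-day rows into daily KPIs.

    Each row: (day, hours_worked, cases_attempted, completes, interview_minutes),
    where interview_minutes is the total length of that day's completed interviews.
    """
    totals = defaultdict(lambda: [0.0, 0, 0, 0.0])
    for day, hours, attempted, completes, minutes in rows:
        t = totals[day]
        t[0] += hours
        t[1] += attempted
        t[2] += completes
        t[3] += minutes
    kpis = {}
    for day, (hours, attempted, completes, minutes) in sorted(totals.items()):
        kpis[day] = {
            "interviews_per_hour": completes / hours if hours else 0.0,
            "completion_rate": completes / attempted if attempted else 0.0,
            "avg_interview_minutes": minutes / completes if completes else None,
        }
    return kpis


rows = [
    (date(2016, 3, 1), 8.0, 20, 4, 240.0),  # interviewer A
    (date(2016, 3, 1), 6.0, 10, 3, 150.0),  # interviewer B
    (date(2016, 3, 2), 8.0, 16, 2, 130.0),  # interviewer A
]
kpis = daily_kpis(rows)
for day, values in kpis.items():
    print(day, values)
```

Plotting these daily values over the field period makes it easier to spot a drop in productivity or an unexpected change in interview length early enough to intervene.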
2.3 Implement interventions by altering the active features of the survey, either immediately or in subsequent phases, based on cost/error tradeoff decision rules.
2.3.1 Different types of management interventions can be implemented. The 2006–2010 National Survey of Family Growth (NSFG) implemented three different types of management interventions, as described in :
- Case prioritization, which aimed at “checking whether the central office could influence field outcomes by requesting that particular cases be prioritized by the interviewers.” The prioritized cases were flagged in the sample management system, and interviewers were asked to put more effort into these cases.
- Screener week, which refers to “shifting the emphasis of field work in such a way that eligible persons (and proxy indicators of nonresponse bias for those persons) would be identified as early as possible.” To reduce nonresponse in the screener interviews, increased efforts were made to reach screener sample cases that had not previously been contacted. In addition, demographic information collected in the screening interviews can help with sample balance.
- Sample balance, as described in , is sought in order to “minimize the risk of nonresponse bias by endeavoring to have the set of respondents match the characteristics of the original sample (including nonresponders) along key dimensions, such as race, ethnicity, sex, and age.” In NSFG, variation in subgroup response rates was chosen as a proxy indicator for nonresponse bias. To bring the composition of the interviewed cases closer to the true population, researchers in NSFG prioritized cases from subgroups that were responding at lower rates.
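For the sample balance intervention, subgroup response rates can be tracked during fieldwork and lagging subgroups flagged for case prioritization. The Python sketch below summarizes the variation in subgroup response rates with a coefficient of variation and flags subgroups more than a chosen tolerance below the mean rate; the subgroup labels, the 5-point tolerance, and the use of the coefficient of variation are illustrative assumptions, not a description of the NSFG's exact procedure.

```python
from statistics import mean, pstdev


def subgroup_response_rates(cases):
    """cases: iterable of (subgroup, responded) pairs for eligible sample units."""
    counts = {}
    for group, responded in cases:
        n, r = counts.get(group, (0, 0))
        counts[group] = (n + 1, r + int(responded))
    return {g: r / n for g, (n, r) in counts.items()}


def balance_report(rates, tolerance=0.05):
    """Return the coefficient of variation of subgroup response rates and the
    subgroups whose rate falls more than `tolerance` below the mean rate."""
    rate_values = list(rates.values())
    avg = mean(rate_values)
    cv = pstdev(rate_values) / avg if avg else float("nan")
    lagging = sorted(g for g, rr in rates.items() if rr < avg - tolerance)
    return cv, lagging


# Toy data: 10 eligible cases per hypothetical subgroup.
cases = (
    [("men_15_24", i < 3) for i in range(10)]
    + [("women_15_24", i < 6) for i in range(10)]
    + [("men_25_44", i < 5) for i in range(10)]
    + [("women_25_44", i < 6) for i in range(10)]
)
rates = subgroup_response_rates(cases)
cv, lagging = balance_report(rates)
print(f"CV of subgroup response rates: {cv:.2f}; prioritize: {lagging}")
```

A falling coefficient of variation over the field period would indicate that prioritization is bringing subgroup response rates closer together.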
2.4 Combine data from the separate design phases into a single estimator.
2.4.1 After the first three steps, data from all phases are combined to produce the final survey estimates. Appropriate weighting procedures and instructions for variance estimation need to be provided accordingly.
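How phases are combined depends on the design. One common arrangement, assumed here for illustration, is that the final phase follows up a random subsample of the remaining nonrespondents; phase-2 respondents then receive an extra weight factor equal to the inverse of the subsampling rate so that they also represent the nonrespondents who were not subsampled. The Python sketch below (hypothetical names, toy values) computes such a combined weighted mean; it omits further nonresponse adjustment and variance estimation, which a production estimator would need.

```python
def combined_mean(phase1, phase2, subsample_rate):
    """Weighted mean combining two data-collection phases.

    phase1: list of (base_weight, y) for cases that responded in phase 1.
    phase2: list of (base_weight, y) for cases that responded in phase 2, drawn from
            a random subsample (at rate subsample_rate) of phase-1 nonrespondents.
    """
    if not 0 < subsample_rate <= 1:
        raise ValueError("subsample_rate must be in (0, 1]")
    weighted = list(phase1)
    # Phase-2 cases also stand in for the nonrespondents that were not subsampled.
    weighted += [(w / subsample_rate, y) for w, y in phase2]
    total_w = sum(w for w, _ in weighted)
    return sum(w * y for w, y in weighted) / total_w


phase1 = [(1.0, 10.0), (1.0, 12.0), (1.0, 14.0)]
phase2 = [(1.0, 20.0)]
print(combined_mean(phase1, phase2, subsample_rate=0.5))
print(combined_mean(phase1, phase2, subsample_rate=1.0))  # no subsampling: plain mean
```

Ignoring the subsampling factor would understate the contribution of late responders, who may differ from early responders on the survey variables.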
Lessons learned
2.1 Researchers found that demographic factors in NSFG are predictive of key survey variables. Therefore, differences in response rates across demographic groups can be used as indicators of potential nonresponse bias. Assuming that nonrespondents are missing at random (i.e., nonrespondents and respondents within each subgroup do not differ with respect to the survey variables being collected), equal response rates across demographic groups will minimize the overall difference between respondents and nonrespondents. Paradata such as interviewer observations can be used as proxy indicators in a responsive design to reduce nonresponse bias.
2.2 Researchers can use interviewer-generated paradata to judge the likelihood that individual sample cases will become respondents. By building response propensity models from paradata, it is possible to estimate, on a daily basis, the probability that the next call on a sample case will produce an interview. In the NSFG, the collection of paradata and other auxiliary data began at the listing stage of the sample and ended at the last call on the last case.
2.3 When specific subgroups of active sample cases in the 2006–2010 U.S. NSFG were found lagging on key process indicators, interventions were implemented for quality control. For example, if the response rates of older male Hispanics lagged, interviewers could target cases in this specific subgroup, flagging them in the sample management system. Priority calls could then be assigned to these cases.
2.4 The European Social Survey (ESS) uses information extracted from various auxiliary data, such as call records, to provide feedback to fieldwork organizations for the next round and to analyze nonresponse. The contact forms allow the ESS to calculate response rates and compare field efforts across countries. As mentioned by , using ESS auxiliary data, optimal visiting times can be predicted and respondents can be classified according to field efforts in order to analyze nonresponse bias. However, real-time intervention remains a challenge for the ESS, since many countries still use paper-and-pencil questionnaires.
3. Use paradata and other auxiliary data to study nonresponse to within-survey requests.
In addition to the traditional interviewing technique of asking respondents questions from a questionnaire, surveys often collect data that do not follow the question-and-answer format. Examples include collecting biomeasures (such as height, weight, and blood pressure), requesting access to administrative records (such as individual-level identification numbers like driver’s license or insurance record information), and asking respondents to mail back a questionnaire left behind after an in-person interview. Such requests often involve asking for additional permissions or switching modes (e.g., from a face-to-face to a mail survey, as with the leave-behind questionnaire), which is likely to produce nonresponse.
Respondents’ refusals of within-survey requests produce a second layer of nonresponse, which may bias the survey estimates. Paradata can be used to investigate this nonresponse error and, as discussed above, to predict and possibly prevent or reduce nonresponse. For a more detailed review of this topic, see .
3.1 Use paradata and other auxiliary data to investigate respondents’ consent to link their survey data to administrative records, if applicable.
3.1.1 As mentioned by , two hypothesized reasons for such refusal are privacy/confidentiality concerns and general resistance toward the survey interview. Use interviewer observations related to both of these issues to understand nonresponse. The Health and Retirement Study (HRS) has collected interviewer observations on both, as described below.
- Questions regarding privacy concerns: “During the interview, how often did the respondent express concern about whether his/her answers would be kept confidential? (never, seldom, or often)”
- Questions regarding general resistance toward the survey interview: “How was respondent’s cooperation during the interview? (excellent, good, fair, poor)”
3.1.2 Use call history information, such as “ever refused to participate in the survey,” “total number of call attempts for a completed interview,” and “whether the respondent was a nonrespondent in previous waves,” to predict general resistance.
3.1.3 Create a paradata-based index and use it to predict respondents’ likelihood of consenting to the linkage of their survey data with administrative records.
3.1.4 Provide interventions in subsequent waves of the survey:
- When paradata indicates privacy/confidentiality concerns, flag those cases in the system and ask interviewers to explain more fully why the administrative records are needed and how the confidentiality of the data is ensured.
- When paradata indicates the likelihood of general resistance, provide interventions such as asking the interviewer to emphasize that this will be a less burdensome process. Using a shorter questionnaire can also help.
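Steps 3.1.2 to 3.1.4 can be sketched as a simple additive index of general resistance plus a rule that routes each case to a tailored approach in the next wave. In the Python sketch below, the field names, the equal weighting of indicators, the six-attempt cut-off, and the threshold of two points are hypothetical; before using such an index, check in earlier-wave data that it actually predicts consent.

```python
def resistance_index(case):
    """One point per indicator of general resistance (equal weights are an assumption)."""
    indicators = [
        case["cooperation"] in ("fair", "poor"),  # interviewer-rated cooperation
        case["ever_refused"],                     # call history: any refusal
        case["call_attempts"] >= 6,               # assumed cut-off for "many" attempts
        case["prior_wave_nonrespondent"],         # panel history
    ]
    return sum(bool(x) for x in indicators)


def linkage_approach(case, threshold=2):
    """Choose a tailored approach for the record-linkage request in the next wave."""
    if case["privacy_concern"] in ("seldom", "often"):
        return "privacy"   # explain why records are needed and how confidentiality is ensured
    if resistance_index(case) >= threshold:
        return "burden"    # emphasize that the request is not burdensome
    return "standard"


cases = {
    "R01": dict(privacy_concern="often", cooperation="good", ever_refused=False,
                call_attempts=2, prior_wave_nonrespondent=False),
    "R02": dict(privacy_concern="never", cooperation="fair", ever_refused=True,
                call_attempts=8, prior_wave_nonrespondent=False),
    "R03": dict(privacy_concern="never", cooperation="excellent", ever_refused=False,
                call_attempts=1, prior_wave_nonrespondent=False),
}
plan = {rid: linkage_approach(c) for rid, c in cases.items()}
print(plan)
```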
3.2 Use paradata and other auxiliary data to study respondents’ consent to collect biomeasures.
3.2.1 Paradata such as interviewer observations (e.g., whether the respondent asked how long the interview would last), call record information (e.g., the number of contact attempts in each wave), and timestamps can be used to predict the likelihood respondents will consent to biomeasure collection.
3.2.2 Model biomeasure consent using paradata, and evaluate whether the paradata are predictive of respondents’ likelihood to consent.
3.2.3 For future waves or surveys, provide interventions for those with a low likelihood to consent:
- Identify respondents with a low likelihood to consent using selected paradata. For example, those who were nonrespondents in previous waves or who required many calls before participating may have a lower likelihood of consenting.
- Provide interventions such as increased incentives or justifications of the importance of biomeasure data collection.
3.3 Use paradata and other auxiliary data to study respondents’ consent to switching data collection modes.
3.3.1 The following paradata can be used to predict the likelihood of respondents refusing or dropping out during the mode switch:
- Item nonresponse rate in the initial mode.
- If contact information such as an email address or cell-phone number is needed for the mode switch, whether the essential contact information was provided.
- The amount of time that elapsed during the interview using the first mode.
3.3.2 Model refusals and dropouts in the mode-switching process using paradata, and evaluate whether the paradata are predictive of respondents’ likelihood to consent to and complete the mode switch.
3.3.3 For future waves, researchers can pre-identify and intervene with persons unlikely to take part in the mode switch:
- Use paradata such as item nonresponse in the initial mode to pre-identify respondents with a low likelihood to take part in the mode switch.
- Provide interventions such as incentivizing persons flagged as having a low likelihood to participate or, if response is critical, forgoing the switch to a self-administered mode and continuing the interview in the initial interviewer-administered mode. Note that this intervention sacrifices the measurement advantages of self-administration.
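The pre-identification in 3.3.3 can be sketched with two of the indicators listed in 3.3.1: the item nonresponse rate in the initial mode and whether an email address was provided for a switch to a Web mode. In the Python sketch below, the missing-data codes (‘DK’, ‘RF’, blank) and the 10% threshold are assumptions; in practice thresholds would be set from a model fitted to earlier data.

```python
def item_nonresponse_rate(answers, missing_codes=("DK", "RF", None)):
    """Share of asked items left unanswered (don't know, refused, or blank) in the initial mode."""
    if not answers:
        return 0.0
    missing = sum(1 for a in answers.values() if a in missing_codes)
    return missing / len(answers)


def mode_switch_risk_flags(answers, email_provided, max_inr=0.10):
    """Reasons a respondent may not complete a switch to a Web mode (thresholds assumed)."""
    reasons = []
    if item_nonresponse_rate(answers) > max_inr:
        reasons.append("high item nonresponse")
    if not email_provided:
        reasons.append("no email address")
    return reasons


# Hypothetical screener answers from the initial telephone mode.
screener = {"q1": "yes", "q2": "DK", "q3": 3, "q4": "RF", "q5": "no"}
flags = mode_switch_risk_flags(screener, email_provided=False)
print(item_nonresponse_rate(screener), flags)
```

Respondents with one or more flags would be candidates for the interventions described above, such as an incentive or staying in the interviewer-administered mode.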
Lessons learned
3.1 utilized different types of paradata (primarily interviewer observations and call record information) to study consent to the linkage of survey data with administrative records. A paradata-derived index was constructed and found to be strongly predictive of the consent outcome.
3.2 used data from the 2006 HRS to analyze the relationship between paradata and consent to biomeasures. The paradata items used to predict consent included interviewer observations, call record information, and timestamps. Overall, holding respondent demographic and health characteristics and interviewer attributes constant, the interviewer observations and call record history data were strongly predictive of consent.
3.3 used paradata from a telephone survey of University of Maryland alumni to predict mode switch response. In the main study, alumni were contacted by telephone and administered a brief screening interview. Eligible respondents were then randomly assigned to one of three mode groups: interactive voice response (IVR), Web, or continuation of the interview by telephone. They found that the paradata were predictive of respondents’ participation in the mode switch. For example, item-missing data in the screening interview were found to be predictive of mode switch dropout for the Web group. Whether or not an email address was provided was also predictive of mode switch dropout.
4. Use paradata and other auxiliary data for nonresponse adjustment.
One strategy to minimize the effect of nonresponse error is to use nonresponse adjustment weights in the post-processing of the survey data. Variables traditionally used to construct nonresponse adjustment weights are those available on informative sampling frames. For example, in a survey of students at a school, each student’s age, gender, and grade may be known from the frame. In a panel survey, information from previous waves can also be used to create nonresponse adjustment weights. In many surveys, however, an informative frame may be unavailable, and frame variables may be only weakly related to survey estimates. More recently, researchers have used auxiliary data (e.g., variables collected about the sample households, such as interviewer observations or Google Maps data) in the weighting process for nonresponse adjustment. The procedural steps and examples are described below. For a more comprehensive review of the use of paradata for nonresponse adjustment, see .
4.1 Understand the characteristics of an ideal auxiliary variable for sample-based nonresponse adjustment. Four characteristics are summarized in .
4.1.1 Non-missing values of the variable must be available for both respondents and nonrespondents.
4.1.2 The variable should be measured completely and without error for all sampled units.
4.1.3 This variable should be strongly associated with important survey variables of interest.
4.1.4 This variable should also be a strong predictor of survey participation, thus reducing nonresponse bias in an adjusted mean.
4.2 Identify paradata that are available for both respondents and nonrespondents.
4.3 Select paradata- or auxiliary data-derived variables which can predict survey participation and which are associated with the survey variables of interest. Examples of such variables can be:
4.3.1 The sample unit’s neighborhood.
4.3.2 The sample unit’s housing unit.
4.3.3 Persons in the sampled housing unit.
4.3.4 Call record information collected as part of the sample management system.
4.3.5 Observations recorded by interviewers about their interaction with the sampled household at each contact.
4.4 Develop unit nonresponse adjustment weights using the selected paradata/auxiliary data.
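A common way to implement step 4.4 is a weighting class adjustment: cases are grouped into cells defined by a paradata variable observed for respondents and nonrespondents alike, and respondents’ base weights are inflated by the ratio of the total base weight of all eligible cases in the cell to that of the respondents in the cell. The Python sketch below uses a hypothetical interviewer observation (multi-unit versus single-unit housing) as the cell variable. Cells without respondents must be collapsed with neighboring cells before this step; the sketch assumes that has been done.

```python
from collections import defaultdict


def weighting_class_adjustment(cases):
    """Return nonresponse-adjusted weights for respondents, keyed by case id.

    cases: list of dicts with keys 'id', 'cell', 'base_weight', 'responded'.
    'cell' must be known for respondents and nonrespondents alike.
    """
    total = defaultdict(float)  # base weights of all eligible cases per cell
    resp = defaultdict(float)   # base weights of respondents per cell
    for c in cases:
        total[c["cell"]] += c["base_weight"]
        if c["responded"]:
            resp[c["cell"]] += c["base_weight"]
    adjusted = {}
    for c in cases:
        if c["responded"]:
            factor = total[c["cell"]] / resp[c["cell"]]  # inverse weighted response rate
            adjusted[c["id"]] = c["base_weight"] * factor
    return adjusted


cases = [
    {"id": 1, "cell": "multi_unit", "base_weight": 100.0, "responded": True},
    {"id": 2, "cell": "multi_unit", "base_weight": 100.0, "responded": False},
    {"id": 3, "cell": "multi_unit", "base_weight": 100.0, "responded": False},
    {"id": 4, "cell": "single_unit", "base_weight": 100.0, "responded": True},
    {"id": 5, "cell": "single_unit", "base_weight": 100.0, "responded": True},
    {"id": 6, "cell": "single_unit", "base_weight": 100.0, "responded": False},
]
weights = weighting_class_adjustment(cases)
print(weights)
```

The adjusted respondent weights sum to the total base weight of the eligible sample, so each cell keeps its share of the sample. The adjustment reduces bias only to the extent that the cell variable has the characteristics listed in 4.1.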
Lessons learned
4.1 found that the challenges of using paradata for nonresponse adjustment depend on how well the paradata meet the four characteristics of auxiliary variables described in Guideline 4.1. There are several reasons for this. First, the paradata themselves may be subject to item nonresponse, which raises several open questions for researchers: for example, whether to impute the missing paradata, which imputation method to use, and how to create weights if imputation is done. Second, it is rare for a single variable to be a strong predictor of both participation and the survey variable of interest; therefore, multiple paradata, in addition to other auxiliary variables, are often used when creating weights. Third, measurement error in paradata requires further research. For example, interviewer observations on the neighborhood, household unit, or interactions with respondents and other household members can themselves be subject to interviewer effects.
4.2 As mentioned above, in practice it is challenging to identify variables that are related to both response propensity and the survey outcome variables. Previous literature found that paradata, as studied in several surveys, can have a very low correlation with survey outcome variables. The study “Level-of-effort paradata and nonresponse adjustment models for a national face-to-face survey” examined the use of level-of-effort paradata (e.g., number of calls, ever refused) for nonresponse adjustment in HRS data. Its model comparisons revealed that the level-of-effort paradata were predictive of the probability of response but not of key survey outcomes; they were therefore excluded from the model. also evaluated the use of level-of-effort paradata for nonresponse adjustment, and their model failed to remove nonresponse bias. Their paper argues that this may be attributable to errors associated with the paradata.
4.3 examined whether interviewer observation data from the ESS could be useful for nonresponse adjustments in three selected countries: Greece, Portugal, and Poland. They compared point estimates for data weighted only with selection weights against those weighted with selection weights and a nonresponse adjustment weight based on interviewer observations. They found that the interviewer observations affected the survey point estimates only when there was a correlation between the survey variables and the key survey statistics, and even then the changes in point estimates were small. Weak correlations were found among response propensity, survey outcomes, and interviewer observations, which, as the authors note, may be due to high interviewer variability.
5. Use paradata and other auxiliary data, such as audit files produced by computer-assisted interviewing (CAI) software, to monitor interviewer performance and identify those interviewers who could benefit from more training.
Nonstandardized interviewer behavior (such as interviewing too quickly or too slowly, not reading the questions as worded, or using improper probes) may introduce measurement errors in surveys. Paradata and other auxiliary data such as timestamps and behavior coding can help to monitor such behavior over time. If some interviewers are not following instructions, interventions can be applied to reduce the measurement error attributable to these interviewers.
5.1 Identify a set of indicators of measurement error (e.g., very short interview time) which may be affected by interviewers. More detailed discussions on interviewer performance indicators can be found in , , and . Examples include:
5.1.1 Timestamps: interviews that are unusually short or unusually long suggest potential measurement error. The source may be the respondent. Respondents may rush through the interview or provide many ‘don’t know’ responses, or respondents of low cognitive ability may take more time understanding and answering questions. However, interviewers can also be the source of measurement error indicated by short or long interview times. For example, interviewers may intentionally skip certain questions, read very fast, or even falsify interviews.
5.1.2 Behavior coding: developed from audio recordings of the interview, behavior coding can detect whether interviewers follow standardized interviewing instructions. Examples include whether they read each question exactly as worded, whether they read at an appropriate pace (neither too fast nor too slow), and whether they provide appropriate probing.
5.1.3 Questionnaire routing: routings may be inconsistent with the instrument introductions, or an interviewer may purposely lead respondents into a specific skip pattern to shorten the interview by skipping follow-up questions. In an interpenetrated sample design (see Interviewer Recruitment, Selection, and Training), where respondents are randomly assigned to interviewers, such falsification may be detected by comparing the questionnaire routings followed by a specific interviewer with those followed by other interviewers.
5.1.4 Keystrokes: this type of paradata can reveal whether interviewers press certain function keys, whether they change their answers, and how they navigate. Behavior such as constantly changing answers may indicate measurement error.
5.1.5 Number of times error or help windows are displayed: if error or help windows are displayed very often for a specific interviewer, this may indicate that the interviewer is not familiar with the instrument and may be introducing measurement error.
5.1.6 Paradata about interviewers themselves, such as age, gender, pay grade, skills, experience, and personality traits: such information can be obtained from separate data collection exercises .
5.2 Monitor the selected indicators from the beginning of data collection, both at an aggregate level and an individual interviewer level. Based on the indicators, researchers can not only see trends (e.g., whether interviewers spend more or less time per completed interview as time goes on), but can also evaluate the performance of each interviewer .
5.3 Identify interviewers who deviate from standardized interviewing. If irregularities or falsified data are discovered for a specific interviewer, that interviewer should be flagged and appropriate interventions conducted.
5.4 Implement interventions such as providing more training, then monitor those interviewers' subsequent outcomes for signs of improvement.
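To make steps 5.2 and 5.3 concrete, the sketch below derives interview length from start and end timestamps, summarizes it per interviewer, and flags interviewers whose median length is well below the overall median. It is a minimal illustration, not a procedure taken from any survey cited here; the record layout, the interviewer IDs, and the 0.5 ratio used as a cutoff are hypothetical assumptions.

```python
from datetime import datetime
from statistics import median

# Hypothetical interview-level timestamp paradata: (interviewer_id, start, end).
interviews = [
    ("INT01", "2016-03-01 10:00", "2016-03-01 10:55"),
    ("INT01", "2016-03-02 14:00", "2016-03-02 14:50"),
    ("INT02", "2016-03-01 09:00", "2016-03-01 09:20"),
    ("INT02", "2016-03-03 11:00", "2016-03-03 11:18"),
    ("INT03", "2016-03-02 16:00", "2016-03-02 16:58"),
]

def duration_minutes(start, end, fmt="%Y-%m-%d %H:%M"):
    """Interview length in minutes, derived from start and end timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

def interviewer_medians(records):
    """Median interview length per interviewer (individual-level indicator)."""
    by_interviewer = {}
    for interviewer, start, end in records:
        by_interviewer.setdefault(interviewer, []).append(duration_minutes(start, end))
    return {k: median(v) for k, v in by_interviewer.items()}

def flag_short_interviewers(records, ratio=0.5):
    """Flag interviewers whose median length is below `ratio` times the overall median.

    The default ratio is an illustrative assumption, not a recommended threshold.
    """
    overall = median(duration_minutes(s, e) for _, s, e in records)
    return sorted(k for k, m in interviewer_medians(records).items() if m < ratio * overall)

print(interviewer_medians(interviews))
print(flag_short_interviewers(interviews))
```

Rerunning the same summary at regular intervals during fieldwork gives the trend view described in 5.2; a flag is a prompt for closer review, not evidence of falsification on its own.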
5.1 propose and evaluate a calculated interviewer performance indicator which can be used in all interviewer-administered survey modes. This indicator gives each interviewer a score reflecting their performance, giving more weight to successful interviews on difficult cases vs. relatively easy ones. Their paper reports that “calling-history paradata are the strongest predictors of obtaining interviews in both face-to-face and telephone interviews.”
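One simple way to give more credit for interviews obtained on difficult cases is to compare each case's outcome with its predicted response propensity (for example, from a model using calling-history paradata) and average the differences per interviewer. The sketch below assumes that formula; the indicator in the cited study may be constructed differently, and the propensities shown are made-up values.

```python
from statistics import mean

def performance_scores(cases):
    """Mean of (outcome - predicted propensity) over each interviewer's cases.

    cases: iterable of (interviewer_id, outcome, propensity), where outcome is 1 for
    a completed interview and 0 otherwise, and propensity is the case's predicted
    probability of completion. A completion on a hard case (low propensity) raises
    the score more than one on an easy case; a failure on an easy case lowers it more.
    """
    by_interviewer = {}
    for interviewer, outcome, propensity in cases:
        by_interviewer.setdefault(interviewer, []).append(outcome - propensity)
    return {k: round(mean(v), 3) for k, v in by_interviewer.items()}

# Hypothetical cases: interviewer A succeeds on hard cases, B on easy ones.
cases = [
    ("A", 1, 0.2), ("A", 1, 0.3), ("A", 0, 0.4),
    ("B", 1, 0.9), ("B", 1, 0.8), ("B", 0, 0.7),
]
print(performance_scores(cases))
```

Both interviewers complete two of three cases, but A scores higher because A's completions came on cases that were less likely to respond.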
5.2 One example of using paradata and other auxiliary data to monitor interviewer behavior is the Saudi National Mental Health Survey, where interviewers are evaluated in real time based on various indicators (question field time, failed verifications, long pauses, etc.). Interviewers are ranked based on these indicators, and the worst three are flagged and examined in detail .
5.3 The China Mental Health Survey uses auxiliary data including paradata to monitor the survey process. Interviews are flagged if (1) the response time for a certain number of variables is below a predetermined threshold, (2) the adjusted interview length is below a predetermined threshold, (3) the number of ‘don’t know’ or ‘refused’ responses is above a predetermined threshold, or (4) interviews are completed between 11pm and 7am. Interviewers are also flagged for further investigation if the number of interviews they completed is significantly higher than that of other interviewers, or if the distribution of selected variables deviates significantly from that of other interviewers .
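A ranking like the one described for the Saudi National Mental Health Survey can be approximated by standardizing each indicator across interviewers, summing the standardized values, and flagging the highest (worst) totals. The sketch assumes hypothetical indicator names, equal weights for all indicators, and that higher values are worse for every indicator; it is not the survey's actual scoring method.

```python
from statistics import mean, pstdev

# Hypothetical per-interviewer indicator counts; higher values are assumed to be worse.
indicators = {
    "I01": {"odd_field_times": 1, "failed_verifications": 0, "long_pauses": 1},
    "I02": {"odd_field_times": 8, "failed_verifications": 3, "long_pauses": 9},
    "I03": {"odd_field_times": 2, "failed_verifications": 0, "long_pauses": 2},
    "I04": {"odd_field_times": 7, "failed_verifications": 2, "long_pauses": 6},
    "I05": {"odd_field_times": 1, "failed_verifications": 1, "long_pauses": 1},
    "I06": {"odd_field_times": 6, "failed_verifications": 4, "long_pauses": 7},
}

def composite_scores(indicators):
    """Sum of z-scores across indicators for each interviewer (equal weights assumed)."""
    names = list(next(iter(indicators.values())))
    scores = {interviewer: 0.0 for interviewer in indicators}
    for name in names:
        values = [row[name] for row in indicators.values()]
        mu, sd = mean(values), pstdev(values)
        for interviewer, row in indicators.items():
            # An indicator with no variation carries no ranking information.
            scores[interviewer] += (row[name] - mu) / sd if sd > 0 else 0.0
    return scores

def flag_worst(indicators, n=3):
    """Interviewers with the n highest composite scores, worst first."""
    scores = composite_scores(indicators)
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(flag_worst(indicators))
```

Standardizing first keeps an indicator measured on a large scale (such as seconds of field time) from dominating indicators measured as small counts.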
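The interview-level rules above translate naturally into a checklist that returns the reasons an interview is flagged. In this sketch the field names and threshold values are hypothetical placeholders for the "predetermined thresholds" mentioned in the text; each survey would need to set and justify its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import time

# Hypothetical thresholds standing in for the survey's predetermined values.
FAST_ITEM_LIMIT = 10         # flag if this many items were answered implausibly fast
MIN_ADJUSTED_MINUTES = 30.0
MAX_DK_REFUSED = 15

@dataclass
class Interview:
    interview_id: str
    fast_items: int           # items whose response time fell below a per-item threshold
    adjusted_minutes: float   # interview length after the survey's own adjustments
    dk_refused: int           # number of 'don't know' plus 'refused' answers
    completed_at: time        # local clock time at completion

def flag_reasons(iv: Interview) -> list[str]:
    """Return the reasons (possibly none) for flagging one interview."""
    reasons = []
    if iv.fast_items >= FAST_ITEM_LIMIT:
        reasons.append("many implausibly fast responses")
    if iv.adjusted_minutes < MIN_ADJUSTED_MINUTES:
        reasons.append("short adjusted interview length")
    if iv.dk_refused > MAX_DK_REFUSED:
        reasons.append("many 'don't know'/'refused' answers")
    # The 11pm-7am window wraps past midnight, so check both ends separately.
    if iv.completed_at >= time(23, 0) or iv.completed_at < time(7, 0):
        reasons.append("completed between 11pm and 7am")
    return reasons

suspicious = Interview("X1", fast_items=12, adjusted_minutes=25.0, dk_refused=3,
                       completed_at=time(23, 30))
print(flag_reasons(suspicious))
```

Returning reasons rather than a single yes/no flag makes it easier for supervisors to decide which follow-up (callback verification, audio review, retraining) fits each case.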
5.4 In the Panel Study of Income Dynamics (PSID), audit trail data (ADTs) produced by the Blaise software, including interview length, average question count per interview, entry of question-level notes, use of question-level help, and backups within the interview, were used to monitor interviewer behavior .
5.5 To develop a “replacement mechanism for traditional interviewer evaluation techniques,” the NSFG explored the use of paradata in interviewer monitoring and evaluation. This concept was later adopted by the Health and Retirement Study, with case-level information being added and mean/median indicators being monitored. The paradata- and auxiliary data-related indicators used in the evaluation process included field time, error escapes, suppressions, jumps, backups, ‘don’t know’ and ‘refused’ responses, help key use, and remarks entered. Three indices were constructed using these nine indicators: “too fast,” “many error checks,” and “many ‘don’t know’ and ‘refused’”. Interventions made based on the evaluation results included extra practice interviews, re-training on proper interviewing techniques, increasing the number of verification interviews, and group re-training and investigation at case level .
5.6 In the Ghana Socioeconomic Panel Study, keystroke data from a computer-assisted personal interview (CAPI) were used to monitor and evaluate interviewers’ performance. Given the complexity of the survey instrument, a questionnaire ‘dashboard’ was designed to show the status of all the questionnaire sections and all the respondents within the household. This allowed interviewers to jump quickly to any section or block of questions and to switch respondents easily. Keystroke data were used to evaluate interviewers’ navigation patterns, and researchers also identified interviewers with a higher number of mid-section or mid-block exits (i.e., those who ‘jumped around’ too much). This increased movement within the questionnaire was also found to be related to increased interview length and cost. To reduce survey time and cost, appropriate interventions can be provided to the identified interviewers .
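Counting mid-section exits from keystroke paradata can be sketched as follows. The log format (interviewer, section, item number, in visit order) and the section sizes are assumptions made for illustration; real CAPI keystroke files are typically much richer and must be parsed first. A move to a different section counts as a mid-section exit when the previous item visited was not the last item of its section.

```python
from collections import Counter

SECTION_SIZES = {"A": 5, "B": 8, "C": 4}  # hypothetical number of items per section

def count_mid_section_exits(events):
    """Count, per interviewer, moves to a new section before the current one's last item.

    events: iterable of (interviewer_id, section, item_no) in the order visited,
    with item_no starting at 1. Events for different interviewers may be interleaved.
    """
    exits = Counter()
    last_position = {}  # interviewer -> (section, item_no) of their previous event
    for interviewer, section, item_no in events:
        prev = last_position.get(interviewer)
        if prev is not None:
            prev_section, prev_item = prev
            if section != prev_section and prev_item < SECTION_SIZES[prev_section]:
                exits[interviewer] += 1
        last_position[interviewer] = (section, item_no)
    return exits

log = [
    ("I1", "A", 1), ("I1", "A", 2), ("I1", "A", 3), ("I1", "A", 4), ("I1", "A", 5),
    ("I1", "B", 1),
    ("I2", "A", 1), ("I2", "A", 2), ("I2", "C", 1), ("I2", "C", 2),
    ("I2", "B", 1), ("I2", "A", 3),
]
print(count_mid_section_exits(log))
```

Because a `Counter` returns 0 for missing keys, interviewers who never left a section early can be compared directly with those who did.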
6. Use paradata and other auxiliary data to investigate measurement error in Web surveys, so that real-time intervention can be applied to reduce potential measurement error.
Paradata and other auxiliary data in Web surveys can identify the device used to complete the survey (e.g., desktop computer, laptop, tablet, or smartphone) and can document the entire process of filling out the questionnaire (e.g., how respondents navigate and how much time they spend on each question). Such navigation and timestamp paradata can be used to monitor respondents’ behavior during data collection. If the paradata indicate that some respondents are satisficing, special prompts can be provided to those respondents. For a more comprehensive review of the use of paradata in Web surveys, see .
6.2 Understand the types of paradata available in Web surveys. identifies two major types of Web survey paradata: (1) paradata that indicate the device respondents are using (device-type paradata) and (2) paradata that reveal the navigation process respondents use (navigation paradata).
6.2.1 Device-type paradata include:
- The browser used (user agent string).
- The operating system.
- The language of the operating system.
- The screen resolution.
- The browser window size.
- Whether Adobe Flash or other active scripting is enabled.
- The IP address of the device used to fill out the survey.
- The GPS coordinates at the time of beginning the interview and at the time of completion.
- Cookies (text files placed on the visitor’s local computer to store information about that computer and the page visited).
6.2.2 Questionnaire navigation paradata include:
- Authentication procedures.
- Mouse clicks.
- Mouse position per question/screen.
- Change of answers.
- Typing and keystrokes.
- Order of answering.
- Movements across the questionnaire (forward/backward) and scrolling.
- Number of appearances of prompts and error messages.
- Detection of the window currently in use.
- Whether the survey was stopped and resumed at a later time.
- Clicks on non-question links (e.g., hyperlinks, frequently asked questions (FAQs), and help options).
- Last question answered before stopping the survey.
- Time spent per question/screen.
6.3 Understand the privacy and ethical issues involved in collecting Web survey paradata. Some paradata can be used to identify respondents, such as IP addresses or email addresses. Researchers need to carefully protect such data .
6.4 Become familiar with the software used to collect Web survey paradata, such as Client-Side Paradata (CSP) by Dirk Heerwegh and User-ActionTracer (UAT) .
6.5 Identify the types of operating systems and devices respondents are using. This will inform researchers about measurement error, usability, and comparability issues related to different devices.
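As a rough illustration, device type and operating system can be inferred from the user agent string with simple substring rules. The rules below are a simplified heuristic assumed for this sketch; user agent strings vary widely and change over time (for example, recent iPads may report a desktop Mac user agent), so production work should rely on a maintained parser and, where possible, other device-type paradata such as screen resolution.

```python
from collections import Counter

def classify_device(user_agent: str) -> str:
    """Rough device class from a user agent string (heuristic, not exhaustive)."""
    ua = user_agent.lower()
    # Android tablets usually omit the "Mobile" token that Android phones include.
    if "ipad" in ua or ("android" in ua and "mobile" not in ua):
        return "tablet"
    if "iphone" in ua or "mobile" in ua:
        return "smartphone"
    return "desktop/laptop"

def classify_os(user_agent: str) -> str:
    """Rough operating system from a user agent string."""
    ua = user_agent.lower()
    # Order matters: iOS agents contain "like Mac OS X" and Android agents contain "Linux".
    for token, name in [("iphone", "iOS"), ("ipad", "iOS"), ("android", "Android"),
                        ("windows", "Windows"), ("mac os x", "macOS"), ("linux", "Linux")]:
        if token in ua:
            return name
    return "other"

agents = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3 like Mac OS X) AppleWebKit/601.1.46 "
    "(KHTML, like Gecko) Version/9.0 Mobile/13E233 Safari/601.1",
    "Mozilla/5.0 (Linux; Android 5.0.2; SM-T530) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/48.0.2564.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
]
print(Counter((classify_device(a), classify_os(a)) for a in agents))
```

Tabulating the device and OS mix this way is a first step toward the usability and comparability checks mentioned above, for example comparing item nonresponse by device class.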
6.6 Use questionnaire navigation paradata to better understand the process respondents are using to answer the Web survey. With detailed navigation paradata, researchers can even reconstruct the survey-taking experience .
6.7 Use navigation paradata to evaluate the quality of the Web survey. For example, respondents who rush through the survey may be more likely to provide poor quality data. Quality indexes can be calculated based on available navigation paradata. For instance, in a study aimed at improving establishment surveys, created a quality index based on client-side paradata. The index included the number of prompts, error messages, and data validation messages.
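A quality index of the kind described above can be computed by counting, for each respondent, the prompt, error-message, and validation-message events recorded in client-side paradata. The event labels and the equal weighting of the three event types are assumptions for this sketch, not the published index.

```python
from collections import Counter

# Hypothetical event labels used in a client-side paradata log.
QUALITY_EVENTS = {"prompt", "error_message", "validation_message"}

def quality_index(events):
    """Per-respondent count of prompt, error-message, and validation-message events.

    events: iterable of (respondent_id, event_type). Higher values suggest more
    difficulty with the instrument; other event types (e.g., clicks) are ignored.
    """
    index = Counter()
    for respondent, event_type in events:
        if event_type in QUALITY_EVENTS:
            index[respondent] += 1
    return index

events = [
    ("R1", "prompt"), ("R1", "click"), ("R1", "error_message"),
    ("R2", "validation_message"), ("R3", "click"),
]
print(quality_index(events))
```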
6.8 Use paradata and other auxiliary data to identify satisficing behaviors or obstacles respondents have in Web surveys so that real-time intervention can be applied to reduce potential measurement error. For example, if respondents take a long time to answer a question, they may be having difficulty understanding it, so a definition or clarification prompt can be provided. Similarly, if the paradata indicate that respondents are speeding through the questionnaire, a prompt can be designed to ask them to take their time and read the questions carefully (see ‘Lessons learned’ below).
6.1 Paradata can be used to guide interventions for quality control in Web surveys. To improve data quality, implemented an experiment based on how quickly respondents answered each question in a Web survey. If a respondent answered in less than the typical minimum reading time, the respondent was shown a message encouraging them to put more thought into the question. The study found that prompting increased completion time but reduced straightlining in grid questions, which is usually viewed as an indicator of satisficing .
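The speed-based prompting in this experiment can be sketched as a comparison between the observed response time and a minimum reading time derived from question length. The 300 milliseconds per word used below is an illustrative assumption, not the threshold used in the cited study, and the prompt wording is hypothetical.

```python
from typing import Optional

MS_PER_WORD = 300  # illustrative minimum reading time per word (assumption)

PROMPT = ("You seem to have answered very quickly. Please take your time and "
          "read the question carefully.")  # hypothetical prompt wording

def minimum_reading_ms(question_text: str) -> int:
    """Minimum plausible time to read a question, based on its word count."""
    return len(question_text.split()) * MS_PER_WORD

def prompt_for(question_text: str, response_ms: int) -> Optional[str]:
    """Return the prompt text if the answer came faster than the minimum reading time."""
    if response_ms < minimum_reading_ms(question_text):
        return PROMPT
    return None

question = "How satisfied are you with your life as a whole these days?"
print(minimum_reading_ms(question))  # 12 words x 300 ms = 3600
print(prompt_for(question, 1500))
```

In a live Web survey this check would run on client-side timings, so the prompt can appear before the respondent moves to the next screen.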
7. Use paradata and other auxiliary data to investigate measurement error in non-Web surveys.
Paradata and other auxiliary data such as behavior coding data can help researchers better understand interviewer behavior and response characteristics which, in turn, can be related to measurement error . For example, researchers have found that expressions of doubt or low confidence, such as “I don’t know” or “maybe,” are predictive of less accurate responses . For a detailed review of this topic, see .
7.1 Choose appropriate respondent paradata/auxiliary data as indicators of measurement error. Examples might include:
7.1.1 Response times: these can indicate potential problems in survey questions and can indirectly reveal respondent uncertainty. A large body of literature has evaluated the relationship between response time and measurement error (e.g., ).
7.1.2 Keystrokes and mouse clicks: these can be used to study respondents’ navigation, change of responses, the response process when respondents use different designs (e.g., dropdown lists vs. radio buttons), and the use of clarification features .
7.1.3 Behavior codes: researchers can use behavior codes to investigate the question-answering process and to study respondents’ uncertainty about their answers, which can be related to measurement error . Indicators of respondent uncertainty include:
- Verbal expressions of doubt and uncertainty, such as “I am not sure” .
- Nonverbal expressions of doubt and uncertainty, such as hesitation, raised eyebrows, or averted gaze .
- Paralinguistic and verbal cues, such as “Um…” .
- Answering too quickly .
- Answering too slowly .
- Changing responses .
7.1.4 Interviewer evaluations: interviewers are often asked to evaluate respondents’ level of cooperation at the end of the interview. This is one of the most straightforward ways of analyzing the relationship between paradata/auxiliary data and measurement error.
7.2 Choose appropriate interviewer paradata/auxiliary data as indicators of measurement error. Examples might include:
7.2.1 Behavior codes: researchers can use behavior codes to identify interviewer behavior deviating from standardized interviewing, which can be related to measurement error .
7.2.2 Interviewer vocal characteristics: these data, such as speech rate and pitch, are usually measured from audio recordings.
7.2.3 Interviewer characteristics and attitudes: interviewers’ demographic characteristics, their attitudes about the survey process, and their attitudes toward the project’s substantive topics can all contribute to measurement error .
7.3 Use paradata, especially keystroke data, to replicate respondents’ navigation processes and to investigate usability issues in CAI systems. For example, keystroke data can reveal whether respondents have used a special function key and whether they are having difficulty typing answers to open-ended questions.
7.3.1 To reduce measurement error, use paradata and other auxiliary data to improve survey questions: “various types of paradata such as question timings, keystroke files and audio recordings can provide an indication of respondent difficulty in answering survey questions” . Researchers can use paradata to detect survey questions with potential problems in the pretest and then to revise and improve the questionnaire.
7.4 Use paradata to improve the survey response process during data collection (real-time responsive design). For example, researchers can use paradata to detect situations where respondents may be having difficulty answering the questions (e.g., they take a long time to respond). Systems can provide tailored clarifications for those respondents .
7.5 Set standards for how these paradata and auxiliary data can be collected. See for methods of collecting paradata for the purpose of measurement error investigation.
7.6 Analyze paradata/auxiliary data based on the research question. As described in , the analysis of paradata needs to consider the following factors:
7.6.1 Determine units of analysis. Sometimes, the paradata obtained are nested in nature. For example, “response times, mouse clicks, keystrokes, verbal behaviors, and vocal characteristics are recorded for each action taken for each question item for each respondent, nested within each interviewer for a given survey” . Different systems of aggregation can be used to organize the data. See for detailed examples. Note that there is no single ideal way to choose the unit of analysis. Decisions must be made based on the specific research question.
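The sketch below shows one way of aggregating nested response-time paradata: summing action-level records to the question level, then averaging question-level totals within interviewers. The record layout and values are hypothetical. Averaging question-level totals directly gives respondents who answered more questions more weight, which is one reason the unit of analysis has to follow from the research question.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical action-level records: (interviewer, respondent, question, seconds).
actions = [
    ("INT1", "R1", "Q1", 4.0), ("INT1", "R1", "Q1", 2.0), ("INT1", "R1", "Q2", 3.0),
    ("INT1", "R2", "Q1", 8.0), ("INT1", "R2", "Q2", 6.0),
    ("INT2", "R3", "Q1", 5.0), ("INT2", "R3", "Q2", 7.0),
]

def to_question_level(records):
    """Sum action-level times into one total per (interviewer, respondent, question)."""
    totals = defaultdict(float)
    for interviewer, respondent, question, seconds in records:
        totals[(interviewer, respondent, question)] += seconds
    return dict(totals)

def to_interviewer_level(question_totals):
    """Average question-level totals within each interviewer."""
    grouped = defaultdict(list)
    for (interviewer, _respondent, _question), seconds in question_totals.items():
        grouped[interviewer].append(seconds)
    return {k: mean(v) for k, v in grouped.items()}

q_level = to_question_level(actions)
print(to_interviewer_level(q_level))
```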
7.6.2 Manage the data. The management of data can be different for each type of paradata .
- Response time: if response latencies (the time spent until the occurrence of an event) are of interest, researchers need to calculate “the differences in time from the beginning of an event to the end of the event” . As proposed by , four factors must be considered when analyzing response time:
(1) The validity of the data, which largely depends on the survey mode: in Web surveys, researchers must decide whether server-side or client-side paradata are more valid. For example, response time collected at the server-side includes upload and download time, which is generally longer than client-side data (also see Guideline 9.1.1).
(2) The presence of outliers, which may distort the results if kept in the analysis. Outliers can be defined in many ways; the most common definition is based on how many standard deviations a data point lies from the mean. Usually, once identified, outliers are excluded from analysis. An alternative is to replace them through imputation or with values from other cases.
(3) Whether the distribution is normal or skewed: if the latter, transformation of the data can be done to deal with the skewed distribution.
(4) Baseline adjustment: people have different cognitive abilities and may differ in their speed of talking, so it is natural that some respondents simply answer survey questions more quickly than others . When such differences are not of research interest, they add ‘noise’ to the response time data and increase measurement error. state that “to account for these differences, researchers can subtract a ‘baseline’ measurement calculated separately for each respondent from the average of response timings to multiple items external to (and usually administered prior to) the questions of interest from the response timing data for the questions of interest.”
- Keystrokes and mouse clicks: management of these two types of paradata depends on the level and unit of analysis. Researchers must decide whether to analyze them at the action, question, section, or survey level. Unlike other types of paradata, they are collected as dichotomies: ‘yes’ (1) or ‘no’ (0). recommend that researchers first evaluate whether a particular keystroke event occurs often enough to be analyzed statistically. If events are rare, similar events can be combined for analysis.
- Behavior codes: data can be obtained in different ways. The most detailed analysis is based on transcriptions of interviews; alternately, coders can listen to the interviews and code the data. As a first step, researchers must decide whether to code from transcriptions or by listening to recordings. Second, detailed and comprehensive coding instructions are needed to guide the coders. Usually, at least two coders are required to ensure the reliability of the coding process. After coding is complete, the data must be examined and unreliable codes removed from the analysis.
- Vocal characteristics can be obtained and processed using software like Praat.
- Interviewer evaluation data can often be analyzed directly. The most common data management issue related to interviewer evaluations is item nonresponse; methods such as multiple imputation can be considered to deal with it.
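The response-time steps described above (computing latencies, screening outliers, transforming skewed data, and adjusting for a respondent-level baseline) can be sketched as small functions. The event format, the default cutoff of three standard deviations, and the use of a natural log are assumptions for illustration; whether to transform before or after baseline adjustment is itself a design choice.

```python
import math
from statistics import mean, stdev

def latencies(events):
    """Latency per item as end minus start time (seconds).

    events: iterable of (item_id, start_seconds, end_seconds), ideally client-side timestamps.
    """
    return {item: end - start for item, start, end in events}

def drop_outliers(values, k=3.0):
    """Keep values no more than k sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

def log_transform(values):
    """Natural log to reduce right skew; latencies must be positive."""
    return [math.log(v) for v in values]

def baseline_adjust(target_times, baseline_times):
    """Subtract one respondent's mean time on external baseline items from each target time."""
    baseline = mean(baseline_times)
    return [t - baseline for t in target_times]

times = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 30.0]
print(drop_outliers(times))         # k=3: the 30-second value survives
print(drop_outliers(times, k=2.0))  # k=2: it is removed
```

With few observations, a single extreme value inflates the standard deviation enough to partly mask itself, which is why the three-standard-deviation rule keeps the 30-second value here; the cutoff needs judgment rather than a fixed default.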
7.6.3 Use statistical analysis to answer the research questions. Based on the research questions, various methods of analysis are available. For example, response latencies can be analyzed by comparing mean time spent to answer specific questions. Statistical models such as survival models or logistic regression models can be employed to evaluate response latencies and other paradata such as keystroke data. In addition, based on the research questions, paradata can be used as either independent or dependent variables in the analysis .
7.1 studied the relationship between response time and response error, collecting response time via audio recordings. They found that the longer a respondent spent answering a question, the less likely the respondent was to give a correct answer.
7.2 studied factors influencing response time. They found that many things can affect response time, including both item-level characteristics (e.g., the length and difficulty of a question) and respondent characteristics (e.g., age, education, and experience with the Internet).
7.3 Response time can be studied not only at the individual question level but also at the level of the entire questionnaire. evaluated the relationship between measurement error and the time respondents spent completing a questionnaire. They reported more satisficing behavior among low-education respondents in the fastest group.
7.4 Using various paradata, evaluated whether the use of mobile devices (tablets and smartphones) affects survey data quality in Web surveys. They found that mobile device users “spent more time than desktop/laptop users to answer the survey.” Interviews were longer when a smaller-screen device was used, and acquiescence was more likely when the screen was large.
C. Coverage and sampling error
8. Use paradata and other auxiliary data to investigate and reduce coverage and sampling errors.
Coverage error is related to the quality of the sampling frame from which respondents are selected. Various auxiliary data, including paradata, can be used to evaluate the quality of sampling frames. One way to study coverage error is to use flag files in the U.S. Postal Service’s Delivery Sequence File, where addresses can be flagged as vacant, institutional, or seasonal . Another common source of auxiliary data is geocoding, often used to construct the frame when postal geographies do not match census geographies (e.g., in the U.S.). The types of paradata and auxiliary data available for studying coverage error depend on the frames used and, sometimes, on the sampling procedures. In random route sampling, which combines frame construction and sampling into one step, paradata such as contact history can be used to study not only coverage error but also sampling error. See for a detailed discussion.
8.1 Clearly define the research questions regarding the investigation of coverage and sampling error.
8.2 Based on the research question, select paradata or other auxiliary data that can indicate coverage error or sampling error. The types of paradata/auxiliary data available depend on the sampling frame. As described by , examples include:
8.2.1 Postal delivery databases: in some countries, centralized postal registers are available and can be used as sampling frames for housing units. Various studies have evaluated the quality of such frames (e.g., ). Sometimes the address is only a mailbox, which works for mail surveys but not face-to-face surveys. In the United States, the U.S. Postal Service’s Delivery Sequence File, which contains auxiliary data such as address flags, can be used to construct the frames. The flags indicate whether the address is vacant, a dormitory, or seasonal housing. When geocoding is used to construct sampling frames, it can also generate paradata related to coverage error; for example, geocoding software can report how precisely an address is geocoded.
8.2.2 Housing unit listing is often done by interviewers when no postal address information is available. Interviewers are sent to selected areas to list housing units in order to create the frame for future sample selection. Paradata and other auxiliary data are closely related to the process of listing. In dependent listing, interviewers are provided with a map with an initial list of addresses, referred to as an input list. Interviewers are asked to delete the inappropriate units and to confirm the correct units on the list . Paradata or other auxiliary data, such as whether a housing unit is from the input list or was added by the interviewer, can be collected for analysis. Interviewer observations on the quality of the list can also be collected in the listing process, indicating possible coverage errors. Much more paradata can be collected when technology is used in the listing process. For example, if a laptop or smartphone is provided for listing, various paradata such as time spent listing, keystroke data, and even GPS data can be collected.
8.2.3 Random route sampling: this procedure combines frame construction and sampling. Similar to housing unit listing, if technology is used, time spent for listing and sampling, keystroke data, and GPS data can be collected for studying coverage error and sampling error.
8.2.4 Missed unit procedures: to prevent undercoverage in household surveys, procedures such as the half-open interval procedure are used to find and select units missing from the sampling frame. Flags for these added cases can be collected to evaluate coverage error.
8.2.5 Telephone frames can suffer from overcoverage when randomly selected phone numbers belong to ineligible units (e.g., business telephone numbers). Undercoverage can occur when some people have no telephone. propose a list-assisted methodology where “banks of 100 consecutive numbers are assigned a score reflecting how many numbers in that bank also appear in the directory of listed phone numbers” . Auxiliary data from the frame construction process, such as the bank-level score, can be used to study coverage error.
8.2.6 Household rosters: this step of sampling is essential when the survey unit is an individual but the available frames are at the household level. Various methods can be used for within-household selection, such as soonest birthday, oldest male/youngest female, or a random selection based on a full roster of all members of the household . Paradata collected in the roster process, such as those that record interviewer behavior, can be used to detect errors related to household rosters.
8.2.7 Population registers: an alternative method of selecting respondents is to directly use population registers. Few countries have population registers, but for those that do, they are an appealing way for survey researchers to draw a sample. In Sweden, registers include the time that each record is updated, which can be used as an indicator of precision in the records. In some countries, the registers are not centralized. For example, in Germany, each community has its own registers. In this case, interactions with the organizations at each community can be collected to study coverage error .
8.2.8 Subpopulation frames: in some cases, the target population is not the entire population (e.g., a survey of adult females only). Usually, a screening interview is required to filter out ineligible cases when no register data is available. In this case, paradata indicating the process of the screening interviews can be used to study coverage error.
8.2.9 Web surveys: despite their increasing popularity, Web surveys are prone to coverage error. Samples for Web surveys are commonly drawn and recruited using mail, telephone, and in-person methods. To help those who do not know how to use computers or access a website, researchers sometimes provide training programs. Auxiliary data indicating which cases were provided with training can be used to study whether including such cases reduces undercoverage bias .
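For the list-assisted telephone frame described in 8.2.5, the bank-level score can be computed by grouping directory-listed numbers by their first eight digits (the 100-bank) and counting them. The sketch assumes 10-digit North American numbers and uses made-up numbers; the resulting score can then be attached to each sampled number as auxiliary data for coverage analysis.

```python
from collections import Counter

def bank_id(phone: str) -> str:
    """A 100-bank is identified by all but the last two digits of a 10-digit number."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected a 10-digit number, got {phone!r}")
    return digits[:8]

def bank_scores(listed_numbers):
    """Count how many directory-listed numbers fall in each 100-bank."""
    return Counter(bank_id(n) for n in listed_numbers)

# Hypothetical directory-listed numbers.
listed = ["734-555-0101", "734-555-0150", "734-555-0299", "313-555-0100"]
scores = bank_scores(listed)

# Attach the bank score to a sampled number as auxiliary data.
sampled = "734-555-0177"
print(sampled, scores[bank_id(sampled)])
```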
8.3 Set standards for how these paradata or auxiliary data can be collected. The collection methods may differ based on the survey mode and the types of frames used. For example, some auxiliary data can be obtained from organizations that provide the frames (e.g., the organizations at each community in Germany who keep the registers). Some other auxiliary data including paradata, such as computer-generated paradata, need to be obtained using specific software.
8.4 Collect the paradata or auxiliary data and analyze them.
8.5 Use statistical methods to evaluate coverage or sampling error using paradata or other auxiliary data. Similar to using paradata to investigate measurement errors, statistical models such as survival models or logistic regression models can be employed to evaluate the relationship between paradata (or other auxiliary data) and coverage and sampling error. In addition, based on research questions, paradata/auxiliary data can be used as either independent or dependent variables in the analysis.
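As a simple example of using paradata as an independent variable, a flag identifying cases added through a missed unit procedure can be used to compare vacancy rates between added cases and cases already on the frame. The sketch below uses a two-proportion z-test on hypothetical counts; it ignores clustering and weighting, so a design-based test or a logistic regression would be needed for a real complex sample.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic and two-sided p-value for H0: p1 == p2, using a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: vacant units among added cases vs. cases on the original frame.
z, p = two_proportion_z(x1=30, n1=120, x2=150, n2=1500)
print(f"z = {z:.2f}, p = {p:.2g}")
```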
8.1 compared the eligibility of cases added by interviewers via a missed unit procedure to those on the original frame. They found that the units added through the missed unit procedure were more likely to be vacant units than cases already on the frame.
8.2 used contact history records and compared cases selected through random walks and through a population register. They found that fewer calls were needed in the random walk sample to complete an interview. It is likely that interviewers select those cases which are easier to reach, which may lead to coverage and sampling error.
8.3 used keystroke files to study the process that interviewers use to take rosters. This study found that sometimes interviewers went back after the respondent selections were finished (e.g., to change the number of household members). The author interprets this as interviewers trying to select a more cooperative household member. Those who are excluded from the rosters are undercovered on the frames of household members.
D. Quality of paradata and other auxiliary data
9. Develop a quality control framework for paradata and other auxiliary data.
Data quality is known to have a critical impact on final survey statistics. Results can be biased when the variables used in analysis are subject to measurement error, and this is also true of paradata and other auxiliary data. Poor-quality paradata and auxiliary data can lead to biased estimates in post-survey analysis, such as biased nonresponse adjustments . During data collection, if paradata or other auxiliary data prone to measurement error are used in a responsive design to guide intervention decisions, the interventions are likely to be less useful than expected, survey costs are likely to increase rather than decrease, and survey data quality is likely to decline . Therefore, developing a quality control framework for paradata/auxiliary data collection and analysis is of critical importance to researchers who would like to make good use of such information.
9.1 Review the characteristics and collection procedures of different types of paradata and auxiliary data, and understand the factors that affect their quality.
9.1.1 Computer-generated paradata:
- Very few peer-reviewed studies have examined the quality of computer-generated paradata. In general, it is thought that they are of good quality.
- Although technical issues are rare , they do happen and could lead to failure to collect automatically generated paradata .
- As mentioned in Guideline 6, paradata can be collected on the server side and/or the client side, and the quality of data collected on each side may differ. Take response time as an example: point out that, unlike client-side paradata, “server-side response time measures are captured by the server sending the Web survey to the respondent’s computer and collecting the data, and therefore include uploading and downloading times.” Thus, for the same questions, response times collected on the server side will likely be longer than those collected on the client side. This is consistent with findings from .
- Paradata involving the interaction between respondents and the computer may be prone to measurement error. For example, as mentioned in , problems with response timings are more common in computer-assisted telephone interviews (CATI) using an automatic voice sensor, where the sensors may be ‘tricked’ by respondents asking questions or engaging in other vocal behaviors that do not represent answering a question, such as thinking aloud. Such response time paradata can be prone to measurement error.
9.1.2 Interviewer-generated paradata:
- In computer-assisted interviews, paradata are usually entered directly into the computer system. In paper-and-pencil surveys, however, interviewer-generated paradata are recorded on paper and then entered into a computerized dataset, and data entry errors can be introduced during this process.
- Interviewer-recorded paradata can be subject to errors. For example, previous literature found that interviewers are likely to underreport call attempts in CAPI data collection, such as failing to report ‘drive-by’ visits .
9.2 For computer-generated paradata, develop procedures to ensure all programming works as designed and to prevent the occurrence of technical issues. Develop clear protocols on how to construct paradata-derived indicators.
9.3 For interviewer-generated paradata, develop clear protocols regarding how to record paradata, and protocols for coders on how to code interviewer-generated paradata in the dataset. In the training process, provide clear instructions to interviewers and coders on the collection and coding of paradata.
9.4 Develop quality examination procedures for different types of paradata. Previous research has introduced different methods of examining the quality of paradata, such as using “‘gold standard’ validation data…, as well as indirect indicators of the quality, including reliability, ease of collection, and missing data issues.” For more information, refer to .
9.5 If any paradata is found to be highly prone to measurement error, introduce interventions in future waves or surveys to improve data quality.
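One of the indirect quality indicators mentioned in 9.4 is missing data. A minimal sketch, assuming paradata records are stored as dictionaries and that a missing value is either absent or `None`, computes the share of records missing each paradata variable; the variable names used here (`contact_outcome`, `neighborhood_obs`) are hypothetical.

```python
from typing import Dict, List, Optional


def missing_rates(records: List[Dict[str, Optional[object]]],
                  variables: List[str]) -> Dict[str, float]:
    """Share of records in which each paradata variable is absent or None."""
    if not records:
        raise ValueError("no paradata records supplied")
    rates: Dict[str, float] = {}
    for var in variables:
        n_missing = sum(1 for rec in records if rec.get(var) is None)
        rates[var] = n_missing / len(records)
    return rates


# Hypothetical contact-level paradata for four cases.
records = [
    {"contact_outcome": "interview", "neighborhood_obs": "quiet"},
    {"contact_outcome": None, "neighborhood_obs": "busy"},
    {"contact_outcome": "refusal"},
    {"contact_outcome": "noncontact", "neighborhood_obs": None},
]
print(missing_rates(records, ["contact_outcome", "neighborhood_obs"]))
# {'contact_outcome': 0.25, 'neighborhood_obs': 0.5}
```

Tracking such rates by interviewer or by country, rather than only overall, can help show where recording protocols need reinforcement.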
9.1 In a study comparing interviewer observations on housing status with respondent self-reports, “accuracy rates for interviewer observations ranged from 46% for privately rented dwellings to 89% for owner-occupied dwellings.” This implies non-negligible measurement error in interviewer observations.
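Accuracy rates like these can be computed by comparing each interviewer observation with a benchmark. The sketch below assumes, as an illustration only, that the respondent's self-report is treated as the benchmark and that accuracy is the share of cases within each self-reported category where the interviewer's observation matches; the cited study's exact definition may differ, and the data are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_category(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """pairs: (interviewer_observation, self_report).

    Returns, for each self-reported category, the share of cases in which
    the interviewer's observation agrees with the self-report.
    """
    totals: Dict[str, int] = defaultdict(int)
    matches: Dict[str, int] = defaultdict(int)
    for observed, reported in pairs:
        totals[reported] += 1
        if observed == reported:
            matches[reported] += 1
    return {cat: matches[cat] / totals[cat] for cat in totals}


# Hypothetical housing-status observations versus self-reports.
pairs = [
    ("owner", "owner"), ("owner", "owner"), ("owner", "owner"),
    ("rented", "owner"), ("owner", "rented"), ("rented", "rented"),
]
print(accuracy_by_category(pairs))  # {'owner': 0.75, 'rented': 0.5}
```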
10. Document how paradata and other auxiliary data are collected and the steps used to construct the paradata/auxiliary data-based indicators.
The collection and construction of paradata and other auxiliary data are not usually documented in surveys. Documentation gives secondary data users a clear picture of the data collection and variable construction process, which helps them use and analyze paradata and auxiliary data more efficiently.
10.1 Document how each type of paradata and auxiliary data is collected, such as whether timestamp data is collected through the server side or the client side.
10.2 If any indicator is constructed based on paradata, such as response time calculated from timestamp data, document clearly how those variables are constructed.
10.3 Document the coding procedure, if any, for interviewer-generated paradata, such as open-ended descriptions of the neighborhood.
10.4 If paradata is used in a responsive design, document clearly how paradata is used to monitor and inform intervention in the data collection process.
10.5 Document clearly how the paradata provided can be linked to the main survey data. Provide instructions on how to use the paradata-derived indicators. Ensure that respondent confidentiality is maintained whenever paradata files are made available for use.
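Guidelines 10.2 and 10.5 ask for documentation of how indicators are constructed and how they link to the survey data. The sketch below assumes a hypothetical client-side event log of `(case_id, item, event, time_ms)` tuples with `display` and `submit` events, and a survey file keyed by the same anonymized `case_id`; these names and the log layout are illustrative assumptions. Writing the construction rule down as code is one way to make it unambiguous for secondary users.

```python
from typing import Dict, List, Tuple

# Hypothetical event log entry: (case_id, item, event, time_ms)
Event = Tuple[str, str, str, int]


def response_times(log: List[Event]) -> Dict[Tuple[str, str], int]:
    """Response time per (case_id, item) = 'submit' time minus 'display' time.

    Construction rule (document it!): if an item is displayed more than
    once, only the last display-to-submit interval is kept.
    """
    shown: Dict[Tuple[str, str], int] = {}
    rts: Dict[Tuple[str, str], int] = {}
    for case_id, item, event, t in sorted(log, key=lambda e: e[3]):
        key = (case_id, item)
        if event == "display":
            shown[key] = t
        elif event == "submit" and key in shown:
            rts[key] = t - shown[key]
    return rts


def link_to_survey(survey: Dict[str, dict],
                   rts: Dict[Tuple[str, str], int]) -> Dict[str, dict]:
    """Attach rt_<item> indicators to copies of survey records via case_id."""
    linked = {cid: dict(rec) for cid, rec in survey.items()}
    for (cid, item), rt in rts.items():
        if cid in linked:  # paradata for cases without survey data is dropped
            linked[cid][f"rt_{item}"] = rt
    return linked


log = [("c1", "q1", "display", 0), ("c1", "q1", "submit", 4000),
       ("c1", "q2", "display", 4100), ("c1", "q2", "submit", 6100),
       ("c2", "q1", "display", 0), ("c2", "q1", "submit", 2500)]
survey = {"c1": {"q1": 3, "q2": 5}, "c2": {"q1": 1}}
print(link_to_survey(survey, response_times(log)))
```

Keeping only the last visit is a design choice; summing all visits to an item is equally defensible, which is exactly why the rule should be stated in the documentation. The linked file should carry only the anonymized case ID, not direct identifiers.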
10.1 The provides detailed documentation of the paradata collected.
10.2 The provides a well-documented introduction to the contact form data it releases for many countries.