In recent years, the number and scope of multinational, multicultural, or multiregional surveys, which we refer to as “3MC” surveys, have increased dramatically. With the increased availability of large datasets covering multiple countries, such as the European Social Survey (ESS) and the Survey of Health, Ageing and Retirement in Europe (SHARE), more researchers have become engaged in analyzing these data (Davidov, Schmidt, & Billiet, 2011). Not surprisingly, there has been increased interest in the development of statistical tests appropriate to cross-cultural survey data analysis. This chapter aims to provide a comprehensive introduction to different statistical methods, from basic statistics to advanced modeling approaches. Note that this chapter does not aim to teach statistics, but rather to provide an overview of what statistical tests are available and when to apply them in 3MC research. We also provide links and references to each statistical method for those who would like additional detail.
1.1 Types of variables
The classification of variable types is important because it helps to determine which statistical procedure should be used. For example, when the dependent variable is continuous, a linear regression can be applied (see Guideline 2.2); when it is categorical (binary), a logistic model can be applied (see Guideline 3.2); when it is categorical (nominal or ordinal), multinomial or ordinal logistic regressions may be used (see Guideline 3.3). If, in latent variable models (see Guideline 6), the latent variable is continuous, a Confirmatory Factor Analysis (CFA) or Item Response Theory (IRT) model can be used (see Guidelines 6.1 and 6.5). Table 1 and Table 2 below list the choices of regression and latent variable measurement models, according to the variable types of the dependent and independent variables.
Several commonly used variable types are listed below:
 Nominal variables: Variable values assigned to different groups. For example, respondent gender may be “male” or “female”.
 Ordinal variables: Categorical variables with ordered categories. For example, “agree,” “neither agree nor disagree,” or “disagree”.
 Continuous variables: Variables which take on numerical values that measure something. “If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable” (Rovai, Baker, & Ponton, 2013). Continuous variables are understood to have equal intervals between each adjacent pair of values in the distribution. Income is an example of a continuous variable.
 Discrete (ratio) variables: “A discrete variable can only take on a finite value, typically reflected as a whole number” (Randolph & Myers, 2013). These variables have an absolute “zero” value. One example is the number of children a person has.
Table 1. Variable type and choices of regression models
Table 2. Variable type and choices of latent variable measurement models (see Guideline 6 for more detail)
1.2 The distribution of variables
1.2.1 Graphical illustrations of distributions
It is commonly recommended to look at graphical summaries of both continuous and categorical distributions before fitting any models. Details of the graphical options listed below can be found in this online statistics book: Online Statistics Education: An Interactive Multimedia Course of Study.
 For categorical variables:
 Bar graphs
 Pie charts
 For continuous or discrete variables:
 Stem and Leaf Plots
 Histograms
 Box plots
 For any type of variable:
 Frequency distributions
In 3MC data analysis, to get a direct visual comparison, researchers can plot distributions by country or racial group.
1.2.2 Numerical summaries of distributions
A distribution can be summarized with various descriptive statistics. The mean and median capture the center of a distribution (central tendency) while the variance describes the distribution spread or variability (see online book material).
 Mean: The average of a number of values. It is calculated by adding up the values and dividing by the number of values (how many values there are).
 Median: The “…median is the number separating the higher half of a data sample, a population, or a probability distribution, from the lower half” (Reviews, 2013). For a highly skewed distribution, the median may be a more appropriate measure of central tendency than the mean. For example, the median is more widely used to characterize income, since potential outliers (e.g., those with very high incomes) have much more impact on the mean.
 Variance: Variance is a measure of the extent to which a set of numbers are “spread out”.
 Precision: Precision is the reciprocal of the variance and is most commonly seen in Bayesian analysis (see Guideline 9).
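The summaries above can be illustrated with a minimal sketch using Python's standard statistics module. The income values are hypothetical and include one deliberately extreme value, to show how the mean is pulled toward an outlier while the median is not.

```python
import statistics

# Hypothetical monthly incomes; the last value is a deliberate outlier.
incomes = [1800, 2100, 2300, 2500, 2700, 3000, 25000]

mean_income = statistics.mean(incomes)          # sum of values / number of values
median_income = statistics.median(incomes)      # middle value of the sorted data
variance_income = statistics.variance(incomes)  # sample variance (n - 1 denominator)
precision_income = 1 / variance_income          # reciprocal of the variance

print(f"mean      = {mean_income:.1f}")
print(f"median    = {median_income}")
print(f"variance  = {variance_income:.1f}")
print(f"precision = {precision_income:.3e}")
```

In this made-up sample the mean is more than twice the median, which is why the median is often preferred for skewed variables such as income.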
1.3 Suggested reading
 Tests of the equality of two means: Online material link
 Van de Vijver and Leung (1997)
 Braun and Johnson (2010)
1.4 Potential uses in 3MC research
 A good starting point of an analysis is to look at the distributions of variables of interests and the graphical illustrations of the variables in each cultural group.
 One way of comparing survey estimates across various cultures is to directly compare mean estimates. A two-sample t-test can be used to evaluate the equality of two means (see Guideline 1.3). However, researchers need to be aware that the observed mean differences are not necessarily equal to the latent construct mean differences (see Guideline 6) and direct comparison using observed mean differences may lead to invalid results (see Braun & Johnson, 2010). In addition, factors irrelevant to the question content, such as response style differences in different cultures, may influence the comparability across cultures. More advanced models (such as latent variable models) can be used to evaluate and control for these factors.
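As a rough illustration of comparing two country means, the sketch below computes the test statistic and approximate degrees of freedom for Welch's version of the two-sample t-test, which does not assume equal variances. The function name and the scores are hypothetical; a p-value would additionally require the t distribution from a statistics package, and real survey data would also need to account for weights and the sample design.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's two-sample t-test on two lists of numbers."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error contributed by each group.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical item scores from respondents in two countries.
country_a = [3, 4, 4, 5, 5, 5, 6]
country_b = [2, 3, 3, 4, 4, 4, 5]

t_stat, df = welch_t(country_a, country_b)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```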
2. Simple and Multiple Linear Regression Models
2.1 Bivariate relationships
A bivariate relationship is the relationship between two variables. For example, one may be interested in knowing how height is associated with weight (i.e., whether those who are taller tend to weigh more). Basic information about bivariate relationships can be found here: link.
 Scatterplots
Before running any models, a scatterplot is essential to explore the associations (negative or positive) between variables.
 Correlations between variables
Pearson's correlation is the most commonly used method of evaluating the relationship between two variables. Refer to this website for more information: website link.
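To make the definition concrete, the following sketch computes Pearson's correlation from its textbook formula (the covariance of the two variables divided by the product of their standard deviations). The function name and the height and weight values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (cm) and weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 84]

print(f"r = {pearson_r(heights, weights):.3f}")
```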
Linear regression models can allow researchers to predict one variable using other variable(s). The dependent variable in linear regression models is a continuous variable. Basic information about simple linear and multiple regression models can be found here: link.
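A simple linear regression with one predictor can be sketched with the ordinary least squares formulas, as below. The function name and data are hypothetical, and statistical software would additionally report standard errors, tests, and diagnostics.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical heights (cm) used to predict weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 84]

intercept, slope = fit_simple_ols(heights, weights)
predicted = intercept + slope * 175  # predicted weight for a height of 175 cm
print(f"intercept = {intercept:.2f}, slope = {slope:.3f}, prediction = {predicted:.1f}")
```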
 ANOVA table
In the output of regression model results, an Analysis of Variance (ANOVA) table is usually provided. It “consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance” (Filler & DiGabriele, 2012).
 Resource 1 link.
 Resource 2 link.
 An example of a regression model results output using Stata: link.
 Dummy predictor variables
As described by Skrivanek (2009), a dummy variable or indicator variable is an artificial variable created to represent an attribute with two or more distinct categories/levels. If a categorical variable is added to the regression models directly, without being specially specified, the software will treat it as continuous. However, the differences between the categories (e.g., category 2 minus category 1) do not have an actual meaning. Dummy variables are usually created in this situation to make sure that such categorical variables are correctly specified in the model.
For example, in 3MC data analysis, to compare Country A to Country B on the level of the dependent variable, one can create a country dummy variable, using one of the countries as a reference group, and add it as an independent variable to the model. When multiple countries exist, one can use one of the countries as the reference category, and treat the variable as categorical in the model (Piccinelli & Simon, 1997).
 For information on dummy variables and how they are created and used, see Skrivanek (2009).
 For information on regression models with categorical predictors using SAS, see link.
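The dummy coding described above can be sketched as follows. The function name, the prefix argument, and the country labels are hypothetical; most statistical packages create these indicators automatically when a variable is declared categorical.

```python
def dummy_code(values, reference, prefix="country"):
    """Return one dict of 0/1 indicators per observation, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return [
        {f"{prefix}_{level}": int(value == level) for level in levels}
        for value in values
    ]

# Hypothetical country labels; country "A" serves as the reference category.
countries = ["A", "B", "C", "A", "B"]

for country, indicators in zip(countries, dummy_code(countries, reference="A")):
    print(country, indicators)
```

With three countries, two indicators are created; a respondent from the reference country has zeros on both, so each regression coefficient is interpreted as a difference from the reference country.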
 Interactions of predictor variables
Sometimes a regression model is used to test whether the relationship between the dependent variable (DV) and one specific independent variable (IV) depends on another IV. To test this, an interaction term between the two IVs can be added to the model.
 Transformations of variables
When nonlinearity is found for predictors, transformations may be considered to “normalize” a variable which has a skewed distribution. For more detail, see LaLonde (2005).
 Lack of fit testing
Various techniques are available to test for lack of fit in regression models, including visual methods (e.g., plots) and numerical methods (e.g., F tests).
 Model diagnostics
Techniques are available to test the appropriateness of the model and whether the model assumptions hold.
 Selecting reduced regression models (variable selection)
Techniques are available for determining the model that contains the most appropriate independent variables, giving the maximum R² value.
2.3 Suggested reading
 Applied Statistical Analysis and Data Display: An Intermediate Course with Examples in S-PLUS, R, and SAS (Heiberger & Holland, 2004).
 Statistical Methods, 8th ed (Snedecor & Cochran, 1994).
 The Little SAS Book, 4th ed (Delwiche & Slaughter, 2012).
2.4 Potential uses in 3MC research
 As in linear regression models, a country variable / indicator can be added to the regression model as a covariate (e.g., Piccinelli & Simon, 1997).
3.1 Analysis of two-way tables
Categorical data are often displayed in a two-way table. Sometimes, one or both variables are continuous. If so, the continuous variable(s) can be categorized into groups. A two-way table can then be constructed using the new variables. Note that this approach may lead to a loss of information on the continuous variables. See also online material.
3.1.1 Pearson chi-square
The Pearson chi-square test evaluates whether the row and column variables in a two-way table are associated.
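As an illustration, the sketch below computes the Pearson chi-square statistic for a two-way table by comparing each observed count with the count expected if rows and columns were independent. The function name and the counts are hypothetical; the statistic would then be compared with a chi-square distribution with the returned degrees of freedom.

```python
def pearson_chi_square(table):
    """Return (chi2, df) for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Hypothetical counts: rows are two countries, columns are "agree" / "disagree".
counts = [[20, 10],
          [10, 20]]

chi2, df = pearson_chi_square(counts)
print(f"chi-square = {chi2:.3f}, df = {df}")
```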
3.1.2 Odds ratios (OR) and relative risks (RR)
OR and RR summarize the association in a contingency table by comparing, respectively, the odds and the probabilities of an outcome between groups. See Sistrom and Garvan (2004) for a comprehensive introduction.
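For a 2x2 table, both measures follow directly from the cell counts, as in this minimal sketch. The function names, the layout of the table (rows as groups, columns as outcome present or absent), and the counts are assumptions made for illustration.

```python
def relative_risk(a, b, c, d):
    """RR for a 2x2 table: group 1 has a events and b non-events; group 2 has c and d."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR for the same 2x2 table: odds of the event in group 1 over odds in group 2."""
    return (a * d) / (b * c)

# Hypothetical counts: 20 of 100 respondents in group 1 and 10 of 100 in group 2
# report the outcome.
a, b, c, d = 20, 80, 10, 90

print(f"RR = {relative_risk(a, b, c, d):.2f}")
print(f"OR = {odds_ratio(a, b, c, d):.2f}")
```

The two measures are close when the outcome is rare and diverge as it becomes more common.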
3.1.3 Loglinear models
Loglinear models are commonly used to model the cell counts of contingency tables, such as two-way tables.
Logistic regression models can be used when the dependent variable is a binary categorical variable. The technique allows researchers to model or predict the probability that an individual will fall into one specific category, given other independent variables. Logistic regression is a type of generalized linear model, where the logit of the probability of falling into one category is expressed as a linear function of the predictors. Thus, as in linear regression models, the predictors can include both continuous and categorical variables.
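Written out, the model takes the following form. The notation is introduced here for illustration only: p_i is the probability that respondent i falls into the category of interest, x_{i1}, ..., x_{ik} are the predictors, and the betas are the regression coefficients.

```latex
\operatorname{logit}(p_i) = \ln\!\left(\frac{p_i}{1 - p_i}\right)
  = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
\qquad
p_i = \frac{1}{1 + \exp\!\left[-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})\right]}
```

Under this form, the exponential of a coefficient can be read as the odds ratio associated with a one-unit increase in the corresponding predictor, holding the other predictors constant.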
3.3 Multinomial and ordinal logistic regressions
When the DV is a nominal variable, a multinomial logistic regression model can be used. If the DV is an ordinal variable, an ordinal logistic regression can be used.
3.5 Potential uses in 3MC research
 To evaluate responses to a categorical variable across two different cultures, one can construct a two-way table using the categorical variable and the country indicator as the rows and columns. A Pearson chi-square test can be used to evaluate whether the variable differs by culture.
 As in logistic regression models, a country variable / indicator can be added to the logistic regression model as a covariate.
Multilevel models are usually used when there is a hierarchical structure, such as when sampling units are nested in geographical areas (e.g., cluster sampling) or when units are observed repeatedly in longitudinal studies. Multilevel models are also known as hierarchical linear models, mixed models, random effects models, and variance components models. The Centre for Multilevel Modelling at the University of Bristol offers a free online course on multilevel modeling. See this online material for more information. Additional information on multilevel modeling can be found at link, and in van de Vijver, van Hemert and Poortinga (2008).
When many cultural groups are present, a multilevel model framework can be used, with country treated as a random variable. Multilevel models with latent variables can also be run, such as multilevel structural equation models (MLSEM), as discussed by Cheung (2006) and Fischer (2009). See Guideline 6.4 for more information on SEM models.
4.1 Suggested reading
 Bryan and Jenkins (2015)
 Gill and Womack (2013)
 Merlo, Chaix, Yang, Lynch, & Råstam (2005)
 West and Galecki (2011)
4.2 Potential uses in 3MC research
 Many 3MC studies have a multilevel data structure with respondents nested within countries. Research on multilevel cross-cultural analysis has emerged over the last several decades. For more information, see Van de Vijver, van Hemert, and Poortinga (2015).
Longitudinal data analysis refers to techniques used to evaluate data collected through repeated measures.
5.1 Modeling longitudinal / panel data
In panel surveys, respondents are interviewed at multiple points in time, producing “panel data” or “longitudinal data.” The first step in analyzing longitudinal data is to look at the descriptive plots, and then to select one of several possible methods of analysis. The traditional technique is the repeated measures analysis of variance (rmANOVA), although this has several limitations. More commonly used approaches include multilevel models and marginal models.
5.1.1 Descriptive plots
The “spaghetti” plot “involves plotting a subject’s values for the repeated outcome measure (vertical axis) versus time (horizontal axis) and connecting the dots chronologically” (Swihart, Caffo, James, Strand, Schwartz, & Punjabi, 2010). Plots can be created at both the individual data level and the mean level. For binary outcomes, proportions can be used to generate the plot for different population groups. In 3MC studies, the plots can be generated for different cultural or country groups.
5.1.2 Repeated measures analysis of variance (rmANOVA)
 For more information, see this online example on rmANOVA.
 The rmANOVA approach is not recommended because of the limitations listed below:
 Subjects missing any data will not be included in the analysis.
 A limited number of covariance structures are allowed.
 Time-varying covariates are not allowed.
5.1.3 Multilevel models for longitudinal data
Multilevel models account for between-respondent variance by including random effects in the model, such as a random intercept and a random slope.
 Resource 1 link.
 Resource 2 link.
 Steele (2008) and presentation slides at link.
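As a sketch of what such a model looks like, a linear growth model with a random intercept and a random slope for time can be written as follows. The notation is ours rather than that of the cited resources: y_{ti} is the outcome for respondent i at occasion t, the betas are fixed effects, u_{0i} and u_{1i} are the respondent-specific random intercept and slope, and e_{ti} is the occasion-level residual.

```latex
y_{ti} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\,\mathrm{time}_{ti} + e_{ti},
\qquad
\begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix}
  \sim N\!\left(\mathbf{0},
  \begin{pmatrix} \sigma^2_{u0} & \sigma_{u01} \\ \sigma_{u01} & \sigma^2_{u1} \end{pmatrix}\right),
\qquad
e_{ti} \sim N(0, \sigma^2_e)
```

The variances of u_{0i} and u_{1i} quantify how much respondents differ in their starting levels and in their rates of change.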
5.1.4 Marginal modeling approaches
If the between-subject variation is not of interest, the marginal modeling approach can be used. In this approach only the correlated error terms are included in the model; no random effects are added.
5.2 Suggested reading
 Chatfield (2013)
 Ballinger (2004)
 Steele (2008)
 Singer (1998)
 West (2009)
 Halekoh, Højsgaard, and Yan (2006)
 Kreuter and Muthen (2008)
 Twisk (2004)
5.3 Potential uses in 3MC research
 A country variable / indicator can be added to the marginal models as a covariate, or it can be added in a multilevel model as a fixed effect.
Latent variable models include both observed variables (the data) and latent variables. A latent variable is unobserved and represents a hypothetical construct or factor (Kline, 2011). A latent variable can be measured by several observed variables. One example of a latent variable provided by Kline (2011) is the construct of intelligence. As mentioned by Kline (2011), “there is no single, definitive measure of intelligence. Instead, researchers use different types of observed variables, such as tasks of verbal reasoning or memory capacity, to assess various facets of intelligence.” Such latent variables are usually measured in a measurement model, which evaluates the relationship between latent variables and their indicators.
6.1 Exploratory Factor Analysis and Confirmatory Factor Analysis
Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) are two types of measurement models, where latent variables are indicated by multiple observed variables. The difference between EFA and CFA relates to whether you have a hypothesis about the measurement model before doing the analysis. As mentioned by Yong and Pearce (2013), “CFA attempts to confirm hypotheses and uses path analysis diagrams to represent variables and factors, whereas EFA tries to uncover complex patterns by exploring the dataset and testing predictions”. As seen in Table 3, since both the indicators and the latent variables are continuous, both EFA and CFA are based on linear functions. Table 3 shows the differences between EFA and CFA. Because EFA is purely data-driven and may therefore be arbitrary in nature, some literature suggests always using CFA, which is theory-driven, rather than EFA (Sansone, Morf, & Panter, 2004). However, as mentioned by Sansone et al. (2004), when selecting items it is more appropriate to use EFA rather than CFA if the theory is not well established.
See Yong and Pearce (2013) for a comprehensive overview of EFA and Brown (2015) for CFA. The code for conducting EFA and CFA is included in Appendix A.
Table 3. Comparisons between EFA and CFA*
Adapted from Exploratory and Confirmatory Factor Analysis presentation at link.
Multigroup CFA (MCFA) is commonly used in 3MC research for measurement equivalence testing. The basic idea is to start with the same model in every group but allow the coefficients to differ by group (assuming configural equivalence), and then to introduce constraints on the model coefficients, such as making them equal across the groups. The fit of the successively constrained models can then be compared. Among all the models, the most parsimonious model with a good fit will be selected to evaluate the data. If the model reveals no violations of scalar equivalence, the country means can be compared directly. In a panel study, with data available at different time points, one can also evaluate measurement equivalence across cultures over time. See Guideline 6.2 below for more information on measurement equivalence testing.
For more information on MCFA, see:
6.2 Measurement equivalence in 3MC research
As mentioned by Kankaraš and Moors (2010), “measurement equivalence implies that a same measurement instrument used in different cultures measures the same construct.” There are different levels of equivalence. The three most widely discussed levels are configural, metric and scalar equivalence. These three levels are hierarchical: the higher levels impose stricter requirements of equivalence and require that the lower levels have been achieved (Kankaraš & Moors, 2010).
Configural equivalence refers to a similar construction of the latent variable. In other words, the same indicators are associated with the latent concepts in each culture. It does not require each culture to view the concept in the same way. For example, it allows the strength of the associations (i.e., loadings) to differ across cultures. Metric equivalence requires the same slopes across cultures, where the slopes capture the associations between the indicators and the latent variable. In other words, it implies “the equality of the measurement units or intervals of the scale on which the latent concept is measured across cultural groups” (Kankaraš & Moors, 2010; Steenkamp & Baumgartner, 1998). Scalar equivalence implies that, in addition to equality of the measurement units, the scales of the latent variable also have the same origin across cultures (Kankaraš & Moors, 2010). At this level, the model achieves full measurement equivalence, and researchers can compare the country scores (i.e., country means) directly.
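In the linear (CFA) case, these levels can be expressed as constraints on the measurement model. The notation below is an illustrative convention, not taken from the cited sources: x_{jg} is indicator j in cultural group g, tau_{jg} its intercept (origin), lambda_{jg} its loading (slope), xi_g the latent variable with mean kappa_g, and epsilon_{jg} the residual.

```latex
x_{jg} = \tau_{jg} + \lambda_{jg}\,\xi_{g} + \varepsilon_{jg},
\qquad
E(x_{jg}) = \tau_{jg} + \lambda_{jg}\,\kappa_{g}

\text{Configural: the same indicators } j \text{ load on } \xi_g \text{ in every group } g
\text{Metric: } \lambda_{j1} = \lambda_{j2} = \dots = \lambda_{jG} \text{ for all } j
\text{Scalar: metric, plus } \tau_{j1} = \tau_{j2} = \dots = \tau_{jG} \text{ for all } j
```

The expression for E(x_{jg}) shows why scalar equivalence is needed before comparing means: a difference in observed means reflects a difference in latent means only when both the loadings and the intercepts are equal across groups.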
In situations where full equivalence is difficult to achieve, researchers also evaluate the conditions under which different cultures achieve partial equivalence. An example of partial equivalence is when most of the indicators are equivalent across cultures, but one has a different slope and different thresholds across cultures. One can then conclude that the different cultures achieve partial equivalence, where they differ on one specific indicator. As mentioned by Kankaraš and Moors (2010), “partial equivalence enables a researcher to control for a limited number of violations of the equivalence requirements and to proceed with substantive analysis of crosscultural data” (see also Steenkamp & Baumgartner, 1998).
The aforementioned approaches to assessing measurement equivalence have been widely used in 3MC survey analysis. However, they have recently been criticized for being overly strict. As mentioned by Davidov et al. (2015), it is difficult to achieve scalar equivalence or even metric equivalence in surveys with many countries or cultural groups. A Bayesian approximate equivalence testing approach has recently been proposed by Davidov et al. (2015). This approach allows “small variations” in parameters across different cultural groups (Davidov et al., 2015). Thus, when approximate scalar measurement equivalence is reached, one can compare across cultures meaningfully, even though the traditional method may indicate scalar inequivalence (Davidov et al., 2015). For introductions and references to Bayesian methods, see Guideline 9.
6.3 Latent Class Analysis (LCA)
Unlike the previously mentioned approaches, such as CFA and SEM, where the latent variables are continuous, LCA treats the latent variables as categorical, either nominal or ordinal (see Table 3). The categories of the latent variable in LCA are referred to as classes, which represent “a mixture of subpopulations where membership is not known but is inferred from the data” (Kline, 2011). That is to say, LCA can classify respondents into different groups based on their attitudes or behaviors, such as classifying respondents by their drinking behavior. Respondents in the same group are similar to each other with regard to the behavior or attitudes, and they differ from those in other groups (e.g., heavy drinkers vs. nonalcoholic drinkers). One can also add covariates to the model if those measures can influence the class membership. In a second step, the class membership from the model can then be used for follow-up analysis. For example, to better understand the differences between respondents, a logistic (or multinomial logistic, if more than two groups) regression model can be run in which selected covariates are used to predict the class membership. Or, to evaluate the influence of the class membership on other variables, the class membership can be used as a covariate in regression models to predict other outcomes.
For more information on LCA, please see:
As mentioned by Kankaraš, Moors, and Vermunt (2010), when testing for measurement invariance with latent class analysis, “the model selection procedure usually starts by determining the required number of latent classes or discrete latent factors for each group. … If the number of classes is the same across groups, then the heterogeneous model is fitted to the data; followed by a series of nested, restricted models which are evaluated in terms of model fit”. That is to say, unlike multigroup CFA, multigroup LCA first needs to establish whether the number of classes is the same across groups, before testing models at different invariance levels. See Kankaraš, Moors, and Vermunt (2010), Eid, Langeheine, and Diener (2003), and Kankaraš and Moors (2009) for more information.
6.4 Structural Equation Modeling (SEM)
Structural Equation Modeling is a multivariate analysis technique used in many disciplines, which aims to test hypotheses about causal relationships between variables (Holbert & Stephenson, 2008). It usually includes two components: 1) the measurement model, which summarizes several observed variables using their latent construct (e.g., CFA as discussed in Guideline 6.1), and 2) the structural model, which describes the relationships between multiple constructs (e.g., relationships among both latent and observed variables).
6.4.1 Variables
Similar to the previously discussed latent variable models, SEM can have both observed and latent variables, where observed variables are the data collected from respondents and latent variables represent unobserved constructs and factors (Kline, 2011). The observed variables which are used as measures of a construct are indicators of the latent variable. In other words, the latent variable is indicated by these observed variables.
Besides observed and latent variables, SEM models also include error terms, similar to the error terms in a regression analysis. As mentioned by Kline (2011), “a residual term represents variance unexplained by the factor that the corresponding indicator is supposed to measure. Part of this unexplained variance is due to random measurement error, or score unreliability”.
6.4.2 Analysis of Covariance Structure
In SEM analysis, parameter estimation is done by comparing the model-based covariance matrix with the data-based covariance matrix. The goal of this approach is to evaluate whether the model with the best fit is supported by the data, that is, whether the two covariance matrices are consistent with each other and whether the model can explain as much of the variance in the data as possible.
6.4.3 Means of latent variables
Structural equation models can also estimate the means of latent variables. They also allow researchers to analyze between- and within-group mean differences (Kline, 2011). In 3MC analysis, one can estimate the group mean differences on latent variables, such as between two cultures.
6.4.4 Suggested Reading
6.5 Item Response Theory (IRT)
IRT is commonly used for psychometric and educational testing, and is becoming more popular in 3MC analysis. It begins with the idea that when answering a specific question, the response provided by an individual depends on the ability / qualities of the individual and the qualities of the question item. As mentioned by Ostini and Nering (2005), “The mathematical foundation of IRT is a function that relates the probability of a person responding to an item in a specific manner to the standing of that person on the trait that the item is measuring. In other words, the function describes, in probabilistic terms, how a person with a higher standing on a trait (i.e., more of the trait) is likely to provide a response in a different response category to a person with a low standing on the trait.” Therefore, IRT allows researchers to model the probability of a specific response to a question item, given the item and the individual’s trait level.
The simplest IRT model is the Rasch model, also called the one-parameter IRT model, which assumes equal item discrimination (“the extent to which the item is able to distinguish between individuals on the latent construct”; Chan, 2000). This model starts from the premise that the probability of giving a “positive” answer to a yes/no question is “a logistic function of the distance between the item’s location, also referred to as item difficulty, and the person’s location on the construct being measured”, also known as the person’s latent trait level (Mneimneh, Heeringa, Tourangeau & Elliott, 2014). There are other types of IRT models available. They can be categorized by the number of parameters and the question response option format, such as binary or multiple response options and whether ordinal or nominal. Table 4 below summarizes different types of IRT models.
In a two-parameter (2PL) IRT model, an item discrimination parameter is also included in the model. The discrimination parameters are “analogous” to the factor loadings in CFA and EFA, since they all represent “the relationship between the latent trait and item responses” (Brown, 2015). A three-parameter (3PL) IRT model also includes a “guessing” parameter. It describes the situation in which a question can be answered by guessing, so that the probability of giving a correct answer is higher than zero even for those with a low latent trait level.
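The three binary models can be sketched with a single item response function, in which the 1PL and 2PL models are special cases of the 3PL model. The function and parameter names below are hypothetical, and this common logistic parameterization is an assumption rather than the form used by any particular software package.

```python
import math

def irt_probability(theta, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a positive/correct response under a logistic IRT model.

    theta: the person's latent trait level.
    difficulty: the item's location.
    discrimination: item discrimination (1.0 for all items gives the Rasch/1PL form).
    guessing: lower asymptote (0.0 gives the 1PL or 2PL form).
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic

# Hypothetical item with difficulty 0, evaluated at several trait levels.
for theta in (-2.0, 0.0, 2.0):
    p_1pl = irt_probability(theta, difficulty=0.0)
    p_2pl = irt_probability(theta, difficulty=0.0, discrimination=2.0)
    p_3pl = irt_probability(theta, difficulty=0.0, discrimination=2.0, guessing=0.2)
    print(f"theta = {theta:+.1f}: 1PL = {p_1pl:.3f}, 2PL = {p_2pl:.3f}, 3PL = {p_3pl:.3f}")
```

When the trait level equals the item difficulty, the 1PL and 2PL probabilities are 0.5; with a guessing parameter, the probability never falls below that parameter however low the trait level.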
For items with multiple response options (ordinal or nominal variables), polytomous IRT models can be used. See Table 4 for details. In this chapter, we will not discuss these models in detail. See suggested readings on IRT models for more information.
Table 4. Variable type and IRT model choices.
Type of observed variable | Model
Binary | 1-parameter logistic (1PL) model / Rasch model; 2PL model; 3PL model
Multiple response options (Ordinal) | Graded Response model / Thurstone/Samejima polytomous models; Partial Credit model (PCM) and Graded PCM
Multiple response options (Nominal) | Rating Scale model; Nominal response model / Bock’s model
6.5.1 Suggested Reading
 Ostini and Nering (2005)
 Davidov, Schmidt, and Billiet (2011)
 Hambleton, Swaminathan, and Rogers (1991)
 Van der Linden & Hambleton (2013)
 Nering & Ostini (2011)
 Mneimneh, Heeringa, Tourangeau and Elliott (2014)
6.6 Other types of latent variable models
Besides what is discussed above, other types of latent variable models are available. Some examples are listed below.
 Latent Transition Model
As described by Kline (2011), “a special kind of latent class factor model that represents the shift from one of two different states, such as from nonmastery to mastery of a skill, is a latent transition model”. See online material for more information.
 Latent Profile Model
In latent profile models, the latent variable is categorical and the indicators are continuous. It is commonly used for cluster analysis. See Vermunt (2004) for more information.
 Mixed Rasch Model
Mixed Rasch model is “a combination of the polytomous Rasch model with latent class analysis” (Quandt, 2011). See Quandt (2011) for more information.
 Multilevel Structural Equation Modeling (MLSEM)
When the population of individuals is divided into different groups, such as in a 3MC context, Multilevel Structural Equation Modeling (MLSEM) can be used. This model decomposes individual data into within-group and between-group components, and can simultaneously estimate within-group and between-group models (Muthén & Muthén, 2007). For more information on MLSEM, see Rabe-Hesketh, Skrondal, & Zheng (2007).
6.7 Potential uses in 3MC research
 As noted by Steinmetz (2011), the observed mean does not equal the latent mean; rather, the observed mean is a function of the item intercepts, the factor loadings, and the latent mean. Similarly, “observed mean differences between two or more groups (e.g., cultures) do not necessarily indicate latent mean differences as unequal intercepts and/or factor loadings will also lead to observed differences” (Steinmetz, 2011). To support more valid comparisons across groups (e.g., cultures), measurement invariance testing is widely used. It evaluates whether the groups differ in the factor loadings and intercepts of the measures, and therefore whether their latent means are comparable. See Steinmetz (2011) for more information.
 Measurement invariance testing is usually conducted within the multigroup analysis (MGA) framework. The most commonly used approach is multigroup confirmatory factor analysis (MGCFA) (e.g., Steinmetz, 2011). Other types of MGA include multigroup structural equation modeling (MGSEM) (e.g., Meuleman & Billiet, 2011), multigroup latent class analysis (e.g., Kankaraš, Vermunt, & Moors, 2011), multigroup IRT models (e.g., Janssen, 2011), and multigroup mixed Rasch models (e.g., Quandt, 2011). See Davidov et al. (2011) for more information.
A recent paper by Welzel and Inglehart discusses misconceptions in measurement equivalence analysis. Using data from the World Values Survey, they show that “constructs can entirely lack convergence at the individual level and nevertheless exhibit powerful and important linkages at the aggregate level” (Welzel & Inglehart, 2016).
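The relation between observed and latent means described in this section can be written out for a single item. The notation below is an assumption introduced for illustration and is not taken from Steinmetz (2011): item j in group g has intercept τ, loading λ, latent variable ξ with mean κ, and a residual ε with mean zero.

```latex
\begin{align}
x_{jg} &= \tau_{jg} + \lambda_{jg}\,\xi_{g} + \varepsilon_{jg},
  \qquad E(\varepsilon_{jg}) = 0, \quad E(\xi_{g}) = \kappa_{g} \\
E(x_{jg}) &= \tau_{jg} + \lambda_{jg}\,\kappa_{g} \\
E(x_{j1}) - E(x_{j2}) &= (\tau_{j1} - \tau_{j2})
  + (\lambda_{j1}\kappa_{1} - \lambda_{j2}\kappa_{2}) \\
E(x_{j1}) - E(x_{j2}) &= \lambda_{j}\,(\kappa_{1} - \kappa_{2})
  \quad \text{if } \tau_{j1} = \tau_{j2} \text{ and } \lambda_{j1} = \lambda_{j2} = \lambda_{j}
\end{align}
```

Under these assumptions, a difference in observed item means reflects only the latent mean difference when both intercepts and loadings are equal across groups, which is what measurement invariance testing examines.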
7. Differential Item Functioning (DIF)
Differential Item Functioning (DIF) is a statistical concept developed to identify the extent to which a question item might be measuring different properties for individuals from different groups, defined by, for example, ethnicity, culture, region, language, sex, or other demographic characteristics. DIF can serve as an indicator of “item bias” if items function in systematically different ways across cultures. Several methods can be used to detect DIF, as listed below:
 Mantel-Haenszel. The Mantel-Haenszel (MH) statistic is regarded as a “reference” technique for detecting DIF because of its ease of use and because it can be applied to small samples (Padilla, Hidalgo, Benítez, & Gómez-Benito, 2012). The disadvantage of the MH statistic is that it does not allow statistical significance testing.
 Logistic regression. Logistic regression can be used as an alternative method to detect DIF. For more information, see Clauser, Nungester, Mazor, & Ripkey (1996).
 Techniques based on IRT models. DIF can be detected using an IRT framework. Item characteristic curves (ICCs) of the same item but from different groups can be compared to evaluate whether there is DIF. For more information, see Thissen, Steinberg, & Wainer (1993) and Zumbo (2007).
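As a hedged illustration of the Mantel-Haenszel approach listed above, the following minimal Python sketch computes the MH common odds ratio across matched score levels. The function name, the layout of each stratum, and the counts are hypothetical and chosen only for illustration; they do not come from the cited sources.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched score levels.

    Each stratum is a tuple (a, b, c, d) for one level of the matching score:
      a = reference group, endorsed the item
      b = reference group, did not endorse the item
      c = focal group, endorsed the item
      d = focal group, did not endorse the item
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts: both groups respond identically at every score level.
strata_no_dif = [(20, 30, 20, 30), (30, 20, 30, 20), (40, 10, 40, 10)]
# Hypothetical counts: the reference group endorses more often at every level.
strata_dif = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]

or_no_dif = mantel_haenszel_odds_ratio(strata_no_dif)
or_dif = mantel_haenszel_odds_ratio(strata_dif)
print(or_no_dif, or_dif)
```

In this sketch, an odds ratio close to 1 indicates that matched respondents in the two groups endorse the item at similar rates, while values further from 1 point toward possible DIF.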
8. Machine learning
Machine learning is “a general term for a diverse number of classification and prediction algorithms” (Lee, Lessler, & Stuart, 2010) with applications in many different fields. Unlike statistical modeling approaches, machine learning evaluates the relationship between the outcome variable and the predictors using a “learning algorithm without an a priori model” (Lee et al., 2010). Below we introduce several machine learning methods.
8.1 Classification tree
The classification tree is a data-driven method that allows researchers to evaluate complex interactions among variables when many predictor variables are present. In binary trees, each node of the tree is divided into two branches. To construct and prune a tree, a deviance measure is used to choose the splits. In R, the “rpart” package can be used for classification tree analysis (see Appendix A for more information). A classification tree can be evaluated through its apparent error rate and its true error rate. The former is the error rate when the tree is applied to the training data set; the latter is the error rate when it is applied to a new data set, or test data. To estimate the true error rate, researchers usually divide the data into two parts, training data and test data, and validate the tree on the test data set.
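The ideas above (a deviance-based split, apparent error, and test error) can be sketched with a one-split tree in plain Python. This is only an illustration under stated assumptions: it is not the “rpart” algorithm, it handles a single numeric predictor, and the function names and data are hypothetical.

```python
import math


def deviance(labels):
    """Node deviance: -2 * sum over classes of n_k * log(n_k / n)."""
    n = len(labels)
    total = 0.0
    for cls in set(labels):
        n_k = labels.count(cls)
        total += n_k * math.log(n_k / n)
    return -2.0 * total


def majority(labels):
    """Most frequent class label in a node."""
    return max(set(labels), key=labels.count)


def fit_stump(x, y):
    """Choose the threshold on x that minimizes the summed child deviance."""
    best_t, best_dev = None, float("inf")
    values = sorted(set(x))
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        dev = deviance(left) + deviance(right)
        if dev < best_dev:
            best_t, best_dev = t, dev
    left = [yi for xi, yi in zip(x, y) if xi <= best_t]
    right = [yi for xi, yi in zip(x, y) if xi > best_t]
    return best_t, majority(left), majority(right)


def error_rate(stump, x, y):
    """Share of cases misclassified by the fitted one-split tree."""
    t, left_label, right_label = stump
    wrong = sum((left_label if xi <= t else right_label) != yi
                for xi, yi in zip(x, y))
    return wrong / len(y)


# Hypothetical training and test data (one predictor, binary outcome).
train_x = [1, 2, 3, 4, 5, 6, 7, 8]
train_y = [0, 0, 0, 0, 1, 0, 1, 1]
test_x = [2, 3, 5, 7]
test_y = [0, 1, 1, 1]

stump = fit_stump(train_x, train_y)
apparent_error = error_rate(stump, train_x, train_y)  # training data
test_error = error_rate(stump, test_x, test_y)        # held-out data
print(stump, apparent_error, test_error)
```

In this toy example the error on the held-out data is larger than the apparent error, which is why the tree is validated on data that were not used to build it.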
8.2 Random forest
Random forest is an algorithm for classification that uses an “ensemble” of classification trees (Díaz-Uriarte & De Andres, 2006). By averaging over a large ensemble of “low-bias, high-variance but low correlation trees”, the algorithm yields an ensemble that can achieve both “low bias and low variance” (Díaz-Uriarte & De Andres, 2006).
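Why low correlation among the trees matters can be seen from a standard variance identity. As a simplifying assumption made here for illustration, suppose the B tree predictions T_1, ..., T_B each have variance σ² and every pair has correlation ρ:

```latex
\begin{equation}
\operatorname{Var}\!\left( \frac{1}{B} \sum_{b=1}^{B} T_{b} \right)
  = \rho\,\sigma^{2} + \frac{1 - \rho}{B}\,\sigma^{2}
\end{equation}
```

The second term shrinks as more trees are averaged, while the first term remains, so under these assumptions the variance of the ensemble is limited mainly by how strongly the trees are correlated.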
8.3 Suggested reading
 Ledolter (2013)
 Lemon, Roy, Clark, Friedmann, and Rakowski (2003)
 Lewis and Street (2000)
 Loh (2014)
8.4 Potential uses in 3MC research
 Classification tree analysis in cross-cultural research allows researchers to evaluate 1) the important factors for each culture, and 2) how the factor interactions differ across cultures. One study used classification trees to examine alcohol consumption among American and Greek college students and found that “student attitudes toward drinking were important in the classification of American and Greek drinkers” (Kitsantas, Kitsantas, & Anagnostopoulou, 2008).
9. Incorporate complex survey data features
It is usually difficult to draw a simple random sample from the population because of cost and practical constraints, such as the lack of a comprehensive sampling frame. As discussed in Sample Design, complex samples, such as those with stratified or cluster designs, are commonly used in surveys. In a simple random sample, one can assume that observations are independent of each other. However, in a complex sample design, such as a multistage sample of schools, classes, and students, students from the same classroom are likely to be more similar to each other than to students from another classroom. Therefore, as described in Sample Design, the analysis needs to compensate for complex survey design features including, but not limited to, unequal probabilities of selection, differences in response rates across key subgroups, and deviations from the distributions of critical variables in the target population as known from external sources, such as a national census. This is most commonly done through the development of survey weights for statistical adjustment. If a complex sample design is implemented in data collection but the analysis assumes simple random sampling, the variances of the survey estimates can be underestimated, and the confidence intervals and test statistics are likely to be biased (Heeringa, West, & Berglund, 2010).
In a recent meta-analysis of 150 sampled research papers analyzing several surveys with complex sampling designs, analytic errors caused by ignorance or incorrect use of the complex sample design features were found to be frequent. Such analytic errors define an important component of the larger total survey error framework, produce misleading descriptions of populations, and ultimately yield misleading inferences (Aurelien, West, & Sakshaug, 2016). It is thus critically important to incorporate complex survey design features in statistical analysis.
For many of the aforementioned statistical models, statistical software programs support the analysis of complex survey data, for example through the “svy” prefix in Stata and the SURVEY procedures in SAS. See Appendix A for more information.
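Two consequences of complex designs discussed in this section, unequal selection probabilities and clustering, can be illustrated with a short Python sketch. It is not a substitute for the survey procedures mentioned above. The function names and all numbers are hypothetical, and the design effect uses Kish's approximation for equal-sized clusters, 1 + (m - 1) × ICC, which is an assumption added here rather than a formula given in this chapter.

```python
import math


def weighted_mean(values, weights):
    """Mean in which each case counts in proportion to its survey weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def kish_design_effect(cluster_size, icc):
    """Approximate variance inflation due to clustering (equal cluster sizes)."""
    return 1.0 + (cluster_size - 1) * icc


# Hypothetical binary outcome; the second respondent represents more people.
values = [1, 0, 1, 1]
weights = [100, 300, 100, 100]
unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)

# Hypothetical design: 25 students per classroom, intraclass correlation 0.10.
deff = kish_design_effect(cluster_size=25, icc=0.10)
n = 1000
effective_n = n / deff          # sample size an SRS would need for equal precision
se_inflation = math.sqrt(deff)  # factor by which an SRS standard error is too small

print(unweighted, weighted, deff, effective_n, se_inflation)
```

In this example, ignoring the weights changes the estimate itself, and ignoring the clustering would understate the standard error, consistent with the underestimated variances described above.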
9.1 Suggested reading
 Heeringa, West, and Berglund (2010)
 Carle (2009)
 Rabe-Hesketh and Skrondal (2006)
 Stapleton (2006)
 Valliant, Dever, and Kreuter (2013)
10. Introduction to Bayesian inference
This section presents an overview of Bayesian theory, closely following the overviews of Lee (2012), Barendse, Albers, Oort, & Timmerman (2014), and Kaplan & Depaoli (2013). In surveys, respondents’ answers, denoted as $y$, reflect our measure of the true population’s $Y$, a random variable that takes on a realized value $y$. In other words, $Y$ is unobserved, and its probability distribution is of interest to researchers. We use $\theta$ to denote a parameter that reflects the characteristics of the distribution of $Y$. For example, $\theta$ can be the mean of the distribution. The goal is to estimate the unknown parameter $\theta$ based on the data, that is, $p(\theta \mid y)$. Based on Bayes’ theorem,
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},$$
where $p(y)$ is the probability distribution of the data, which is known to researchers, $p(y \mid \theta)$ refers to the probability of the data given the unknown parameter $\theta$, and $p(\theta)$ is the prior distribution of the parameter. $p(\theta \mid y)$ is thus referred to as the posterior distribution of the parameter given the data, which is also the result of the model.
In summary, Bayesian methods use both the prior information (which indicates the distribution of the parameters) and the distribution of the data to estimate the model results, namely the posterior distributions of the parameters. The key difference between the Bayesian and frequentist approaches relates to the unknown parameter $\theta$. In the frequentist approach, $\theta$ is viewed as unknown but fixed. In the Bayesian approach, by contrast, $\theta$ is random and has a posterior distribution that takes into account the uncertainty about $\theta$.
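To make the combination of prior and data concrete, the following minimal Python sketch applies Bayes' theorem on a grid of parameter values. The setting is an assumption chosen for illustration and is not taken from the cited overviews: the parameter is a proportion, the data are 7 “yes” answers out of 10, and the prior is flat.

```python
import math


def grid_posterior(successes, trials, grid_size=1000):
    """Posterior density of a proportion on a grid, using a flat prior."""
    step = 1.0 / grid_size
    thetas = [(i + 0.5) * step for i in range(grid_size)]  # grid midpoints
    prior = [1.0 for _ in thetas]                           # flat prior density
    likelihood = [
        math.comb(trials, successes) * t ** successes * (1 - t) ** (trials - successes)
        for t in thetas
    ]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    evidence = sum(unnormalized) * step  # numerical approximation of p(y)
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior, step


# Hypothetical data: 7 of 10 respondents answer "yes".
thetas, posterior, step = grid_posterior(successes=7, trials=10)
posterior_mean = sum(t * d for t, d in zip(thetas, posterior)) * step
print(round(posterior_mean, 3))
```

Here the normalizing sum plays the role of the probability of the data, and the resulting posterior is a full distribution over the proportion rather than a single fixed value.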
10.1 Priors
There are generally two types of priors: noninformative and informative. The choice between them depends on how much prior information we have and how confident we are in its accuracy. Noninformative priors are also referred to as “vague” or “diffuse” priors. They are used when there is little prior information, so that their influence on the posterior distribution of $\theta$ is minimal (Lee, 2012). An example of a noninformative prior is a density with a very large variance, so that the Bayesian estimation is mainly affected by the data. Informative priors are used when we have sufficient prior information, such as expert knowledge or results from similar data sets.
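The contrast between the two types of priors can be sketched with the standard conjugate result for a normal mean with known data variance. This setup is an assumption made for illustration, and the function name, data, and prior settings are hypothetical.

```python
def normal_posterior(data, data_variance, prior_mean, prior_variance):
    """Posterior mean and variance of a normal mean (data variance known)."""
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1.0 / prior_variance
    data_precision = n / data_variance
    post_variance = 1.0 / (prior_precision + data_precision)
    post_mean = post_variance * (
        prior_precision * prior_mean + data_precision * sample_mean
    )
    return post_mean, post_variance


# Hypothetical data with sample mean 5.0 and known variance 1.0.
data = [4.8, 5.2, 5.0, 5.4, 4.6]

# Diffuse prior: centered at 0 with a very large variance.
diffuse_mean, _ = normal_posterior(data, 1.0, prior_mean=0.0, prior_variance=1e6)

# Informative prior: centered at 3 with a small variance.
informative_mean, _ = normal_posterior(data, 1.0, prior_mean=3.0, prior_variance=0.1)

print(diffuse_mean, informative_mean)
```

With the diffuse prior the posterior mean stays essentially at the sample mean, whereas the informative prior pulls it toward the prior mean, which is why the accuracy of an informative prior matters.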
10.2 Bayesian model comparison
There are multiple Bayesian model comparison statistics. The two most commonly used are the Bayes factor and the deviance information criterion (DIC). The Bayes factor quantifies the odds that the data favor one hypothesis over another. As discussed in Guideline 5, Bayes factors are not well defined when using noninformative priors (Berg, Meyer, & Yu, 2004), and their evaluation can be computationally difficult (Lee, 2012). DIC balances goodness of fit against model complexity. In practical applications, the model with the smaller DIC value is preferred.
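For reference, the two statistics are commonly written as follows. The notation is an assumption added here: M_1 and M_2 are the competing models, D(θ) is the deviance, a bar over D denotes its posterior mean, and a bar over θ denotes the posterior mean of the parameters.

```latex
\begin{align}
BF_{12} &= \frac{p(y \mid M_{1})}{p(y \mid M_{2})} \\
D(\theta) &= -2 \log p(y \mid \theta) \\
p_{D} &= \bar{D} - D(\bar{\theta}) \\
\mathrm{DIC} &= \bar{D} + p_{D}
\end{align}
```

In this form, the posterior mean deviance measures fit and p_D acts as the penalty for model complexity.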
10.3 Credible interval
Once we have estimated the posterior distributions of the parameters, we would like summaries of those distributions, such as the mean and variance, for hypothesis testing. One important summary is the credible interval, which is often described as the Bayesian counterpart of the “confidence interval” in the frequentist approach. The credible interval is based on the quantiles of the posterior distribution. Based on the quantiles, we can directly evaluate the probability that the parameter lies in a particular interval. When this probability is 0.95, the interval is referred to as a 95% credible interval. If the credible intervals from two models do not overlap, we say that the two posterior distributions of this parameter differ.
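A quantile-based credible interval can be computed directly from posterior draws, as in the following hedged Python sketch. The Beta(8, 4) posterior, the seed, and the function name are assumptions chosen for illustration; in practice the draws would come from the fitted model.

```python
import random


def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from the quantiles of posterior draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lower_index = int(tail * (len(ordered) - 1))
    upper_index = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lower_index], ordered[upper_index]


# Hypothetical posterior for a proportion: Beta(8, 4), simulated with a fixed seed.
rng = random.Random(2024)
draws = [rng.betavariate(8, 4) for _ in range(10000)]

lower, upper = credible_interval(draws, level=0.95)
share_inside = sum(lower <= d <= upper for d in draws) / len(draws)
print(lower, upper, share_inside)
```

About 95% of the simulated posterior draws fall between the two quantiles, which is the direct probability statement about the parameter described above.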
10.4 Markov Chain Monte Carlo (MCMC) Methods
MCMC methods are the most common computational algorithms for Bayesian estimation. They generate Markov chains that simulate the posterior distribution. The basic idea is that by simulating a sufficiently large number of observations from the posterior distribution, $p(\theta \mid y)$, we can approximate the mean and other summary statistics of the distribution. In latent variable models, MCMC posterior simulation treats the latent variables as missing data, which are used to augment the observed variables. The most common MCMC algorithm is the Gibbs sampler, which performs alternating conditional sampling at each iteration. More specifically, it draws each component conditional on the current values of all the other components (Lee, 2012). The early portion of a Markov chain, which may not yet have converged to the target distribution, is called the burn-in.
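The alternating conditional sampling of the Gibbs sampler can be sketched for a simple target. As an assumption made for illustration (this example is not from Lee, 2012), the target is a standard bivariate normal distribution with correlation 0.8, for which each component given the other is normal with mean rho times the other component and variance 1 - rho². The starting values, seed, and chain length are hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, burn_in, start=(10.0, -10.0), seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    x, y = start
    cond_sd = math.sqrt(1.0 - rho ** 2)
    kept = []
    for i in range(n_iter):
        x = rng.gauss(rho * y, cond_sd)  # draw x given the current y
        y = rng.gauss(rho * x, cond_sd)  # draw y given the new x
        if i >= burn_in:                 # discard the burn-in portion
            kept.append((x, y))
    return kept


def correlation(pairs):
    """Sample correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(p[0] for p in pairs) / n
    mean_y = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pairs)
    var_x = sum((p[0] - mean_x) ** 2 for p in pairs)
    var_y = sum((p[1] - mean_y) ** 2 for p in pairs)
    return cov / math.sqrt(var_x * var_y)


draws = gibbs_bivariate_normal(rho=0.8, n_iter=21000, burn_in=1000)
mean_x = sum(p[0] for p in draws) / len(draws)
sample_corr = correlation(draws)
print(mean_x, sample_corr)
```

The chain is deliberately started far from the target so that the early draws are unrepresentative; after the burn-in is discarded, the retained draws approximate the mean and correlation of the target distribution.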
10.5 Convergence diagnostics
Multiple convergence diagnostics exist. In practice, it is common to inspect several different diagnostics, since no single diagnostic is adequate on its own. One of the most common statistics when multiple chains are run is the Gelman and Rubin diagnostic (Gelman & Rubin, 1992), which compares the within-chain and between-chain variances. A value above 1.1 indicates a lack of convergence. Common diagnostics for a single chain include the Geweke (1992) convergence diagnostic and the Raftery and Lewis (1992) convergence diagnostic, which can help to decide how many iterations are needed and how many should be treated as burn-in in a sufficiently long chain.
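The comparison of within-chain and between-chain variance can be sketched as follows. This is a minimal version of the potential scale reduction factor for chains of equal length; refinements used in current software, such as splitting chains, are omitted, and the simulated chains are hypothetical stand-ins for real MCMC output.

```python
import math
import random
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(7)

# Three hypothetical chains sampling the same distribution.
mixed_chains = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]

# Three hypothetical chains, one of which is stuck around a different value.
stuck_chains = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 3.0)]

r_hat_mixed = gelman_rubin(mixed_chains)
r_hat_stuck = gelman_rubin(stuck_chains)
print(r_hat_mixed, r_hat_stuck)
```

When the chains agree, the between-chain variance adds little and the statistic is close to 1; when one chain explores a different region, the statistic rises above the 1.1 threshold mentioned above.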
10.6 Suggested reading
 Davidov et al. (2015)
 Lee and Song (2012)
 Fox (2010)
 Stone and Zhu (2015)
 Muthén and Asparouhov (2012)
10.7 Potential uses in 3MC research
 As previously mentioned in Guideline 6, the approximate Bayesian measurement equivalence approach can be used for cross-cultural comparative research (e.g., Davidov et al., 2015; Bolt, Lu, & Kim, 2014). See the suggested readings in Guideline 10.6 above for more information.