There are three mechanisms for missing data. They differ in how the missingness relates to the variable of interest and to the other variables available to explain the missingness.
- Missing completely at random (MCAR):
- This missing data mechanism assumes that the underlying process causing missing data is unrelated to any of the variables in the dataset. In other words, the probability of an observation for variable \(y\) being missing does not depend on measurements (\(x\) or \(y\) in the diagram below) in the dataset itself. An example of MCAR data is missing data due to an instrument malfunction. If MCAR holds, listwise deletion (i.e., an entire record is excluded from analysis if any one value is missing) can be employed, because the available cases constitute a random subsample. Therefore, under MCAR, valid inferences to the target population can be made when analyzing only those units with complete data. If there are variables in the dataset (\(x\), \(y\)) that help predict which values are missing, the assumption does not hold. MCAR rarely holds, and, thus, listwise deletion will seldom be appropriate.
- The concept of MCAR is illustrated below where \(y\) is the variable of interest with missing values, \(x\) is a predictor of \(y\), \(m\) is the process causing missingness, and \(q\) is a variable not in the dataset.
- Missing at random (MAR):
- MAR is a weaker assumption about missingness than MCAR. In MAR, the process causing missing values can be explained by observed, non-missing data (\(x\) in the diagram below) other than the variable of interest (\(y\)). The probability that \(y\) is missing is not related to the value of \(y\) after controlling for the other observed variables. For data that are MCAR or MAR, the missing data mechanism is deemed ignorable. Note that the missing data mechanism is what is ignorable, not the missing data themselves. For data that are MAR, imputation that uses the observed variables explaining the missingness will reduce bias.
- The concept of MAR is illustrated below, where \(y\) is the variable of interest with missing values and \(x\) predicts both \(y\) and the mechanism for missing values, \(m\). The variable \(q\) is auxiliary to the dataset and also predicts \(m\).
- Missing not at random (MNAR):
- For data that are MNAR, even after controlling for other observed variables in the dataset (\(x\) in the diagram below), the probability that \(y\) is missing still depends on the unobserved values of \(y\) itself. One example of data that could be MNAR is reported income. Individuals with either high or low incomes can be reluctant to report how much they earn. If this is true, the probability of obtaining a measure of a person’s income will depend upon the amount the person earns. Nonignorable nonresponse creates data that are MNAR and, hence, a method of imputation that accounts for the missingness mechanism is necessary.
- In the diagram below, \(y\) is the variable of interest with missing values, \(x\) is a predictor of \(y\) in the dataset, and \(q\) is unobserved auxiliary data. The three variables \(y\), \(x\), and \(q\) all predict \(m\), the mechanism of missing values.
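
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the text: the data-generating model (\(y = x + \text{noise}\)), the logistic missingness probabilities, the coefficient 2.0, and the function name `simulate_mechanisms` are all assumptions. For brevity it leaves out the unobserved variable \(q\), and it uses simple regression imputation of \(y\) on \(x\) as a stand-in for "imputation" in general. The true mean of \(y\) is 0, so each estimate can be compared against that value.

```python
import numpy as np


def simulate_mechanisms(n=100_000, seed=0):
    """Return {mechanism: (complete_case_mean, imputed_mean)} for the mean of y.

    Hypothetical setup: x is always observed, y = x + noise has true mean 0,
    and m marks the records where y is missing.
    """
    rng = np.random.default_rng(seed)

    x = rng.normal(size=n)        # observed predictor
    y = x + rng.normal(size=n)    # variable of interest, true mean 0

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Probability that y is missing under each (assumed) mechanism.
    p_missing = {
        "MCAR": np.full(n, 0.5),     # depends on nothing in the dataset
        "MAR": sigmoid(2.0 * x),     # depends only on the observed x
        "MNAR": sigmoid(2.0 * y),    # depends on the unseen value of y itself
    }

    results = {}
    for name, p in p_missing.items():
        m = rng.random(n) < p        # True where y is missing
        obs = ~m

        # Listwise deletion: use only the records where y is observed.
        cc_mean = y[obs].mean()

        # Regression imputation: fit y ~ x on complete cases, then fill in
        # the missing y values with the fitted line.
        slope, intercept = np.polyfit(x[obs], y[obs], 1)
        y_filled = np.where(m, intercept + slope * x, y)

        results[name] = (cc_mean, y_filled.mean())
    return results


if __name__ == "__main__":
    for name, (cc, imp) in simulate_mechanisms().items():
        print(f"{name}: complete-case mean = {cc:+.3f}, imputed mean = {imp:+.3f}")
```

Under these assumptions, the complete-case mean should sit close to 0 only for MCAR. For MAR, the complete-case mean is biased but the imputed mean should recover a value near 0, because \(x\) fully explains the missingness. For MNAR, both estimates stay biased, since nothing observed accounts for the dependence of \(m\) on \(y\).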