There are two methods for
dealing with missing data that have become available in mainstream
statistical software in the last few years. These two methods are vast
improvements over traditional approaches, as described in Limitations to
Common Approaches to Missing Data. This article outlines these two
methods.
Both of the methods discussed
here require that the missing data mechanism is ignorable, that is, not
related to the missing values (see Missing Data Mechanisms). If
the mechanism is ignorable, resulting estimates (i.e., regression
parameters and standard errors) will be unbiased with no loss of power.
The first method is Multiple
Imputation (MI). Just like the imputation methods discussed in Limitations to
Common Approaches to Missing Data, Multiple Imputation fills in
estimates for the missing data. However, to capture the uncertainty in
those estimates, MI imputes the values multiple times. Because it uses
an imputation method with error built in, the multiple estimates should
be similar, but not identical. The result is multiple data sets with
identical values for all of the non-missing values and slightly
different values for the imputed values in each data set. The
statistical analysis of interest, such as ANOVA or logistic regression,
is performed separately on each data set, and the results are then
combined. Because of the variation in the imputed values, there should
also be variation in the parameter estimates, leading to appropriate
estimates of standard errors and appropriate p-values.
Multiple Imputation is
available in SAS, S-Plus, and Solas. In SAS, PROC MI creates the
multiple data sets, which can then be easily analyzed separately using
standard statistical procedures. PROC MIANALYZE will then combine the
results from these separate analyses. Joe Schafer at Penn State has
developed four S-Plus libraries for multiple imputing normal,
categorical, mixed, and panel data. He has made the library for normal
data available as a free stand-alone package called NORM. Multiple
Imputation is also available in Solas, but its algorithms have been
questioned as inappropriate, and we cannot recommend its use at this
time.
The second method is to analyze
the full, incomplete data set using maximum likelihood estimation. This
method does not impute any data, but rather uses all data observed for
each case to compute maximum likelihood estimates. The maximum
likelihood estimate of a parameter is the value of the parameter that
is most likely to have resulted in the observed data. When data are
missing, we can factor the likelihood function. The likelihood is
computed separately for those cases with complete data on some
variables and those with complete data on all variables. These two
likelihoods are then maximized together to find the estimates. Like
multiple imputation, this method gives unbiased parameter estimates and
standard errors. One advantage is that it does not require the careful
selection of variables used to impute values that Multiple Imputation
requires. It is, however, limited to linear models.
Analysis of the full,
incomplete data set using maximum likelihood estimation is available in
AMOS. AMOS is a structural equation modeling package, but it can run
multiple linear regression models. AMOS is easy to use and is now
integrated into SPSS, but it will not produce residual plots, influence
statistics, and other typical output from regression packages. The
missing value analysis package in SPSS will do some very limited
maximum likelihood estimates for means and correlations only.
Monthly Tips, Resources, and News.PlusThe Top Resources for Learning 13 Statistical Methods. And our free Webinar Series: The Craft of Statistical Analysis. FREE!