Posts Tagged ‘Missing Data’

Computing Cronbach’s Alpha in SPSS with Missing Data

Friday, July 16th, 2010

I recently received this question:

I have a scale which I want to run Cronbach’s alpha on.  One response category for all items is ‘not applicable’. I want to run Cronbach’s alpha requiring that at least 50% of the items must be answered for the scale to be defined.  Where this is the case, I want all missing values on that scale replaced by the average of the non-missing items on that scale. Is this reasonable? How would I do this in SPSS?
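To make the request concrete, here is a minimal Python sketch (not SPSS syntax) of the rule the question describes: keep a respondent only if at least half the items were answered, fill that respondent's gaps with the mean of their own answered items, then compute Cronbach's alpha on the result. The toy responses and the function names `person_mean_impute` and `cronbach_alpha` are made up for illustration, and `None` stands in for 'not applicable'. The sketch only shows what the rule does; it says nothing about whether the rule is a good idea.

```python
from statistics import mean, variance


def person_mean_impute(rows, min_answered=0.5):
    """Keep respondents who answered at least `min_answered` of the items,
    replacing each of their missing items (None) with the mean of the
    items they did answer. Other respondents are dropped."""
    kept = []
    for row in rows:
        answered = [v for v in row if v is not None]
        if not row or len(answered) / len(row) < min_answered:
            continue  # scale is undefined for this respondent
        fill = mean(answered)
        kept.append([fill if v is None else v for v in row])
    return kept


def cronbach_alpha(rows):
    """Cronbach's alpha from complete rows (one list of item scores per
    respondent), using sample variances."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 4-item scale; None marks a 'not applicable' response.
responses = [
    [4, 5, 4, None],
    [2, 2, None, None],     # exactly 50% answered: kept
    [5, None, None, None],  # only 25% answered: dropped
    [3, 3, 4, 3],
    [1, 2, 1, 2],
]

completed = person_mean_impute(responses)
alpha = cronbach_alpha(completed)
print(len(completed), round(alpha, 3))
```

Notice that every filled-in value sits exactly at that respondent's own average, so the items look more consistent with each other than the observed answers alone can justify.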

My Answer:

In RELIABILITY, the SPSS command for running a Cronbach’s alpha, the only options for Missing Data (more…)

Answers to the Missing Data Quiz

Monday, May 3rd, 2010

In my last post, I gave a little quiz about missing data.  This post has the answers.

If you want to try it yourself before you see the answers, go here. (It’s a short quiz, but if you’re like me, you find testing yourself irresistible).

True or False?

1. Imputation is really just making up data to artificially inflate results.  It’s better to just drop cases with missing data than to impute.

Answer: False!

Imputation has gotten a bad rap because early imputation methods, like mean imputation, bias your results pretty badly.  And single imputation underestimates standard errors.

But imputation has come a long way, baby!

Multiple imputation, when done well, gives pretty much the same unbiased results, with full power, as the full non-missing data set.

2. I can just impute the mean for any missing data.  It won’t affect results, and improves power.

Answer: False!

As I just said, mean imputation is bad imputation.  It does improve power, but your results will be so biased, the improved power won’t help much.  Sure, your results might be significant, but they’re the wrong results!
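One piece of that bias is easy to see for yourself. Here is a small Python sketch, with made-up numbers, showing that filling in the mean leaves the mean alone but shrinks the variance, because every imputed value sits exactly at the center of the distribution. The variable names are just for illustration.

```python
from statistics import mean, variance

# Five observed values, plus five cases where the value is missing.
observed = [2.0, 4.0, 6.0, 8.0, 10.0]
n_missing = 5

# Mean imputation: every missing value becomes the observed mean.
imputed = observed + [mean(observed)] * n_missing

var_observed = variance(observed)  # sample variance of the real data
var_imputed = variance(imputed)    # sample variance after imputation

print(mean(observed), mean(imputed))
print(var_observed, var_imputed)
```

The sample size doubles, so the standard error looks much smaller, but none of that extra precision is real. The same shrinkage also distorts correlations with other variables.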

3. Multiple Imputation is fine for the predictor variables in a statistical model, but not for the response variable.

Answer: False!

It’s true that imputing the response doesn’t add any new information to your regression model.  But if you have missing data in the predictors as well, simultaneously imputing both response and predictors improves those predictor imputations.

4. Multiple Imputation is always the best way to deal with missing data.

Answer: False!

It often is, and it gives good results.  But it’s not always easy to do well, and it is a large-sample technique.

If you’re running a linear or log-linear model (like a regression or linear mixed model), maximum likelihood techniques give the same great, unbiased, uninflated, full power results that multiple imputation does.

But you don’t have to spend the time and resources imputing anything.

5. When imputing, it’s important that the imputations be plausible data points.

Answer: False!

It’s counter-intuitive, but it’s not actually important that imputations be plausible data points.  The important thing when imputing is that your parameter estimates–your means, regression coefficients, or whatever it is you’re using this data to estimate–be accurate.  Not the imputed data itself.

There are a number of situations, like imputing categorical data, where you actually get better parameter estimates when the imputed data itself aren’t plausible values.

6. Missing data isn’t really a problem if I’m just doing simple statistics, like chi-squares and t-tests.

Answer: False!

It’s not the analysis you’re doing, but the percent, pattern, and randomness of the missing data that determines how problematic missing data are.

Even simple statistics need to be accurate and unbiased.  How important is it that your results are correct?

7. The worst thing that missing data does is lower sample size and reduce power.

Answer: False!

The loss of power from listwise deletion–the default in most software–can be quite devastating.

But even worse are the other two effects of missing data: biased parameter estimates and biased standard errors.  They, in essence, make your results, including p-values, wrong.

And they’re worse than low power because you can’t tell they’re wrong.  If you lose half your sample and have no significant results, you notice.  If the regression coefficients or standard errors aren’t what they’re supposed to be, there’s no way to tell.

That makes it worse in my book.

—————————————————————————————————–

How did you do?  (BTW, it took me years of seminars, reading, and trying things out to figure this all out).

But that’s the reason I developed the Effectively Dealing with Missing Data Without Biasing Your Results workshop. So you don’t have to scrape it all together, like I did.

It starts in a few days, on May 6th. We’ll go over these topics, and more, step-by-step.  By the end of the workshop, you’ll know when and how to impute well, how and when to use maximum likelihood techniques, and when simple, traditional techniques like listwise deletion work just fine.

It’s a web-based workshop, so you can join us from anywhere.  And we offer student and non-profit discounts.

Get the details and register here.

Multiple Imputation: 5 Recent Findings that Change How to Use It

Wednesday, March 24th, 2010

Missing Data, and multiple imputation specifically, is one area of statistics that is changing rapidly. Research is still ongoing, and each year new findings on best practices and new techniques in software appear.

The downside for researchers is that some of the recommendations missing data statisticians were making even five years ago have changed.

Remember that there are three goals of multiple imputation, or any missing data technique: Unbiased parameter estimates in the final analysis (more…)

Online Workshop Announcement: Missing Data

Monday, March 22nd, 2010

I probably don’t need to tell you about what missing data does to your analysis.

If you have any experience with missing data, you know it really messes things up.  The thing is, it’s not a data issue like skewness or non-normality that you can just ignore.  It’s going to affect your analysis.  Ignoring it still means choosing a way of dealing with missing data–but you’re using the default method.

Depending on which statistical software you’re using, and the patterns and percentage of missing data, the default may or may not be a perfectly acceptable way of dealing with the missing data.

But in data analysis, it’s always better if you understand the defaults, know what they’re doing, and decide for yourself if it’s the best approach.

And up until about 10 years ago, there weren’t many other options.  There was listwise deletion and there was imputation.  But many of the imputation methods were pretty sketchy.  So it was a “damned if you do, damned if you don’t” kind of situation.

But it’s different now.

In August 1999, just a month after I started at the Statistical Consulting office at Cornell, I saw a talk by Joe Schafer at the Joint Statistical Meetings about multiple imputation.  I was blown away.  It seemed too good to be true–it solved pretty much all of the problems with missing data.

So I read all that I could, attended a week-long mini-class, and tried it all out.

It turns out at that time, you had to use special stand-alone software to implement it, and all the ones I tried were a bit clunky to use.

Luckily, statistical software has caught up.  And in that time, a few new studies have shown that some of the restrictive assumptions of multiple imputation aren’t as restrictive as they at first seemed.  So it’s easier and more accurate than ever.

It’s also become clear that some of those old methods aren’t always as horrible as they seemed–there are some situations when listwise deletion works just fine.

But it pays to know the difference, and how to implement not just multiple imputation, but maximum likelihood approaches, which also give great outcomes and are a bit easier to use.

So I am once again offering an online workshop on missing data:  Effectively Dealing with Missing Data Without Biasing Your Results.  It includes 6 hours of instruction, 2 hours of Q&A, and we’ll go through all the approaches for dealing with missing data in detail:

  • what they are
  • the advantages and disadvantages of each
  • how to implement them in various statistical software
  • the data and analysis situations when it’s best to use each one
  • how to figure out which situations you have

If you have any questions, feel free to contact me.  You can get more details and register here.  It begins May 6, 2010.



New version released of Amelia II: A Program for Missing Data

Tuesday, June 30th, 2009

A new version of Amelia II, a free package for multiple imputation, has just been released today.  Amelia II is available in two versions.  One is part of R, and the other, AmeliaView, is a GUI package that does not require any knowledge of the R programming language.  They both use the same underlying algorithms and both require having R installed.

At the Amelia II website, you can download Amelia II (did I mention it’s free?!), download R, get the very useful User’s Guide, join the Amelia listserve, and get information about multiple imputation.

If you want to learn more about multiple imputation:

3 Ad-hoc Missing Data Approaches that You Should Never Use

Monday, June 15th, 2009

The default approach to dealing with missing data in most statistical software packages is listwise deletion–dropping any case with data missing on any variable involved anywhere in the analysis.  It also goes under the names case deletion and complete case analysis.

Although this approach can be really painful (you worked hard to collect those data, only to drop them!), it does work well in some situations.  By works well, I mean it fits 3 criteria:

- gives unbiased parameter estimates

- gives accurate (or at least conservative) standard error estimates

- results in adequate power.

But not always.  So over the years, a number of ad hoc approaches have been proposed to stop the bloodletting of so much data.  Although each solved some problems of listwise deletion, they created others.  All three have been discredited in recent years and should NOT be used.  They are:

Pairwise Deletion: use the available data for each part of an analysis.  This has been shown to result in correlations outside the -1 to 1 range and other fun statistical impossibilities.

Mean Imputation: substitute the mean of the observed values for all missing data.  There are so many problems, it’s difficult to list them all, but suffice it to say, this technique never meets the above 3 criteria.

Dummy Variable: create a dummy variable that indicates whether a data point is missing, then substitute any arbitrary value for the missing data in the original variable.  Use both variables in the analysis.  While it does help the loss of power, it usually leads to biased results.
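To see what one of those pairwise deletion impossibilities looks like, here is a minimal Python sketch with a deliberately extreme, made-up data set. Each correlation is computed from whichever cases have both variables observed. The helper names `pearson` and `pairwise_r` are my own for this illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pairwise_r(rows, i, j):
    """Correlation of columns i and j using only the rows where both
    values are observed (pairwise deletion)."""
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Columns are x, y, z; None marks a missing value.
rows = [
    (1, 1, None), (2, 2, None), (3, 3, None),   # only x and y observed
    (1, None, 1), (2, None, 2), (3, None, 3),   # only x and z observed
    (None, 1, 3), (None, 2, 2), (None, 3, 1),   # only y and z observed
]

r_xy = pairwise_r(rows, 0, 1)
r_xz = pairwise_r(rows, 0, 2)
r_yz = pairwise_r(rows, 1, 2)

# Determinant of the 3x3 correlation matrix; it can never be negative
# for correlations computed from one complete data set.
det = 1 + 2 * r_xy * r_xz * r_yz - r_xy**2 - r_xz**2 - r_yz**2
print(r_xy, r_xz, r_yz, det)
```

Here x moves perfectly with y and perfectly with z, yet y and z move in perfectly opposite directions. No single complete data set could produce that combination, which is why the determinant comes out negative and why procedures that need a valid correlation matrix can fail or give nonsense.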

There are a number of good techniques for dealing with missing data, some of which are not hard to use, and which are now available in all major stat software.  There is no reason to continue to use ad hoc techniques that create more problems than they solve.

Diagnosing Missing Data: A new way to graph missingness

Thursday, June 4th, 2009

Some approaches to missing data work well in some situations, but perform very poorly in others.  So it’s really important to get a good idea of the type and pattern of missingness in your data.  You may even take different missing data approaches to different variables.

Matt Blackwell of the Harvard Social Science Statistics blog has come up with a nice way to visualize the missingness patterns in a data set.  (I’m a big fan of graphing data to understand it).  He calls it a Missingness Map.

The only drawback seems to be that it will be cumbersome for large data sets.

Missing Data: Criteria for Choosing an Effective Approach

Wednesday, May 20th, 2009

In choosing an approach to missing data, there are a number of things to consider.  But you need to keep in mind what you’re aiming for before you can even consider which approach to take.

There are three criteria we’re aiming for with any missing data technique:

1. Unbiased parameter estimates:  Whether you’re estimating means, regressions, or odds ratios, you want your parameter estimates to be accurate representations of the actual population parameters.  In statistical terms, that means the estimates should be unbiased.  If all the (more…)

Join me in Finding Good Solutions to Missing Data

Monday, May 18th, 2009

About 10 years ago, when I first started consulting, I had a client, Linda, who had a lot of data missing from her data set for her master’s thesis.  She had a pretty big model–about 15 predictors.  And while no one variable was missing more than 5 or 10% of the data, in combination, listwise deletion was getting rid of more than half the cases.  She wasn’t getting any significant results because of the huge loss of power, and with that many dropped cases, it wasn’t clear that she still had a random sample that gave her unbiased results.
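The arithmetic behind that kind of loss is worth a quick look. Here is a rough Python sketch, assuming (and this is my simplifying assumption, not something known about Linda's data) that each predictor is missing independently of the others at the same rate. The function name `complete_case_fraction` is made up for this example.

```python
def complete_case_fraction(miss_rate, n_vars):
    """Expected share of cases with no missing values, assuming each of
    n_vars variables is missing independently with probability miss_rate."""
    return (1 - miss_rate) ** n_vars


# 15 predictors, each missing 5% or 10% of its values.
lost_at_5 = 1 - complete_case_fraction(0.05, 15)
lost_at_10 = 1 - complete_case_fraction(0.10, 15)

print(f"5% missing per variable:  {lost_at_5:.0%} of cases dropped")
print(f"10% missing per variable: {lost_at_10:.0%} of cases dropped")
```

Under that assumption, even 5% missing per variable is enough for listwise deletion to drop more than half the cases once 15 predictors are in the model. Real missingness is rarely independent across variables, so the actual loss can be smaller or larger.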

At that point, modern approaches to dealing with missing data did exist, but they were just beginning to become available in specialized software.  Neither Linda nor I had learned about them in statistics classes, because they just hadn’t hit the mainstream yet.  With a lot of research and a lot of learning (more…)

Five Advantages of Running Repeated Measures ANOVA as a Mixed Model

Wednesday, May 13th, 2009

There are two ways to run a repeated measures analysis. The traditional way is to treat it as a multivariate test–each response is considered a separate variable. The other way is to run it as a mixed model. While the multivariate approach is easy to run and quite intuitive, there are a number of advantages to running a repeated measures analysis as a mixed model.

First I will explain the difference between the approaches, then briefly describe some of the advantages of using the mixed models approach. (more…)