Posts Tagged ‘maximum likelihood’

Answers to the Missing Data Quiz

Monday, May 3rd, 2010

In my last post, I gave a little quiz about missing data.  This post has the answers.

If you want to try it yourself before you see the answers, go here. (It’s a short quiz, but if you’re like me, you find testing yourself irresistible.)

True or False?

1. Imputation is really just making up data to artificially inflate results.  It’s better to just drop cases with missing data than to impute.

Answer: False!

Imputation has gotten a bad rap because early imputation methods, like mean imputation, bias your results pretty badly.  And single imputation underestimates standard errors.

But imputation has come a long way, baby!

Multiple imputation, when done well, gives pretty much the same unbiased results, with full power, as the full non-missing data set.

2. I can just impute the mean for any missing data.  It won’t affect results, and improves power.

Answer: False!

As I just said, mean imputation is bad imputation.  It does improve power, but your results will be so biased, the improved power won’t help much.  Sure, your results might be significant, but they’re the wrong results!

3. Multiple Imputation is fine for the predictor variables in a statistical model, but not for the response variable.

Answer: False!

It’s true that imputing the response doesn’t add any new information to your regression model.  But if you have missing data in the predictors as well, simultaneously imputing both response and predictors improves those predictor imputations.

4. Multiple Imputation is always the best way to deal with missing data.

Answer: False!

It often is, and it is a good approach.  But it’s not always easy to do well, and it is a large sample technique.

If you’re running a linear or log-linear model (like a regression or linear mixed model), maximum likelihood techniques give the same great, unbiased, uninflated, full power results that multiple imputation does.

But you don’t have to spend the time and resources imputing anything.

5. When imputing, it’s important that the imputations be plausible data points.

Answer: False!

It’s counter-intuitive, but it’s not actually important that imputations be plausible data points.  The important thing when imputing is that your parameter estimates–your means, regression coefficients, or whatever it is you’re using this data to estimate–be accurate.  Not the imputed data itself.

There are a number of situations, like imputing categorical data, where you actually get better parameter estimates when the imputed data itself aren’t plausible values.

6. Missing data isn’t really a problem if I’m just doing simple statistics, like chi-squares and t-tests.

Answer: False!

It’s not the analysis you’re doing, but the percent, pattern, and randomness of the missing data that determines how problematic missing data are.

Even simple statistics need to be accurate and unbiased.  How important is it that your results are correct?

7. The worst thing that missing data does is lower sample size and reduce power.

Answer: False!

The loss of power from listwise deletion–the default in most software–can be quite devastating.

But even worse are the other two effects of missing data: biased parameter estimates and biased standard errors.  They, in essence, make your results, including p-values, wrong.

And they’re worse than low power because you can’t tell they’re wrong.  If you lose half your sample and have no significant results, you notice.  If the regression coefficients or standard errors aren’t what they’re supposed to be, there’s no way to tell.

That makes it worse in my book.

—————————————————————————————————–

How did you do?  (BTW, it took me years of seminars, reading, and trying things out to figure this all out).

But that’s the reason I developed the Effectively Dealing with Missing Data Without Biasing Your Results workshop. So you don’t have to scrape it all together, like I did.

It starts in a few days, on May 6th. We’ll go over these topics, and more, step-by-step.  By the end of the workshop, you’ll know when and how to impute well, how and when to use maximum likelihood techniques, and when simple, traditional techniques like listwise deletion work just fine.

It’s a web-based workshop, so you can join us from anywhere.  And we offer student and non-profit discounts.

Get the details and register here.

Missing Data: Criteria for Choosing an Effective Approach

Wednesday, May 20th, 2009

In choosing an approach to missing data, there are a number of things to consider.  But you need to keep in mind what you’re aiming for before you can even consider which approach to take.

There are three criteria we’re aiming for with any missing data technique:

1. Unbiased parameter estimates:  Whether you’re estimating means, regressions, or odds ratios, you want your parameter estimates to be accurate representations of the actual population parameters.  In statistical terms, that means the estimates should be unbiased.  If all the (more…)

Join me in Finding Good Solutions to Missing Data

Monday, May 18th, 2009

About 10 years ago, when I first started consulting, I had a client, Linda, who had a lot of data missing from her data set for her master’s thesis.  She had a pretty big model–about 15 predictors.  And while no one variable was missing more than 5 or 10% of the data, in combination, listwise deletion was getting rid of more than half the cases.  She wasn’t getting any significant results because of the huge loss of power, and with that many dropped cases, it wasn’t clear that she still had a random sample that gave her unbiased results.
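To see how small amounts of missingness per variable can add up, here is a rough back-of-the-envelope sketch. It assumes that each of the 15 predictors is missing independently and at the same rate. That is an assumption made purely for illustration, not something known about Linda's data, and the function name is hypothetical.

```python
def complete_case_fraction(n_vars, miss_rate):
    """Expected share of cases with no missing values at all, assuming
    every variable is missing independently at the same rate
    (an illustrative assumption, not a fact about any real data set)."""
    return (1.0 - miss_rate) ** n_vars


for rate in (0.05, 0.10):
    kept = complete_case_fraction(15, rate)
    print(f"{rate:.0%} missing per variable: "
          f"about {kept:.0%} of cases survive listwise deletion")
```

Under that assumption, 5% missing on each of 15 variables leaves only about 46% of the cases complete, which is in line with losing more than half the sample.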

At that point, modern approaches to dealing with missing data did exist, but they were just beginning to become available in specialized software.  Neither Linda nor I had learned about them in statistics classes, because they just hadn’t hit the mainstream yet.  With a lot of research and a lot of learning (more…)

EM Imputation and Missing Data: Is Mean Imputation Really so Terrible?

Wednesday, April 15th, 2009

I’m sure I don’t need to explain to you all the problems that occur as a result of missing data.  Anyone who has dealt with missing data—that means everyone who has ever worked with real data—knows about the loss of power and sample size, and the potential bias in your data that comes with listwise deletion.

Listwise deletion is the default method for dealing with missing data in most statistical software packages.  It simply means excluding from the analysis any cases with data missing on any variables involved in the analysis.

A very simple, and in many ways appealing, method devised to overcome these problems is mean imputation. Once again, I’m sure you’ve heard of it–just plug in the mean for that variable for all the missing values. The nice part is the mean isn’t affected, and you don’t lose that case from the analysis. And it’s so easy! SPSS even has a little button to click to just impute all those means.

But there are new problems. True, the mean doesn’t change, but the relationships with other variables do. And that’s usually what you’re interested in, right? Well, now they’re biased. And while the sample size remains at its full value, the standard error of that variable will be vastly underestimated–and this underestimation gets bigger the more missing data there are. Too-small standard errors lead to too-small p-values, so now you’re reporting results that should not be there.
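A small simulation can make these two problems concrete. The sketch below uses made-up data (two variables correlated at about 0.6, with 40% of one of them deleted completely at random) and only the Python standard library. The sample size, missing rate, and variable names are all assumptions chosen for illustration.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def sd(v):
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))


def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a) *
                    sum((q - mb) ** 2 for q in b))
    return num / den


rng = random.Random(0)
n = 2000
y = [rng.gauss(0, 1) for _ in range(n)]
x = [0.6 * yi + rng.gauss(0, 0.8) for yi in y]  # true correlation about 0.6

# Delete 40% of x completely at random (None marks a missing value).
x_miss = [xi if rng.random() > 0.4 else None for xi in x]

# Complete cases only.
x_obs = [xi for xi in x_miss if xi is not None]
y_obs = [yi for xi, yi in zip(x_miss, y) if xi is not None]

# Mean imputation: every missing x is replaced by the observed mean of x.
x_bar = mean(x_obs)
x_imp = [xi if xi is not None else x_bar for xi in x_miss]

corr_complete = corr(x_obs, y_obs)
corr_imputed = corr(x_imp, y)
se_complete = sd(x_obs) / math.sqrt(len(x_obs))
se_imputed = sd(x_imp) / math.sqrt(n)

print("mean of x, observed vs. imputed:", x_bar, mean(x_imp))
print("correlation with y, complete cases vs. imputed:",
      corr_complete, corr_imputed)
print("standard error of mean of x, complete cases vs. imputed:",
      se_complete, se_imputed)
```

The mean of x comes out the same either way, but the correlation with y is pulled toward zero and the standard error shrinks, because the imputed values add cases without adding any variation.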

There are other options. Multiple Imputation and Maximum Likelihood both solve these problems. But Multiple Imputation is not available in all the major stats packages, and it is very labor-intensive to do well. And Maximum Likelihood isn’t hard or labor-intensive, but requires using structural equation modeling software, such as AMOS or MPlus.

The good news is there are other imputation techniques that are still quite simple, and don’t cause bias in some situations. And sometimes (although rarely) it really is okay to use mean imputation. When?

If your rate of missing data is very, very small, it honestly doesn’t matter what technique you use. I’m talking very, very, very small (2-3%).

There is another, better method for imputing single values, however, that is only slightly more difficult than mean imputation. It uses the E-M Algorithm, which stands for Expectation-Maximization. It is an iterative procedure that uses other variables to impute a value (Expectation), then checks whether that is the value most likely (Maximization). If not, it re-imputes a more likely value. This goes on until it reaches the most likely value.
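For readers who want to see the iteration spelled out, here is a minimal sketch of the textbook EM algorithm for the simplest case: two variables assumed to be bivariate normal, with y fully observed and x partly missing. It is not the routine any particular package uses, and the function names and simulated data are assumptions for illustration. In this formulation the Expectation step fills in the expected value of each missing x (plus its expected square) from the current parameter estimates, and the Maximization step re-estimates the means, variances, and covariance from those filled-in sums.

```python
import random


def em_bivariate(x, y, tol=1e-10, max_iter=1000):
    """EM estimates for (x, y), assumed bivariate normal, where y is fully
    observed and missing entries of x are marked as None."""
    n = len(y)
    mu_y = sum(y) / n
    s_yy = sum((v - mu_y) ** 2 for v in y) / n

    # Starting values taken from the complete cases.
    obs = [(a, b) for a, b in zip(x, y) if a is not None]
    mu_x = sum(a for a, _ in obs) / len(obs)
    s_xx = sum((a - mu_x) ** 2 for a, _ in obs) / len(obs)
    s_xy = sum((a - mu_x) * (b - mu_y) for a, b in obs) / len(obs)

    for _ in range(max_iter):
        slope = s_xy / s_yy
        resid_var = s_xx - slope * s_xy
        # E-step: expected x, x^2 and x*y for every case, given the
        # current parameter estimates.
        sum_x = sum_xx = sum_xy = 0.0
        for a, b in zip(x, y):
            if a is None:
                e_x = mu_x + slope * (b - mu_y)
                e_xx = e_x ** 2 + resid_var
            else:
                e_x, e_xx = a, a ** 2
            sum_x += e_x
            sum_xx += e_xx
            sum_xy += e_x * b
        # M-step: re-estimate the parameters from the expected sums.
        new_mu_x = sum_x / n
        new_s_xx = sum_xx / n - new_mu_x ** 2
        new_s_xy = sum_xy / n - new_mu_x * mu_y
        change = (abs(new_mu_x - mu_x) + abs(new_s_xx - s_xx)
                  + abs(new_s_xy - s_xy))
        mu_x, s_xx, s_xy = new_mu_x, new_s_xx, new_s_xy
        if change < tol:
            break
    return mu_x, mu_y, s_xx, s_xy, s_yy


def em_impute(x, y):
    """Replace each missing x with its expected value given y, using the
    converged EM estimates. This is still a single imputation."""
    mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate(x, y)
    slope = s_xy / s_yy
    return [a if a is not None else mu_x + slope * (b - mu_y)
            for a, b in zip(x, y)]


# Simulated example data (made up for illustration).
rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(1000)]
x_full = [0.6 * b + rng.gauss(0, 0.8) for b in y]
x_miss = [a if rng.random() > 0.3 else None for a in x_full]

mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate(x_miss, y)
x_filled = em_impute(x_miss, y)
print("EM slope of x on y:", s_xy / s_yy)
```

Because each imputed value is predicted from y, the relationship between the two variables is carried into the filled-in data. The imputed values carry no random error, though, which is why the standard errors still come out too small.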

EM imputations are better than mean imputations because they preserve the relationship with other variables, which is vital if you go on to use something like Factor Analysis or Regression. They still underestimate standard errors, however, so once again, this approach is only reasonable if the percentage of missing data is very small (under 5%) and if the standard error of individual items is not vital (as when combining individual items into an index).

The heavy hitters like Multiple Imputation and Maximum Likelihood are still superior methods of dealing with missing data and are in most situations the only viable approach. But you need to fit the right tool to the size of the problem. It may be true that backhoes are better at digging holes than trowels, but trowels are just right for digging small holes. It’s better to use a small tool like EM when it fits than to ignore the problem altogether.

EM Imputation is available in SAS, Stata, R, and the SPSS Missing Values Analysis module.

If you want to learn more about missing data and the different approaches for handling it, sign up for my next training teleseminar: Approaches to Missing Data: The Good, the Bad, and the Unthinkable. It’s free.

Upcoming 2 Day Course on Missing Data

Friday, November 14th, 2008

I’m passing on the following announcement about an upcoming Missing Data short course by Paul Allison, who is a brilliant author and speaker on statistical methodology. I’ve read many of his books, although I’ve never attended one of his workshops. I’ve talked with people who have, however, and I’ve heard he’s fabulous. So if you’re going to be near D.C. in December, it’s worth a look:

Dr. Paul Allison will offer his two-day course, Missing Data, on December 5-6 in Washington, DC.

The course provides an in-depth look at modern methods for handling missing data, with emphasis on maximum likelihood and multiple imputation. These methods have been demonstrated to be markedly superior to conventional methods like listwise deletion or single imputation, while at the same time resting on less stringent assumptions.

Dr. Allison is Professor of Sociology at the University of Pennsylvania. He is the author of five books on using statistics, including Missing Data. He is a former Guggenheim Fellow and a recipient of the Lazarsfeld Award for Distinguished Contributions to Sociological Methodology.

You can get more detailed information about the course at www.StatisticalHorizons.com

The Second Problem with Mean Imputation

Thursday, October 2nd, 2008

A previous post discussed the first reason to not use mean imputation as a way of dealing with missing data–it does not preserve the relationships among variables.

A second reason is that any type of single imputation underestimates error variation in any statistic that uses the imputed data.  Because the imputations are themselves estimates, there is some error associated with them.  But your statistical software doesn’t know that.  It treats them as real data.

Ultimately, because your standard errors are too low, so are your p-values.  Now you’re making Type I errors without realizing it.

A better approach?  Multiple Imputation or Full Information Maximum Likelihood.
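One way to see what single imputation leaves out is the standard pooling formula for multiple imputation, usually called Rubin's rules. The pooled variance is the average within-imputation variance plus a between-imputation term, and that second term is exactly the imputation uncertainty that a single imputed data set cannot show. The sketch below uses made-up coefficient estimates and standard errors from five hypothetical imputed data sets. The function name is an assumption for illustration.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates: the parameter estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)
    within = statistics.mean(variances)        # ordinary sampling variance
    between = statistics.variance(estimates)   # spread across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Made-up numbers: one regression coefficient from 5 imputed data sets.
estimates = [0.52, 0.47, 0.58, 0.50, 0.43]
std_errors = [0.10, 0.11, 0.10, 0.09, 0.10]
variances = [s ** 2 for s in std_errors]

pooled_est, pooled_se = pool_rubin(estimates, variances)
within_only_se = math.sqrt(statistics.mean(variances))

print("pooled estimate:", pooled_est)
print("pooled SE (within + between):", pooled_se)
print("SE ignoring between-imputation variance:", within_only_se)
```

The pooled standard error is larger than the one that ignores the between-imputation variance. Single imputation effectively reports only the smaller number.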

Two Recommended Solutions for Missing Data: Multiple Imputation and Maximum Likelihood

Tuesday, November 30th, 1999

Two methods for dealing with missing data, vast improvements over traditional approaches, have become available in mainstream statistical software in the last few years.

Both of the methods discussed here require that the data are missing at random, meaning that, given the observed data, whether a value is missing is not related to the missing value itself. If this assumption holds, resulting estimates (i.e., regression coefficients and standard errors) will be unbiased with no loss of power.

The first method is Multiple Imputation (MI). Just like the old-fashioned imputation (more…)