Quiz Yourself about Missing Data

May 3rd, 2010

Do you find quizzes irresistible?  I do.

Here’s a little quiz about working with missing data:

True or False?

1. Imputation is really just making up data to artificially inflate results.  It’s better to just drop cases with missing data than to impute.

2. I can just impute the mean for any missing data.  It won’t affect results, and improves power.

3. Multiple Imputation is fine for the predictor variables in a statistical model, but not for the response variable.

4. Multiple Imputation is always the best way to deal with missing data.

5. When imputing, it’s important that the imputations be plausible data points.

6. Missing data isn’t really a problem if I’m just doing simple statistics, like chi-squares and t-tests.

7. The worst thing that missing data does is lower sample size and reduce power.


Answers to the Missing Data Quiz

May 3rd, 2010

In my last post, I gave a little quiz about missing data.  This post has the answers.

If you want to try it yourself before you see the answers, go here. (It’s a short quiz, but if you’re like me, you find testing yourself irresistible).

True or False?

1. Imputation is really just making up data to artificially inflate results.  It’s better to just drop cases with missing data than to impute.

Answer: False!

Imputation has gotten a bad rap because early imputation methods, like mean imputation, bias your results pretty badly.  And single imputation underestimates standard errors.

But imputation has come a long way, baby!

Multiple imputation, when done well, gives pretty much the same unbiased results, with full power, as the full non-missing data set.

2. I can just impute the mean for any missing data.  It won’t affect results, and improves power.

Answer: False!

As I just said, mean imputation is bad imputation.  It does improve power, but your results will be so biased, the improved power won’t help much.  Sure, your results might be significant, but they’re the wrong results!
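To see the kind of bias at work here, a small simulation can help. This is only a sketch under assumptions of my choosing: the data are simulated, the true slope of 2.0, the 40% missing rate, and the variable names are all hypothetical, and values are knocked out completely at random. It compares the spread of y and its correlation with x before and after mean imputation.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical complete data: y depends linearly on x plus noise.
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

# Knock out roughly 40% of the y values completely at random.
missing = [random.random() < 0.4 for _ in range(n)]
observed_y = [yi for yi, m in zip(y, missing) if not m]

# Mean imputation: every missing y gets the mean of the observed y values.
fill = statistics.fmean(observed_y)
y_imputed = [fill if m else yi for yi, m in zip(y, missing)]


def correlation(a, b):
    """Pearson correlation, written out by hand to stay version-independent."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


sd_full = statistics.stdev(y)
sd_imputed = statistics.stdev(y_imputed)
r_full = correlation(x, y)
r_imputed = correlation(x, y_imputed)

print(f"SD of y:     full={sd_full:.2f}  mean-imputed={sd_imputed:.2f}")
print(f"corr(x, y):  full={r_full:.2f}  mean-imputed={r_imputed:.2f}")
```

Because every imputed value sits exactly at the mean, the imputed variable has less spread than the real one, and its relationship with x is weakened. The sample size looks complete, but the variance and the correlation are both pulled toward zero.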

3. Multiple Imputation is fine for the predictor variables in a statistical model, but not for the response variable.

Answer: False!

It’s true that imputing the response doesn’t add any new information to your regression model.  But if you have missing data in the predictors as well, simultaneously imputing both response and predictors improves those predictor imputations.

4. Multiple Imputation is always the best way to deal with missing data.

Answer: False!

It often is, and it gives good results.  But it’s not always easy to do well, and it is a large sample technique.

If you’re running a linear or log-linear model (like a regression or linear mixed model), maximum likelihood techniques give the same great, unbiased, uninflated, full power results that multiple imputation does.

But you don’t have to spend the time and resources imputing anything.

5. When imputing, it’s important that the imputations be plausible data points.

Answer: False!

It’s counter-intuitive, but it’s not actually important that imputations be plausible data points.  The important thing when imputing is that your parameter estimates–your means, regression coefficients, or whatever it is you’re using this data to estimate–be accurate.  Not the imputed data itself.

There are a number of situations, like imputing categorical data, where you actually get better parameter estimates when the imputed data itself aren’t plausible values.

6. Missing data isn’t really a problem if I’m just doing simple statistics, like chi-squares and t-tests.

Answer: False!

It’s not the analysis you’re doing, but the percent, pattern, and randomness of the missing data that determine how problematic missing data are.

Even simple statistics need to be accurate and unbiased.  How important is it that your results are correct?

7. The worst thing that missing data does is lower sample size and reduce power.

Answer: False!

The loss of power from listwise deletion–the default in most software–can be quite devastating.
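A quick bit of arithmetic shows how fast listwise deletion eats a sample. This sketch assumes, purely for illustration, that each variable in the model is missing independently at the same rate; real missingness is rarely that tidy, and the function name is my own.

```python
def complete_case_fraction(miss_rate, n_variables):
    """Expected share of cases with no missing values, assuming each variable
    is missing independently at the same rate (an illustrative assumption)."""
    return (1.0 - miss_rate) ** n_variables


# How much of the sample survives listwise deletion at 5% missing per variable?
for k in (1, 5, 10, 20):
    frac = complete_case_fraction(0.05, k)
    print(f"{k:2d} variables, 5% missing each: {frac:.0%} of cases kept")
```

Under those assumptions, a model with 10 variables keeps only about 60% of the cases, even though no single variable is missing more than 5% of its values.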

But even worse are the other two effects of missing data: biased parameter estimates and biased standard errors.  They, in essence, make your results, including p-values, wrong.

And they’re worse than low power because you can’t tell they’re wrong.  If you lose half your sample and have no significant results, you notice.  If the regression coefficients or standard errors aren’t what they’re supposed to be, there’s no way to tell.

That makes it worse in my book.

—————————————————————————————————–

How did you do?  (BTW, it took me years of seminars, reading, and trying things out to figure this all out).

But that’s the reason I developed the Effectively Dealing with Missing Data Without Biasing Your Results workshop. So you don’t have to scrape it all together, like I did.

It starts in a few days, on May 6th. We’ll go over these topics, and more, step-by-step.  By the end of the workshop, you’ll know when and how to impute well, how and when to use maximum likelihood techniques, and when simple, traditional techniques like listwise deletion work just fine.

It’s a web-based workshop, so you can join us from anywhere.  And we offer student and non-profit discounts.

Get the details and register here.

Great Resources for Your Literature Review

April 30th, 2010

by Ursula Saqui, Ph.D.

This is the second post of a two-part series on the overall process of doing a literature review.  Part one discussed the benefits of doing a literature review, how to get started, and knowing when to stop.

You have made a commitment to do a literature review, have the purpose defined, and are ready to get started.

Where do you find your resources?

If you are not in academia, do not have access to a top-notch library, or do not receive the industry publications of interest, you may need to get creative if you do not want to pay for each article. (In a pinch, I have paid up to $36 for an article, which can add up if you are conducting a comprehensive literature review!)

Here is where the internet and other community resources can be your best friends.

  • Know the difference between Google and Google Scholar. Google is helpful for popular mainstream publications whereas Google Scholar focuses only on scholarly references such as articles, theses, books, abstracts, and court opinions that are written by academics and other professional scholars.
  • ResearchGATE is an example of a collaborative scientific community that indexes articles. Many times you can find the full text of articles at no charge.
  • Your state may offer access to different databases for its residents. For example, in my home state of Indiana, residents have access to Inspire, a collection of resources, databases, and government publications. Click here to see if your state offers a similar resource.
  • Check your local community library. They may not have the resources you need but they can often get them through inter-library loan. For example, my local community library does not carry advanced statistics books but the librarians can get them for me via their borrowing privileges with universities.
  • Even without access to a specific database, you can search thousands of reports on government-sponsored research conducted by the U.S. government or one of its affiliates. For example, in completing a literature review of service learning programs, I found a government report that summarized 10 years of research in service learning. (That made my day!)
  • Private foundations or research companies may also conduct high-quality peer-reviewed research. For example, the Robert Wood Johnson Foundation conducts and disseminates research on issues related to health and health care.
  • If you know who authored the article, you can sometimes find a PDF of the article on the author’s personal or university website, listed under their vita or recent publications.
  • Try to contact the author directly. When I have contacted authors, they have graciously sent me a complimentary copy of their article.

Still stuck?  Hire someone who knows how to do a good literature review and has access to quality resources.

On a budget?  Hire a student who has access to an academic library.  Many times students can get credit for working on research and business projects through internships or experiential learning programs. This situation is a win-win.  You get the information you need and the student gets academic credit along with exposure to new ideas and topics.

About the Author: With expertise in human behavior and research, Ursula Saqui, Ph.D. gives clarity and direction to her clients’ projects, which inevitably lead to better results and strategies. She is the founder of Saqui Research.


The Literature Review: The Foundation of Any Successful Research Project

April 23rd, 2010

by Ursula Saqui, Ph.D.

This post is the first of a two-part series on the overall process of doing a literature review.  Part two will cover where to find your resources.

Would you build your house without a foundation?  Of course not!  However, many people skip the first step of any empirically based project–conducting a literature review.  Like the foundation of your house, the literature review is the foundation of your project.

Having a strong literature review gives structure to your research method and informs your statistical analysis.  If your literature review is weak or non-existent, …

A Sneak Peek at SPSS 19

April 21st, 2010

At times, SPSS seems to come out with new versions faster than rabbits.

And while sometimes that’s annoying (especially when your organization won’t upgrade their site license, even though all your collaborators’ organizations did), SPSS has added some really great functionality to recent versions.

From a researcher’s point of view, the best one in version 17 was adding Multiple Imputation to the Missing Values Analysis module.  It’s an add-on module, so many site licenses don’t contain it, but I encourage you to lobby your site license administrator to ante up for it.  (No, I don’t get any kickbacks for mentioning this.)  It has the best missing data diagnostics of any software I’ve used, but missing multiple imputation (pun intended) was a major disadvantage.

Version 18 had more subtle improvements, like new nonparametric tests.

But IBM is planning some exciting additions to version 19, most notably Generalized …

Steps to Take When Your Regression (or Other Statistical) Results Just Look…Wrong

April 19th, 2010

You’ve probably experienced this before. You’ve done a statistical analysis, you’ve figured out all the steps, you finally get results and are able to interpret them. But they just look…wrong. Backwards, or even impossible—theoretically or logically.

This happened a few times recently to a couple of my consulting clients, and once to me. So I know that feeling of panic well. There are so many possible causes of incorrect results, but there are a few steps you can take that will help you figure out which one you’ve got and how (and whether) to correct it.

Errors in Data Coding and Entry

In both of my clients’ cases, the problem was that they had coded missing data with an impossible and extreme value, like 99. But they failed to define that code as missing in SPSS. So SPSS took 99 as a real data point, which …
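To get a feel for how much damage an undeclared missing code can do, here is a minimal sketch in Python rather than SPSS. The responses and the 1-to-5 scale are made up for illustration; only the code 99 comes from the example above. In SPSS itself, the fix is to declare the code as a missing value for that variable.

```python
import statistics

MISSING_CODE = 99  # the sentinel value from the example above

# Hypothetical responses on a 1-5 scale; 99 marks "no answer".
responses = [4, 2, 5, 99, 3, 4, 99, 1, 5, 3]

# Treating 99 as a real data point, as the software does by default.
naive_mean = statistics.fmean(responses)

# Dropping the sentinel before summarizing.
valid = [r for r in responses if r != MISSING_CODE]
clean_mean = statistics.fmean(valid)

# A simple range check is often enough to spot the problem.
out_of_range = [r for r in responses if not 1 <= r <= 5]

print(f"mean with 99 treated as data: {naive_mean}")
print(f"mean with 99 excluded:        {clean_mean}")
print(f"values outside the 1-5 scale: {out_of_range}")
```

Checking the minimum and maximum of every variable against its valid range is a cheap first step whenever results look impossible.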

Multiple Comparisons in Nonparametric Tests

April 5th, 2010

I received a question about controlling for inflated Type I error through Bonferroni corrections in nonparametric tests.  Here’s the specific question and my quick answer:

My colleague is applying non parametric (Kruskal-Wallis) to check for differences between groups. There are 12 groups and test showed that there is significant difference in the groups. However, to check which pair is significant is tedious and I’m not sure if there is comparable post-hoc test in non-parametric approach. Any resources available in hands?

My answer:

Bonferroni correction is your only option when applying non-parametric statistics (that I’m aware of). Or, actually, any test other than ANOVA.

A Bonferroni correction is actually very simple.  Just take the number of comparisons you want to make, then multiply each p-value by that number.  If the adjusted p-value is greater than 1, set it to 1.0.
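The calculation can be sketched in a few lines of Python. The raw p-values below are made up for illustration, and the function name is my own; the only number taken from the question is the 12 groups, which give 66 possible pairwise comparisons.

```python
from math import comb


def bonferroni(p_values, n_comparisons):
    """Multiply each p-value by the number of comparisons, capping at 1.0."""
    return [min(1.0, p * n_comparisons) for p in p_values]


n_groups = 12
n_pairs = comb(n_groups, 2)  # every pair of groups: 12 * 11 / 2

# Hypothetical raw p-values from three of the pairwise tests.
raw = [0.0004, 0.02, 0.30]
adjusted = bonferroni(raw, n_pairs)

for p, p_adj in zip(raw, adjusted):
    print(f"raw p = {p:<7} adjusted p = {p_adj:.4f}")
```

Note that the multiplier is the total number of comparisons you planned to make (66 here), not the number of p-values you happen to report. With that many comparisons, a raw p-value of .02 is no longer significant after correction.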



Multiple Imputation: 5 Recent Findings that Change How to Use It

March 24th, 2010

Missing data, and multiple imputation specifically, is one area of statistics that is changing rapidly. Research is still ongoing, and each year new findings on best practices and new techniques in software appear.

The downside for researchers is that some of the recommendations missing data statisticians were making even five years ago have changed.

Remember that there are three goals of multiple imputation, or any missing data technique: Unbiased parameter estimates in the final analysis …

Online Workshop Announcement: Missing Data

March 22nd, 2010

I probably don’t need to tell you about what missing data does to your analysis.

If you have any experience with missing data, you know it really messes things up.  The thing is, it’s not a data issue like skewness or non-normality that you can just ignore.  It’s going to affect your analysis.  Ignoring it still means choosing a way of dealing with missing data–but you’re using the default method.

Depending on which statistical software you’re using, and the patterns and percentage of missing data, the default may or may not be a perfectly acceptable way of dealing with the missing data.

But in data analysis, it’s always better if you understand the defaults, know what they’re doing, and decide for yourself if it’s the best approach.

And up until about 10 years ago, there weren’t many other options.  There was listwise deletion and there was imputation.  But many of the imputation methods were pretty sketchy.  So it was a “damned if you do, damned if you don’t” kind of situation.

But it’s different now.

In August 1999, just a month after I started at the Statistical Consulting office at Cornell, I saw a talk by Joe Schafer at the Joint Statistical Meetings about multiple imputation.  I was blown away.  It seemed too good to be true–it solved pretty much all of the problems with missing data.

So I read all that I could, attended a week-long mini-class, and tried it all out.

It turns out at that time, you had to use special stand-alone software to implement it, and all the ones I tried were a bit clunky to use.

Luckily, statistical software has caught up.  And in that time, a few new studies have shown that some of the restrictive assumptions of multiple imputation aren’t as restrictive as they at first seemed.  So it’s easier and more accurate than ever.

It’s also become clear that some of those old methods aren’t always as horrible as they seemed–there are some situations when listwise deletion works just fine.

But it pays to know the difference, and how to implement not just multiple imputation, but maximum likelihood approaches, which also give great outcomes and are a bit easier to use.

So I am once again offering an online workshop on missing data:  Effectively Dealing with Missing Data Without Biasing Your Results.  It includes 6 hours of instruction and 2 hours of Q&A, and we’ll go through all the approaches for dealing with missing data in detail:

  • what they are
  • the advantages and disadvantages of each
  • how to implement them in various statistical software
  • the data and analysis situations when it’s best to use each one
  • how to figure out which situations you have

If you have any questions, feel free to contact me.  You can get more details and register here.  It begins May 6, 2010.



Free Webinar today: Principal Component Analysis

March 17th, 2010

The next Craft of Statistical Analysis Webinar* is today: Principal Component Analysis

Principal Component Analysis is a variable reduction procedure–it allows you to summarize the common variation in many variables into just a few. It’s similar to Factor Analysis, but has different underlying assumptions and meanings of the components.

This webinar will summarize what it is, when to use it, how it differs from Factor Analysis, and briefly demonstrate the 5 steps to conducting a Principal Component Analysis.

Date: Wednesday, March 17, 2010

Time: 1pm Eastern Time (GMT-4)  NOTE: We just switched to Daylight Saving Time in the US.  If you’re outside the US, note that the usual time difference is off by an hour.

Where: Anywhere you have a fast internet connection

Length of Program: An Hour

Cost: Always FREE

Register at: http://www.analysisfactor.com/learning/webinar15.html

What’s a Craft of Statistical Analysis Webinar?  It’s a regular webinar series for researchers to help you hone the craft of statistical analysis.  Each webinar is about a single statistical topic that is often confusing, misunderstood, or not well known to researchers.  Check it out and pass the word along–they’re free!

