On Data Integrity and Cleaning

July 30th, 2010

This year I hired a QuickBooks consultant to bring my bookkeeping up from the stone age.  (I had been using Excel.)

She had asked for some documents with detailed data, and I tried to send her something else as a shortcut.  I thought it was detailed enough. It wasn’t, so she just fudged it. The bottom line was all correct, but the data that put it together was all wrong.

I hit the roof. Internally, only–I realized it was my own fault for not giving her the info she needed.  She did a fabulous job.

But I could not leave the data fudged, even if it all added up to the right amount, and already reconciled. I had to go in and spend hours fixing it. Truthfully, I was a bit of a compulsive nut about it.

And then I had to ask myself why I was so uptight–if accountants think the details aren’t important, why do I? Statisticians are all about approximations and accountants are exact, right?

As it turns out, not so much.

But I realized I’ve had 20 years of training about the importance of data integrity. Sure, the results might be inexact, the analysis, the estimates, the conclusions. But not the data. The data must be clean.

Sparkling, if possible.

In research, it’s okay if the bottom line is an approximation.  Because we’re never really measuring the whole population.  And we can’t always measure precisely what we want to measure.  But in the long run, it all averages out.

But only if the measurements we do have are as accurate as they possibly can be.



Computing Cronbach’s Alpha in SPSS with Missing Data

July 16th, 2010

I recently received this question:

I have a scale which I want to run Cronbach’s alpha on.  One response category for all items is ‘not applicable’. I want to run Cronbach’s alpha requiring that at least 50% of the items must be answered for the scale to be defined.  Where this is the case, I want all missing values on that scale replaced by the average of the non-missing items on that scale. Is this reasonable? How would I do this in SPSS?

My Answer:

In RELIABILITY, the SPSS command for running a Cronbach’s alpha, the only options for Missing Data Read the rest of this entry »
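
The rule described in the question (define the scale only when at least 50% of the items are answered, then fill the gaps with that respondent’s mean of the answered items) can be sketched outside SPSS. The Python below is only an illustration of that rule together with the standard Cronbach’s alpha formula; it is not SPSS syntax and not the answer given in the full post. The function names, the made-up responses, and the use of None for ‘not applicable’ are all assumptions for the example.

```python
from statistics import mean, variance

def fill_with_person_mean(row, min_share=0.5):
    """Fill None entries with the mean of the answered items.

    Returns None when fewer than min_share of the items were answered,
    so the scale stays undefined for that respondent.
    """
    answered = [x for x in row if x is not None]
    if len(answered) < min_share * len(row):
        return None
    person_mean = mean(answered)
    return [person_mean if x is None else x for x in row]

def cronbach_alpha(rows):
    """Cronbach's alpha from complete rows (one list of item scores per respondent)."""
    k = len(rows[0])
    item_variances = [variance(col) for col in zip(*rows)]
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses to a 4-item scale; None stands for "not applicable".
raw = [
    [4, 5, None, 4],
    [2, None, None, None],   # only 25% answered: scale left undefined
    [3, 3, 2, None],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
filled = [r for r in (fill_with_person_mean(row) for row in raw) if r is not None]
alpha = cronbach_alpha(filled)
print(len(filled), round(alpha, 3))
```

The sketch shows how the rule would be carried out, not whether it is reasonable. Filling gaps with a respondent’s own mean changes the item variances that alpha is built from.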

Cohort and Case-Control Studies: Pros and Cons

June 7th, 2010

by Annette Gerritsen, Ph.D.

Two designs commonly used in epidemiology are the cohort and case-control studies. Both study causal relationships between a risk factor and a disease. What is the difference between these two designs? And when should you opt for the one or the other?

Cohort studies

Cohort studies begin with a group of people (a cohort) free of disease. The people in the cohort are grouped by whether or not they are exposed to a potential cause of disease. The whole cohort is followed over time to see if Read the rest of this entry »

Online Workshop Announcement: Calculating Power and Sample Size

June 1st, 2010

Need to do some sample size calculations?

Actually running sample size calculations isn’t terribly hard, if you understand the statistical analyses you’re running them for.  Software is available that makes it pretty easy.  No more power tables (remember those?).  The hardest part is often setting them up.

The first part of setting them up is really, truly pinning down your research hypotheses.  To do a sample size calculation, you need to know what analysis you’ll be doing.  And to Read the rest of this entry »
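
To illustrate that the arithmetic is the easy part once the analysis is pinned down, here is a rough sketch for one specific case: a two-sided comparison of two group means, using the normal approximation. The choice of analysis, the function name, and the defaults (alpha = 0.05, power = 0.80) are assumptions for illustration. Dedicated power software typically works from the t distribution and gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the standardized mean difference (Cohen's d).
    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

print(n_per_group(0.5))  # a medium standardized effect
print(n_per_group(0.2))  # a small effect needs far more cases
```

Every input to the function is a decision about the study (which test, what effect size matters, how much power), which is why the setup is harder than the calculation.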

Clarifications on Interpreting Interactions in Regression

May 17th, 2010

In a previous post, Interpreting Interactions in Regression, I said the following:

In our example, once we add the interaction term, our model looks like:

Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun

Adding the interaction term changed the values of B1 and B2. The effect of Bacteria on Height is now 4.2 + 3.2*Sun. For plants in partial sun, Sun = 0, so the effect of Bacteria is 4.2 + 3.2*0 = 4.2. So for two plants in partial sun, a plant with 1000 more bacteria/ml in the soil would be expected to be 4.2 cm taller than a Read the rest of this entry »
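
The arithmetic in the quoted passage can be checked with a short sketch. The coefficients come from the model above. The function names are made up for the example, and coding Sun = 1 as the other light condition (presumably full sun) is an assumption, since the excerpt only states that Sun = 0 means partial sun.

```python
def predicted_height(bacteria, sun):
    """Predicted Height from the fitted model quoted above.

    bacteria: soil bacteria in units of 1000/ml
    sun: 0 = partial sun; 1 = the other condition (assumed to be full sun)
    """
    return 35 + 4.2 * bacteria + 9 * sun + 3.2 * bacteria * sun

def bacteria_effect(sun):
    """Change in predicted Height per 1-unit increase in Bacteria at a given Sun value."""
    return 4.2 + 3.2 * sun

for sun in (0, 1):
    # Difference between two plants that differ by one unit of Bacteria
    diff = predicted_height(2, sun) - predicted_height(1, sun)
    print(sun, round(bacteria_effect(sun), 1), round(diff, 1))
```

The difference between two plants one unit of Bacteria apart matches 4.2 + 3.2*Sun at each value of Sun, which is what it means for the effect of Bacteria to depend on Sun.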

Modeling Whether or When an Event Occurs: Event History Analysis

May 13th, 2010

There are many types of outcome variables that don’t work in linear models (specifically, OLS regression and ANOVA models), but look like they should.

They include discrete counts; truncated or censored variables, where part of the distribution is cut off or measured only up to a certain point; and bounded variables, like proportions and percentages.

This article outlines a particular type of outcome variable: one that measures whether or when an event occurs. They are typically called Read the rest of this entry »

Quiz Yourself about Missing Data

May 3rd, 2010

Do you find quizzes irresistible?  I do.

Here’s a little quiz about working with missing data:

True or False?

1. Imputation is really just making up data to artificially inflate results.  It’s better to just drop cases with missing data than to impute.

2. I can just impute the mean for any missing data.  It won’t affect results, and improves power.

3. Multiple Imputation is fine for the predictor variables in a statistical model, but not for the response variable.

4. Multiple Imputation is always the best way to deal with missing data.

5. When imputing, it’s important that the imputations be plausible data points.

6. Missing data isn’t really a problem if I’m just doing simple statistics, like chi-squares and t-tests.

7. The worst thing that missing data does is lower sample size and reduce power.

Answers: Read the rest of this entry »

Answers to the Missing Data Quiz

May 3rd, 2010

In my last post, I gave a little quiz about missing data.  This post has the answers.

If you want to try it yourself before you see the answers, go here. (It’s a short quiz, but if you’re like me, you find testing yourself irresistible).

True or False?

1. Imputation is really just making up data to artificially inflate results.  It’s better to just drop cases with missing data than to impute.

Answer: False!

Imputation has gotten a bad rap because early imputation methods, like mean imputation, bias your results pretty badly.  And single imputation underestimates standard errors.

But imputation has come a long way, baby!

Multiple imputation, when done well, gives pretty much the same unbiased results, with full power, as the full non-missing data set.

2. I can just impute the mean for any missing data.  It won’t affect results, and improves power.

Answer: False!

As I just said, mean imputation is bad imputation.  It does improve power, but your results will be so biased, the improved power won’t help much.  Sure, your results might be significant, but they’re the wrong results!
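
One way to see part of the problem is with a tiny made-up data set. This is a sketch, and the numbers are hypothetical. Filling gaps with the mean leaves the mean where it was but piles extra cases at the center, so the spread (and anything computed from it, like standard errors) comes out too small.

```python
from statistics import mean, stdev

observed = [2, 4, 6, 8]   # hypothetical values that were actually measured
n_missing = 2             # two more cases had no value recorded

# Mean imputation: every missing case gets the observed mean
imputed = observed + [mean(observed)] * n_missing

sd_observed = stdev(observed)
sd_imputed = stdev(imputed)

print(mean(observed), mean(imputed))              # the mean is unchanged
print(round(sd_observed, 2), round(sd_imputed, 2))  # the spread shrinks
```

The sample now looks larger and less variable than the data support, which is how results can become significant and still be wrong.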

3. Multiple Imputation is fine for the predictor variables in a statistical model, but not for the response variable.

Answer: False!

It’s true that imputing the response doesn’t add any new information to your regression model.  But if you have missing data in the predictors as well, simultaneously imputing both response and predictors improves those predictor imputations.

4. Multiple Imputation is always the best way to deal with missing data.

Answer: False!

It often is, and it gives good results.  But it’s not always easy to do well, and it is a large-sample technique.

If you’re running a linear or log-linear model (like a regression or linear mixed model), maximum likelihood techniques give the same great, unbiased, uninflated, full-power results that multiple imputation does.

But you don’t have to spend the time and resources imputing anything.

5. When imputing, it’s important that the imputations be plausible data points.

Answer: False!

It’s counter-intuitive, but it’s not actually important that imputations be plausible data points.  The important thing when imputing is that your parameter estimates–your means, regression coefficients, or whatever it is you’re using this data to estimate–be accurate.  Not the imputed data itself.

There are a number of situations, like imputing categorical data, where you actually get better parameter estimates when the imputed data itself aren’t plausible values.

6. Missing data isn’t really a problem if I’m just doing simple statistics, like chi-squares and t-tests.

Answer: False!

It’s not the analysis you’re doing, but the percent, pattern, and randomness of the missing data that determine how problematic missing data are.

Even simple statistics need to be accurate and unbiased.  How important is it that your results are correct?

7. The worst thing that missing data does is lower sample size and reduce power.

Answer: False!

The loss of power from listwise deletion–the default in most software–can be quite devastating.
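
A back-of-the-envelope sketch shows how quickly listwise deletion can eat a sample. It assumes, purely for illustration, that each variable is missing independently at the same rate. Real missingness is rarely that tidy, and the function name is made up.

```python
def complete_case_share(missing_rate, n_variables):
    """Expected share of cases with no missing value on any variable,
    assuming each variable is missing independently at the same rate."""
    return (1 - missing_rate) ** n_variables

# 5% missing on each variable, for models with more and more variables
for k in (1, 5, 10, 20):
    print(k, round(complete_case_share(0.05, k), 2))
```

Under these assumptions, a modest 5% missing rate per variable leaves only about 60% of cases complete once ten variables are in the model.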

But even worse are the other two effects of missing data: biased parameter estimates and biased standard errors.  They, in essence, make your results, including p-values, wrong.

And they’re worse than low power because you can’t tell they’re wrong.  If you lose half your sample and have no significant results, you notice.  If the regression coefficients or standard errors aren’t what they’re supposed to be, there’s no way to tell.

That makes it worse in my book.

—————————————————————————————————–

How did you do?  (BTW, it took me years of seminars, reading, and trying things out to figure this all out).

But that’s the reason I developed the Effectively Dealing with Missing Data Without Biasing Your Results workshop. So you don’t have to scrape it all together, like I did.

It starts in a few days, on May 6th. We’ll go over these topics, and more, step-by-step.  By the end of the workshop, you’ll know when and how to impute well, how and when to use maximum likelihood techniques, and when simple, traditional techniques like listwise deletion work just fine.

It’s a web-based workshop, so you can join us from anywhere.  And we offer student and non-profit discounts.

Get the details and register here.

Great Resources for Your Literature Review

April 30th, 2010

by Ursula Saqui, Ph.D.

This is the second post of a two-part series on the overall process of doing a literature review.  Part one discussed the benefits of doing a literature review, how to get started, and knowing when to stop.

You have made a commitment to do a literature review, have the purpose defined, and are ready to get started.

Where do you find your resources?

If you are not in academia, do not have access to a top-notch library, and do not receive the industry publications of interest, you may need to get creative if you do not want to pay for each article. (In a pinch, I have paid up to $36 for an article, which can add up if you are conducting a comprehensive literature review!)

Here is where the internet and other community resources can be your best friends.

  • Know the difference between Google and Google Scholar. Google is helpful for popular mainstream publications whereas Google Scholar focuses only on scholarly references such as articles, theses, books, abstracts, and court opinions that are written by academics and other professional scholars.
  • ResearchGATE is an example of a collaborative scientific community that indexes articles. Many times you can find the full text of articles at no charge.
  • Your state may offer access to different databases for its residents. For example, in my home state of Indiana, residents have access to Inspire, a collection of resources, databases, and government publications. Click here to see if your state offers a similar resource.
  • Check your local community library. They may not have the resources you need but they can often get them through inter-library loan. For example, my local community library does not carry advanced statistics books but the librarians can get them for me via their borrowing privileges with universities.
  • Even without access to a specific database, you can search thousands of government-sponsored research reports produced by the U.S. government or one of its affiliates. For example, in completing a literature review of service learning programs, I found a government report that summarized 10 years of research in service learning. (That made my day!)
  • Private foundations or research companies may also conduct high-quality peer-reviewed research. For example, the Robert Wood Johnson Foundation conducts and disseminates research on issues related to health and health care.
  • If you know who authored the article, you can sometimes find a pdf file of their article on their website or university website listed under their vita or recent publications.
  • Try to contact the author directly. When I have contacted authors, they have graciously sent me a complimentary copy of their article.

Still stuck?  Hire someone who knows how to do a good literature review and has access to quality resources.

On a budget?  Hire a student who has access to an academic library.  Many times students can get credit for working on research and business projects through internships or experiential learning programs. This situation is a win-win.  You get the information you need and the student gets academic credit along with exposure to new ideas and topics.

About the Author: With expertise in human behavior and research, Ursula Saqui, Ph.D. gives clarity and direction to her clients’ projects, which inevitably lead to better results and strategies. She is the founder of Saqui Research.


The Literature Review: The Foundation of Any Successful Research Project

April 23rd, 2010

by Ursula Saqui, Ph.D.

This post is the first of a two-part series on the overall process of doing a literature review.  Part two will cover where to find your resources.

Would you build your house without a foundation?  Of course not!  However, many people skip the first step of any empirically based project: conducting a literature review.  Like the foundation of your house, the literature review is the foundation of your project.

Having a strong literature review gives structure to your research method and informs your statistical analysis.  If your literature review is weak or non-existent, Read the rest of this entry »