Free Webinar: Understanding Mediation and Path Analysis

January 26th, 2010

The next Craft of Statistical Analysis Webinar* is tomorrow: Understanding Mediation and Path Analysis

Path Analysis is a system of regression equations used to determine if a third variable (a mediator) is driving the relationship between an independent and dependent variable. It is one of the simplest forms of structural equation models (SEM), but you don’t need specialized SEM software to run it.

This webinar will give an overview of the concepts, terminology, and steps involved in detecting mediation using three regression equations. We’ll also cover the difference between mediators, control variables, and moderators. They’re all different!
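To make the "three regression equations" idea concrete, here is a minimal sketch in Python with NumPy. It assumes the three equations are the common ones (Y on X, M on X, and Y on X and M together); the announcement doesn't spell them out, and the data, variable names, and effect sizes below are all made up for illustration.

```python
import numpy as np

def ols(y, *predictors):
    """Return OLS coefficients [intercept, slope_1, ...] for y on the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Simulated (made-up) data in which X affects M, and M affects Y.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
m = 0.6 * x + rng.normal(size=n)
y = 0.5 * m + 0.1 * x + rng.normal(size=n)

c_total = ols(y, x)[1]           # Equation 1: Y on X (total effect of X)
a = ols(m, x)[1]                 # Equation 2: M on X (path a)
_, c_direct, b = ols(y, x, m)    # Equation 3: Y on X and M (direct effect, path b)

indirect = a * b                 # the part of X's effect that runs through M
print(f"total={c_total:.3f} direct={c_direct:.3f} indirect={indirect:.3f}")
```

In ordinary least squares fit on the same sample, the total effect equals the direct effect plus the indirect effect, so a direct effect that is much smaller than the total effect is what you'd expect to see when M mediates the relationship. Testing whether that drop is significant is a separate step that this sketch leaves out.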

Date: Wednesday, January 27, 2010

Time: 1pm Eastern Time (12pm Central, 11am Mountain, 10am Pacific)

Where: Anywhere you have a fast internet connection

Length of Program: An Hour

Cost: Always FREE

Register at: http://www.analysisfactor.com/learning/webinar14.html

*What’s a Craft of Statistical Analysis Webinar? It’s a regular webinar series for researchers to help you hone the craft of statistical analysis. Each webinar is about a single statistical topic that is often confusing, misunderstood, or not well known to researchers. Check it out and pass the word along. They’re free!



Statistical Workshop Announcements: Complex Surveys, Hierarchical Models, Survival Analysis, Categorical Data Analysis, and Factor Analysis

January 22nd, 2010

The announcements have begun for statistical workshops this summer.  Here’s the first.*

2010 Summer Quantitative Method Series at Portland State University

June 11-12. Secondary Data and Complex Survey Design, Clyde Dent & Nathalie Huguet.
June 14-15. Hierarchical Linear Models and Their Applications, Jason T. Newsom.
June 16-17. Introduction to Survival Analysis with SPSS, Jong-Sung Kim.
June 18-19. Categorical Data Analysis for Social Science, Hyeyoung Woo.
June 21-22. Introduction to Factor Analysis and Structural Equation Modeling, Mo Wang.

More information and online registration:
http://www.upa.pdx.edu/IOA/newsom/SQMS/

This series consists of two-day courses on data analysis taught by nationally recognized methodological experts. Course descriptions and more information about instructors can be found at the website.

The goal of the Series is to provide additional statistical and methodological training for research professionals from either the private or public sector. Although course credit is not available, graduate students are welcome and offered a discounted fee.

Participants may enroll in courses separately or in combination.

Each course takes an applied perspective with special attention given to when and how to implement each technique. Statistical, mathematical, and conceptual foundations will be included with the objective of providing a solid introduction to each area. All courses will provide extensive software illustrations, and, unless otherwise specified, will provide computer lab time where participants have one-on-one assistance available when running computer examples. Some graduate-level coursework in statistics (social science departments or otherwise) and some experience with one or more statistical software packages are usually assumed.

Individual courses may require additional prerequisite knowledge as indicated, however.

All classes will be held at the Portland State University campus, located in beautiful downtown Portland, OR. The campus is within easy walking distance of many local restaurants and attractions such as the Portland Art Museum, the Portland Farmer’s Market, brewpubs, and wine bars.

Early registration deadline is June 1, 2010.

*Karen here: I’m happy to pass along announcements of any workshops that I think may be of interest to my readers.

Disclaimer: This is not an endorsement (I don’t know these people), and I don’t get any kickbacks. I’m just spreading the news.  My opinion is you can’t have too much statistics learning.



What Makes a Statistical Analysis Wrong?

January 21st, 2010

One of the most anxiety-laden questions I get from researchers is whether their analysis is “right.” I’m always slightly uncomfortable with that word because often there is no one right analysis.

It’s like finding Mr. or Ms. Right—most of the time, there is not just one Right. But there are many that are clearly Wrong.

Luckily, what makes an analysis right for your research is more easily defined than what makes a person right for you. It pretty much comes down to two things: whether the assumptions of the statistical method are being met and whether the analysis answers the research question.

Assumptions are very important. A test needs to reflect the scale of the variables, the study design, and issues in the data. A repeated measures study design requires a repeated measures analysis. A binary dependent variable requires a categorical analysis method.

But within those general categories, there are often many analyses that meet assumptions. A logistic regression or a chi-square test can both handle a binary dependent variable if there is only a single categorical predictor. But a logistic regression can also incorporate covariates, directly test interactions, and calculate predicted probabilities. A chi-square test can do none of these.
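Here is a small sketch of how the two approaches overlap in the simplest case. It uses plain Python and a made-up 2x2 table; the counts and the row/column layout are assumptions for illustration only. With a single binary predictor, the logistic regression slope is the log of the odds ratio and the predicted probabilities are the observed group proportions, so both can be read straight off the table without fitting anything.

```python
import math

# Hypothetical 2x2 table: rows are the two predictor groups,
# columns are counts of outcome = yes / outcome = no.
table = [[30, 20],   # group 1
         [15, 35]]   # group 0

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2) for j in range(2)
)

# What a logistic regression with this one binary predictor would report:
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
logit_slope = math.log(odds_ratio)                 # coefficient for the group dummy
predicted = [row[0] / total for row, total in zip(table, row_totals)]

print(round(chi_square, 3), odds_ratio, round(logit_slope, 3), predicted)
```

The chi-square statistic only says whether the two variables are associated. The logistic regression pieces describe the size and direction of that association, and the model can be extended with covariates and interactions, which this sketch does not attempt.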

So you get different information from different tests. They answer different research questions.

An analysis that is correct from an assumptions point of view is totally useless if it doesn’t answer the research question. A data set can spawn an endless number of statistical tests (and you can spend an endless number of days running them) that don’t answer the research question. And the real bummer is it’s not always clear that the analyses aren’t relevant until you sit down to write up the research paper.

That’s why writing out the research questions in theoretical and operational terms is the first step of any statistical analysis. It’s absolutely fundamental. And I mean writing them in minute detail. Issues of mediation, interaction, subsetting, control variables, et cetera, should all be blatantly obvious in the research questions.

The part on writing results sections in Daryl Bem’s chapter “Writing the Empirical Journal Article” is an excellent resource for planning a data analysis. It contains the best examples I’ve ever seen on how to write testable research questions. Thinking about how to write results before solidifying the research questions ensures the analysis is able to answer the questions. Whether the answer is what you expected or not is a different issue.

So when you are concerned about getting an analysis “right,” clearly define the design, variables, and data, but most importantly, get explicitly clear about what you want to learn from this analysis. Once you’ve done this, it’s much easier to find the statistical method that answers the research questions and meets assumptions.



The Distribution of Independent Variables in Regression Models

January 19th, 2010

While there are a number of distributional assumptions in regression models, there are no assumptions at all about the distribution of the predictor (i.e., independent) variables.

It’s because regression models are directional. In a correlation, there is no direction–Y and X are interchangeable. If you switched them, you’d get the same correlation coefficient.
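A quick way to see the difference is to compute both quantities in each direction. This is a minimal sketch in plain Python with made-up numbers; the helper names are just for illustration.

```python
from statistics import mean

def covariance(u, v):
    """Sample covariance of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def correlation(u, v):
    return covariance(u, v) / (covariance(u, u) * covariance(v, v)) ** 0.5

def slope(outcome, predictor):
    """Least-squares slope from regressing `outcome` on `predictor`."""
    return covariance(outcome, predictor) / covariance(predictor, predictor)

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

print("r(x, y) =", correlation(x, y), " r(y, x) =", correlation(y, x))
print("slope of y on x =", slope(y, x), " slope of x on y =", slope(x, y))
```

The correlation comes out the same either way, but the two slopes differ, because each regression minimizes errors in a different variable.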

But regression is inherently a model about the outcome variable. What predicts its value and how well? The nature of how predictors relate to it Read the rest of this entry »

Answers to the Interpreting Regression Coefficients Quiz

January 16th, 2010

Yesterday I gave a little quiz about interpreting regression coefficients.  Today I’m giving you the answers.

If you want to try it yourself before you see the answers, go here.  (It’s truly little, but if you’re like me, you just cannot resist testing yourself).

True or False?

1. When you add an interaction to a regression model, you can still evaluate the main effects of the terms that make up the interaction, just like in ANOVA.

Answer: False!

In an ANOVA table (even the one in the regression output), categorical variables are Effect Coded.  Because of that, the main effects remain main effects, and are evaluated independent of interactions.

But in the Regression Coefficients table, unless you are explicitly effect coding, they will be Dummy Coded.  The coefficient for what looks like a main effect IS NOT a main effect.  It’s a conditional effect (also called a simple effect): the effect of that predictor ONLY when the other predictor in the interaction = 0!  I kid you not.
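If you want to see this with numbers, here is a minimal sketch using NumPy and four made-up cell means from a 2x2 design. The factor names A and B and the values are hypothetical; the same model with an interaction is fit twice, once with 0/1 dummy codes and once with -1/+1 effect codes.

```python
import numpy as np

# Hypothetical cell means for a 2x2 design: (A, B, mean of Y).
cells = [(0, 0, 10.0), (1, 0, 14.0), (0, 1, 12.0), (1, 1, 22.0)]

def fit(code):
    """Fit Y = b0 + b1*A + b2*B + b3*A*B using the given coding of the 0/1 levels."""
    X = np.array(
        [[1, code(a), code(b), code(a) * code(b)] for a, b, _ in cells],
        dtype=float,
    )
    y = np.array([m for _, _, m in cells])
    return np.linalg.solve(X, y)   # four cells, four coefficients: exact fit

dummy = fit(lambda level: level)            # 0/1 dummy coding
effect = fit(lambda level: 2 * level - 1)   # -1/+1 effect coding

print("dummy-coded coefficient for A: ", dummy[1])    # 14 - 10 = 4, the effect of A when B = 0
print("effect-coded coefficient for A:", effect[1])   # 3.5, half the average A difference of 7
```

The dummy-coded coefficient for A only describes the B = 0 row, while the effect-coded one averages over both levels of B, which is what an ANOVA main effect does.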

You can get a little more info in this post or a lot more in this video or a whole lot more in the workshop.

2. The intercept is usually meaningless in a regression model.

Answer: False!

This statement is only true if all predictors are continuous and 0 falls outside the range of their observed values.  If continuous predictors are centered and/or if there are dummy variables in the model, the intercept is meaningful and important.
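As a small illustration of what centering does to the intercept, here is a sketch in plain Python with one made-up predictor (age) and outcome (score). The numbers and names are hypothetical.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) for the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

# Hypothetical data: age in years and some outcome score.
age = [25, 32, 41, 47, 53, 60]
score = [12.0, 15.5, 18.0, 21.5, 22.0, 26.0]

raw_intercept, raw_slope = simple_ols(age, score)

# Center age at its mean, then refit.
age_centered = [a - mean(age) for a in age]
centered_intercept, centered_slope = simple_ols(age_centered, score)

print("uncentered intercept (predicted score at age 0):", raw_intercept)
print("centered intercept (predicted score at mean age):", centered_intercept)
```

The slope doesn't change, but the uncentered intercept is a prediction for a newborn, far outside the data, while the centered intercept is the predicted score at the average age.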

3. In Analysis of Covariance, the covariate is a nuisance variable, and the real point of the analysis is to evaluate the means after controlling for the covariate.

Answer: False!

It can be true, but it doesn’t have to be.  Covariates are often important predictors that just happen to be observed and continuous.  The only way to evaluate them is to examine their coefficients.

4. Standardized regression coefficients are meaningful for dummy-coded predictors.

Answer: False!

This one is never ever true.  Just because your software lets you get away with it doesn’t mean it’s meaningful.

5. The only way to evaluate an interaction between two independent variables is to categorize one or both of them.

Answer: False!

Sure, it’s tricky to interpret interactions between two continuous variables, but by no means is it impossible or theoretically incorrect.  (And centering really helps).

—————————————————————————————————–

How did you do?  (BTW, it took me years to figure all this stuff out in a way that was really intuitive, even after many stats classes).

But this is why I developed the Interpreting (Even Tricky) Regression Coefficients workshop. It starts on January 19th. We’ll go over these topics, and more, step-by-step.

Get the details and register here.



Interpreting (Even Tricky) Regression Coefficients Workshop

January 15th, 2010

Here’s a little quiz:

True or False?

1. When you add an interaction to a regression model, you can still evaluate the main effects of the terms that make up the interaction, just like in ANOVA.

2. The intercept is usually meaningless in a regression model.

3. In Analysis of Covariance, the covariate is a nuisance variable, and the real point of the analysis is to evaluate the means after controlling for the covariate.

4. Standardized regression coefficients are meaningful for dummy-coded predictors.

5. The only way to evaluate an interaction between two independent variables is to categorize one or both of them.

Answers: Read the rest of this entry »

Making Dummy Codes Easy to Keep Track of

January 14th, 2010

Here’s a little tip.

When you construct Dummy Variables, make it easy on yourself  to remember which code is which.  Heck, if you want to be really nice, make it easy for anyone else who will analyze the data or read the results.

Make the codes inherent in the Dummy variable name.

So instead of a variable named Gender with values of 1=Female and 0=Male, call the variable Female.

Instead of a set of dummy variables named MaritalStatus1 with values of 1=Married and 0=Single, along with MaritalStatus2 with values 1=Divorced and 0=Single, name the same variables Married and Divorced.

And if you’re new to dummy coding, this has the extra bonus of making the dummy coding intuitive.  It’s just a set of yes/no variables about all but one of your categories.
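Here is a minimal sketch of the tip in plain Python, using the marital status example. The data values are made up, and treating Single as the reference category is an assumption carried over from the example above.

```python
# Hypothetical raw data: one marital status per respondent.
marital_status = ["Married", "Single", "Divorced", "Single", "Married"]

reference = "Single"  # the category that gets no dummy variable of its own
categories = sorted(set(marital_status) - {reference})

# One yes/no (1/0) variable per non-reference category, named after the category.
dummies = {
    name: [1 if status == name else 0 for status in marital_status]
    for name in categories
}

for name, values in dummies.items():
    print(name, values)
```

Anyone reading the output can tell that a 1 on Married means married, without looking up a codebook.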

————————————————————————————————-

Interpreting Regression Coefficients in Models other than Ordinary Linear Regression

January 5th, 2010

Someone who registered for my upcoming Interpreting (Even Tricky) Regression Coefficients workshop asked if the content applies to logistic regression as well.

The short answer: Yes

The long-winded detailed explanation of why this is true and the one caveat:

One of the greatest things about regression models is that they all have the same setup:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

The left hand side is the response, Y.  It gets an i subscript because each individual has its own value of Y.

The Xs are the predictor variables.  And the βs, the coefficients, tell you about how each X relates to Y, in the context of the presence of the other predictors.  This is the part we really want to find out.

The residuals, ε, are the part of Y that doesn’t relate to the Xs.  They’re important to the model because if we misrepresent how they behave, it means we are also misrepresenting the βs.  But as long as we get a picture of their behavior right, we can make good inferences about how Y relates to the Xs.

So this is the actual model for an ordinary least squares linear regression.  The left hand side of the equation is just Y, and ε, the error term, has a normal distribution.

For other types of regression models, like logistic regression, Poisson regression, or multilevel models, all the βs and Xs stay the same.  The only parts that can differ:

1. Instead of Y on the left, there can be a function of Y–a non-linear transformation.

2. Instead of a normal distribution, the residuals can have another distribution.

So for example, in a logistic regression, the function is the logit (a.k.a. log-odds) of the probability that Y = 1, and the distribution is binomial rather than normal.

And in a multilevel model, there is no special transformation of Y, but the residual gets split into two pieces, both of which are normally distributed.

But as I said, the βs and the Xs don’t change.

The interpretation of each coefficient (at least the trickiest parts, which have to do with the Xs) is about two things:

1. the structure of X

2. what other Xs are in the model.

So interpreting coefficients is done the same basic way regardless of how Y and ε behave.  And that’s what we cover in the majority of the workshop–centering, dummy variables, interactions, correlated predictors, etc.

The one caveat:

But logistic regression also involves the transformation of the dependent variable, so there is an extra step involved in interpreting logistic regression coefficients.  So you still need to understand the centering, dummy variables, etc., but you need to understand the logit transformation as well.

This is true whenever you have a transformation of Y, whether it’s done for a non-normal response variable or whether it’s done to correct non-constant variance or skewed residuals.
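For the logit case specifically, that extra step looks something like the sketch below. It is plain Python, and the intercept and slope are made-up values, not output from any real model.

```python
import math

def inv_logit(log_odds):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-log_odds))

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Hypothetical fitted logistic regression: logit(P(Y = 1)) = b0 + b1 * X
b0, b1 = -1.0, 0.7

odds_ratio = math.exp(b1)     # multiplicative change in the odds per one-unit increase in X
p_at_0 = inv_logit(b0)        # predicted probability when X = 0
p_at_1 = inv_logit(b0 + b1)   # predicted probability when X = 1

print("odds ratio:", odds_ratio)
print("P(Y=1 | X=0):", p_at_0, " P(Y=1 | X=1):", p_at_1)
```

The coefficient itself is on the log-odds scale. Exponentiating it gives the odds ratio, and back-transforming the whole linear predictor gives a probability, which is why what X = 0 means (centering, dummy coding) still matters here.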

I had not planned to, but will briefly add interpreting transformed Y’s into the workshop.  It won’t be specifically about logistic regression, but generally about transformations.  Because Y is important too.

And if you want more information about the logit transformation and what it does to all types of logistic regression coefficients, I did a webinar last year on interpreting odds ratios in logistic regression.  You can download the webinar video for free.

So you still need to understand dummy variables, centering, correlated predictors, and all that tricky stuff to interpret the odds ratios.  Which is why I recommend learning the tricky stuff in the context of the (relatively) simple linear model before tackling more complicated models.



Confusing Statistical Terms #4: Hierarchical Regression vs. Hierarchical Model

December 21st, 2009

This one is relatively simple.  Very similar names for two totally different concepts.

Hierarchical Models (aka Hierarchical Linear Models or HLM) are a type of linear regression model in which the observations fall into hierarchical, or completely nested, levels.

Hierarchical Models are a type of Multilevel Model.

So what is a hierarchical data structure, which requires a hierarchical model?

The classic example is data from children nested within schools.  The dependent variable could be something like math scores, and the predictors a whole host of things measured about the child as well as the school.  Child-level predictors could be things like GPA, grade, and gender, and school-level predictors could be things like total enrollment, private vs. public, and mean SES.

Because multiple children are measured from the same school, their measurements are not independent.  Hierarchical modeling takes that into account.

Hierarchical regression is the practice of building successive linear regression models, each adding more predictors.

For example, one common practice is to start by adding only demographic control variables to the model in one step.   In the next model, you can add predictors of interest, to see if they predict the DV above and beyond the effect of the controls.

You’re actually building separate but related models in each step.  But SPSS has a nice function where it will compare the models, and actually test if successive models fit better than previous ones.

So hierarchical regression is really a series of regular old OLS regression models–nothing fancy, really.
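To show how plain this really is, here is a minimal two-step sketch in Python with NumPy. The data are simulated, the variable names (age, income, stress) are hypothetical, and the comparison uses the usual F statistic for a change in R-squared between nested models; I'm assuming that is the kind of comparison meant above, not reproducing any particular SPSS output.

```python
import numpy as np

def r_squared(y, *predictors):
    """R-squared from an OLS regression of y on the predictors (with intercept)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    total = y - y.mean()
    return 1 - (resid @ resid) / (total @ total)

# Simulated (made-up) data.
rng = np.random.default_rng(1)
n = 200
age = rng.normal(size=n)       # hypothetical demographic control
income = rng.normal(size=n)    # hypothetical demographic control
stress = rng.normal(size=n)    # hypothetical predictor of interest
y = 0.3 * age + 0.2 * income + 0.5 * stress + rng.normal(size=n)

r2_step1 = r_squared(y, age, income)            # Step 1: controls only
r2_step2 = r_squared(y, age, income, stress)    # Step 2: controls + predictor of interest

added = 1                 # number of predictors added in step 2
df_resid = n - 3 - 1      # n minus predictors in the step 2 model, minus the intercept
f_change = ((r2_step2 - r2_step1) / added) / ((1 - r2_step2) / df_resid)

print(f"R2 step 1 = {r2_step1:.3f}, R2 step 2 = {r2_step2:.3f}, F change = {f_change:.2f}")
```

Each step is just an ordinary regression. The only extra piece is comparing how much variance the second model explains beyond the first.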

Confusing Statistical Terms #1: Independent Variable

Confusing Statistical Terms #2: Alpha and Beta

Confusing Statistical Terms #3: Levels


Confusing Statistical Terms #2: Alpha and Beta

December 11th, 2009

Oh so many years ago I had my first insight into just how ridiculously confusing all the statistical terminology can be for novices.

I was TAing a two-semester applied statistics class for graduate students in biology.  It started with basic hypothesis testing and went on through to multiple regression.

It was a cross-listed class, meaning there were a handful of courageous (or masochistic) undergrads in the class, and they were having trouble keeping up with the ambitious graduate-level pace.

I remember this one day in particular in the discussion section I was leading when one of the poor undergrads was hopelessly lost.  We were talking about the simple regression coefficient (beta) and the intercept (which the text we were using chose to call alpha, instead of the more familiar beta-naught).

It was only after repeated probing that I realized she was logically trying to fit it into the concepts of alpha and beta that we had already taught her–Type I and Type II errors in hypothesis testing.

Entirely. Different. Concepts.

Once I realized the source of the error, I was able to explain that we were using the same terminology for entirely different concepts.

But as it turns out, there are even more meanings of both alpha and beta.   Here they are:

Hypothesis testing

As I already mentioned, the definitions of alpha and beta that most learners of statistics come to first are about hypothesis testing.

Alpha is the probability of Type I error in any hypothesis test–incorrectly claiming statistical significance.

Beta is the probability of Type II error in any hypothesis test–incorrectly concluding no statistical significance.  (1 - Beta is power).
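Here is a small sketch of how alpha, beta, and power fit together, using only Python's standard library. It assumes a one-sided z-test with a known standard deviation, and the effect size and sample size are made up; none of those specifics come from the definitions above.

```python
from statistics import NormalDist

def power_one_sided_z(effect, n, alpha=0.05, sigma=1.0):
    """Power of a one-sided z-test of H0: mean = 0 when the true mean is `effect`."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)      # reject H0 when the z statistic exceeds this
    shift = effect * n ** 0.5 / sigma    # where the z statistic is centered under the true mean
    return 1 - z.cdf(critical - shift)

alpha = 0.05                                             # chosen Type I error rate
power = power_one_sided_z(effect=0.5, n=30, alpha=alpha)
beta = 1 - power                                         # Type II error rate for this effect and n

print(f"alpha = {alpha}, power = {power:.3f}, beta = {beta:.3f}")
```

Alpha is set by the researcher, while beta depends on alpha, the sample size, and how big the true effect is. When the true effect is zero, the rejection rate from this function is just alpha.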

Regression coefficients

In most textbooks and software packages, the population regression coefficients are denoted by beta.  Like all population parameters, they are theoretical–we don’t know what they are.  The regression coefficients we estimate from our sample are statistical estimates of those parameter values.  Most parameters are denoted with Greek letters and statistics with the corresponding Latin letters.

Most texts refer to the intercept as β0 (beta-naught–and yes, that’s the closest I can get to a subscript)  and every other regression coefficient as β1, β2, β3, etc.  But as I already mentioned, some statistics texts will refer to the intercept as alpha, to distinguish it from the other coefficients.

Standardized Regression Coefficients

But, for some reason, SPSS labels standardized regression coefficient estimates as Beta.  Despite the fact that they are statistics–measured on the sample, not the population.

More confusion.

And I can’t verify this, but I vaguely recall that Systat uses the same term.  If you have Systat and can verify or negate this claim, feel free to do so in the comments.

Cronbach’s alpha

Another, completely separate use of alpha is Cronbach’s alpha, aka Coefficient Alpha, which measures the reliability of a scale.  It’s a very useful little statistic, but should not be confused with any of the other uses of alpha.
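For reference, here is a minimal sketch of the standard formula for Cronbach's alpha in plain Python: k / (k - 1) times one minus the ratio of the summed item variances to the variance of the total score. The three items and their responses are made up.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per scale item, with respondents in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]    # each respondent's total score
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical responses from five people to a three-item scale.
item1 = [2, 4, 3, 5, 4]
item2 = [3, 5, 3, 4, 4]
item3 = [2, 5, 4, 5, 3]

alpha = cronbach_alpha([item1, item2, item3])
print("Cronbach's alpha:", round(alpha, 3))
```

The more the items move together across respondents, the larger the variance of the total relative to the item variances, and the closer alpha gets to 1.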

