Posts Tagged ‘Fixed Factors’

Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 2

Tuesday, March 31st, 2009

Yesterday’s post outlined one issue in deciding whether to put a categorical predictor variable into Fixed Factors or Covariates in SPSS GLM.  That issue dealt with how SPSS automatically creates dummy variables out of any variable in Fixed Factors.

Another default to keep in mind is that SPSS will automatically create interactions between any and all variables in Fixed Factors.  If you put 5 variables in Fixed Factors, you’ll get all 2-way, 3-way, 4-way, and even a 5-way interaction among those 5 variables. (more…)
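To see how quickly a full factorial model grows, here is a minimal Python sketch that just counts the interaction terms among five factors. The factor names A through E are made up for illustration, and this is plain counting, not SPSS syntax.

```python
from itertools import combinations

# Hypothetical factor names; any five categorical predictors would do.
factors = ["A", "B", "C", "D", "E"]

# A full factorial model includes every interaction of order 2 through 5.
interactions = {
    order: ["*".join(combo) for combo in combinations(factors, order)]
    for order in range(2, len(factors) + 1)
}

for order, terms in interactions.items():
    print(f"{order}-way interactions: {len(terms)}")

total_interactions = sum(len(terms) for terms in interactions.values())
print("Total interaction terms:", total_interactions)
```

With five factors that comes to 10 two-way, 10 three-way, 5 four-way, and 1 five-way interaction: 26 interaction terms on top of the 5 main effects.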

Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 1

Monday, March 30th, 2009

If you have a categorical variable that you plan to use in a regression analysis in SPSS, there are a couple ways to do it. You can use the SPSS Regression procedure, which I will talk about more in another post.  Or you can use SPSS GLM, which I discuss here, and in a  follow-up post.

The big question in SPSS GLM is what goes where.  As I’ve detailed in another post, any continuous independent variable goes into covariates.  And don’t use random factors at all unless you really know what you’re doing.

So the question is what to do with your categorical variables.  You have two choices, and each has advantages and disadvantages.

The easiest is to put categorical variables in Fixed Factors.  SPSS will dummy code those variables for you, which is quite convenient if your categorical variable has more than two categories.  However, there are some defaults you need to be aware of that may or may not make this a good choice.

SPSS GLM always makes the reference group the category that sorts last.  If the values you input are strings, that is the one that comes last alphabetically.  If those values are numbers, it is the highest one.

In some studies it really doesn’t matter which is the reference group.  But in others, interpreting regression coefficients will be a whole lot easier if you choose a group that makes a good comparison, such as a control group or the most common group in the data.  If you want that to be the reference, make it come last alphabetically.  I’ve been known to do things like change my data so that the control group becomes something like ZControl.  (But create a new variable–never overwrite original data).

It really can get confusing, though, if the variable was already dummy coded–if it already had values of 0 and 1.  Because 1 sorts after 0, SPSS will make the group coded 1 the reference group, so the coefficient in the output describes the group coded 0.  That is the reverse of what most people expect from a dummy variable.  It’s not impossible to interpret if you’re paying attention, but you do have to pay attention.
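The reference-group default described above can be imitated in a few lines of Python. This is only a sketch of the idea: `dummy_code` is a hypothetical helper, not anything SPSS provides, the group names are invented, and Python's sort order for strings (case-sensitive) may not match SPSS's in every case.

```python
def dummy_code(values):
    """Dummy code a categorical variable, treating the level that
    sorts last as the reference group (illustrative helper only)."""
    levels = sorted(set(values))
    reference = levels[-1]
    # One 0/1 indicator column for every level except the reference.
    columns = {
        level: [1 if v == level else 0 for v in values]
        for level in levels[:-1]
    }
    return reference, columns


# String categories: the level that comes last alphabetically is the reference.
groups = ["Control", "Drug A", "Drug B", "Control"]
ref, cols = dummy_code(groups)
print(ref)  # Drug B

# Renaming the control group (in a new variable) moves it to the end.
renamed = ["ZControl" if g == "Control" else g for g in groups]
ref2, cols2 = dummy_code(renamed)
print(ref2)  # ZControl

# A variable that is already 0/1: the group coded 1 becomes the reference,
# so the only indicator left is the one for the group coded 0.
already_coded = [0, 1, 1, 0]
ref3, cols3 = dummy_code(already_coded)
print(ref3, cols3)  # 1 {0: [1, 0, 0, 1]}
```

The last case is the one that trips people up: the indicator that survives flags the group coded 0, not the group coded 1.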

In tomorrow’s post I’ll discuss another default in SPSS that will affect your decision.

If you want more information on using and interpreting parameter estimates in regression using SPSS, get the recording from my free teleseminar: Interpreting Regression Coefficients: A Walk Through Output.

Editor’s Update 10/9/09: In just a few weeks, I’ll be offering a 3-hour workshop on the ins and outs of SPSS GLM.  We’ll cover the defaults, the menus and syntax, the meanings of all these terms, when you need each option, and what the results mean.  Get more info and register at: http://theanalysisinstitute.com/workshops/SPSS-GLM/index.html

SPSS GLM: Choosing Fixed Factors and Covariates

Tuesday, December 30th, 2008

The beauty of the Univariate GLM procedure in SPSS is that it is so flexible.  You can use it to analyze regressions, ANOVAs, ANCOVAs with all sorts of interactions, dummy coding, etc.

The downside of this flexibility is that it is often confusing what to put where and what it all means.

So here’s a quick breakdown.

The dependent variable I hope is pretty straightforward.  Put in your continuous dependent variable.

Fixed Factors are categorical independent variables.  It does not matter if the variable is (more…)

Confusing Statistical Terms #3: Levels of a Factor in Multilevel Models Measured at a Nominal Level

Friday, December 12th, 2008

It struck me today in answering a question that statisticians have not been very helpful to those trying to learn statistics in the way they name statistical terms.

I can think of other examples (how many totally different concepts does alpha refer to in statistics?), but the term I was using today was levels.

Specifically, there are multilevel models with two or more sources of random variation.  A two-level model has two sources of random variation, and can have predictors at each level.  A common example is where students are sampled within schools.  Predictors can be measured at the student level (e.g., gender, SES, age) or the school level (enrollment, % who go on to college).  The dependent variable has variation from student to student (level 1) and from school to school (level 2).

If a predictor is a fixed factor (meaning it is a categorical predictor), it can have two or more levels, meaning categories.  In ANOVA, factors (categorical independent variables) have 2 or more levels (2 or more categories).

Then we get to levels of measurement: nominal, ordinal, interval, ratio.  These levels refer to how much information a variable contains.  Does it indicate a category, indicate a quantity, etc?

So, a factor with 3 levels that is measured at level 2 of a model has a nominal level of measurement.

What, you’re not following me?  I wonder why…..

————————————————————————————————————–

Confusing Statistical Terms #1: Independent Variable

Confusing Statistical Terms #2: Alpha and Beta



Confusing Statistical Terms #1: The Many Names of Independent Variables

Monday, November 24th, 2008

Statistical models, such as general linear models (linear regression, ANOVA, mixed models) and generalized linear models (logistic, Poisson, proportional hazard regression, etc.), all have the same general form.  On the left side of the equation are one or more response variables, Y.  On the right hand side are one or more predictor variables, X, and their coefficients, B.  The variables on the right hand side can have many forms and are called by many names.
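For a single response variable, that general form can be sketched as follows. The intercept B_0, the subscripts, the number of predictors p, and the error term e are notation assumed here for illustration.

```latex
Y = B_0 + B_1 X_1 + B_2 X_2 + \dots + B_p X_p + e
```

In a generalized linear model the right hand side looks the same, but it is tied to the mean of Y through a link function rather than to Y directly.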

There are subtle distinctions in the meanings of these names, but they are often used interchangeably.  Even worse, statistical software packages use different names for similar concepts, even among their own procedures.  This quest for accuracy often produces confusion.  (It’s hard enough without switching the words!)

Here are some common terms that all refer to a variable in a model that is proposed to affect or predict another variable.

  • Independent Variable: It implies causality:  the independent variable affects the dependent variable.  Used predominantly in ANOVA, but often in regression as well.  It can be either continuous or categorical.
  • Predictor Variable:  It does not imply causality.  A predictor variable is simply useful for predicting the value of the response variable.  Used predominantly in regression.  Predictor variables can be continuous or categorical.
  • Predictor:  Same as Predictor Variable.
  • Covariate:  A continuous predictor variable.  Used in both ANCOVA (analysis of covariance) and regression.  Some people use this to refer to all predictor variables in regression, but it really means continuous predictors.  Adding a covariate to ANOVA (analysis of variance) turns it into ANCOVA.
  • Factor:  A categorical predictor variable.  It may or may not indicate a cause/effect relationship with the response variable (this depends on the study design, not the analysis).  Independent variables in ANOVA are almost always called factors.  In regression, they are often referred to as indicator variables, categorical predictors, or dummy variables.  They are all the same thing in this context.
  • Grouping Variable: Same as a factor.  Used in SPSS in the independent samples t-test.
  • Fixed factor:  A categorical independent variable in which the particular categories are themselves of interest, often chosen by the experimenter.  Examples include experimental treatments or demographic categories, such as sex and race.  If you’re not doing a mixed model (and you should know if you are), all your factors are fixed factors.  For a more thorough explanation of fixed and random factors, see Specifying Fixed and Random Factors in Mixed or Multi-Level Models.
  • Dummy variable:  A categorical variable that has been dummy coded.  Dummy coding (also called indicator coding) is usually used in regression models, but not ANOVA.  A dummy variable can have only two values: 0 and 1.  When a categorical variable has more than two values, it is recoded into multiple dummy variables.
  • Indicator variable: See dummy variable.



Specifying Fixed and Random Factors in Mixed Models

Wednesday, September 24th, 2008

Since SAS introduced Proc Mixed about fifteen years ago, S-Plus, Stata and SPSS have implemented procedures to analyze mixed models, greatly broadening the options available to researchers. These programs require correctly specifying the fixed and random factors of the model to obtain accurate analyses. The definitions in many texts often do not help with decisions to specify factors as fixed or random, since textbook examples are often artificial and hard to apply. Furthermore, the same factor can often be considered fixed or random, depending on the objective. This newsletter outlines a different way to think about fixed and random factors.

Consider an experiment that examines beetle damage on cucumbers. The experiment is replicated at five farms and on four fields at each farm. There are two varieties of cucumbers, and beetle damage is assessed on each of 50 plants at the end of the season. The researcher is interested in comparing differences in how much damage the two varieties sustain. The experiment then has the following factors: VARIETY, FARM, and FIELD.

Fixed factors can be thought of in terms of differences. The effect of a categorical fixed factor is defined by differences from the overall mean and the effect of a continuous fixed factor is defined by its slope–how the mean of the dependent variable differs with alternate values of the factor. The output for fixed factors provides estimates for mean-differences or slopes. Conclusions regarding fixed factors are particular to the values of these factors. For example, if one variety of cucumber is found to suffer significantly less damage than the other, this says nothing about cucumber varieties that were not tested.

Random factors, on the other hand, are defined by a distribution and not by differences. The values of a random factor are assumed to be chosen from a population with a normal distribution with a certain variance. The output for a random factor is an estimate of this variance and not a set of differences from a mean. Conclusions regarding random factors should be expressed in terms of variance. For example, we may find that the variability among fields makes up a certain percentage of the overall variability in beetle damage.
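The contrast between the two views can be sketched with a small balanced example. The damage scores below are invented, and the variance component uses the simple method-of-moments (ANOVA) estimator for a balanced one-way design. That is an assumption made for illustration; mixed-model software typically uses other estimation methods, such as REML.

```python
from statistics import mean

# Hypothetical beetle-damage scores: three fields, four plants each (balanced).
damage = {
    "field_1": [12.0, 14.0, 11.0, 13.0],
    "field_2": [18.0, 17.0, 19.0, 20.0],
    "field_3": [15.0, 13.0, 16.0, 14.0],
}

k = len(damage)                        # number of fields
n = len(next(iter(damage.values())))   # plants per field
grand_mean = mean(v for vals in damage.values() for v in vals)
field_means = {f: mean(vals) for f, vals in damage.items()}

# Fixed-factor view: each field's difference from the overall mean.
differences = {f: m - grand_mean for f, m in field_means.items()}

# Random-factor view: a single variance describing how fields vary.
ms_between = n * sum((m - grand_mean) ** 2 for m in field_means.values()) / (k - 1)
ms_within = sum(
    (v - field_means[f]) ** 2 for f, vals in damage.items() for v in vals
) / (k * (n - 1))
var_field = max((ms_between - ms_within) / n, 0.0)  # truncate at zero

# Share of the total variance attributable to fields.
share = var_field / (var_field + ms_within)

print("Differences from the overall mean:", differences)
print("Estimated variance among fields:", round(var_field, 3))
print("Proportion of variance due to field:", round(share, 3))
```

The fixed-factor output is a list of differences tied to these particular fields. The random-factor output is one variance and the share of overall variability it accounts for, which is the kind of statement that generalizes beyond the fields sampled.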

Situations that indicate fixed factors:

  1. The factor is the primary treatment that the researcher wants to compare. In our example, VARIETY is definitely fixed as the researcher wants to compare the mean beetle damage on the two varieties.
  2. The factor is a secondary covariate that might be confounded with the treatment, and the researcher wants to control for differences in this covariate. If these farms were specifically chosen for some feature they had, such as specific soil types or topographies that may affect beetle damage, and if the researcher would like to compare the farms as representatives of those soil types, then FARM should be fixed.
  3. The factor has only two values. Even if everything else indicates that a factor should be random, if it has only two values, its variance cannot be estimated with any reliability, and it should be fixed.

Situations that indicate random factors:

  1. The researcher is interested in quantifying how much of the overall variation to attribute to this factor. If the researcher were interested in how much of the variation in beetle damage was attributable to the farm at which the damage took place, FARM would be random.
  2. The researcher is not interested in knowing which means differ, but wants to account for the variation in this factor. If the farms were chosen at random, not for a specific feature, but because the researcher suspected that there is some variation in their soil types, which is representative of the variation across all farms, FARM should be random.
  3. The researcher would like to generalize the conclusions about this factor to the whole population. There is nothing about comparing these specific fields that is of interest to the researcher. Rather, the researcher wants to generalize the results of this experiment to all fields, so FIELD is random.
  4. Any interaction with a random factor is also random.

How the factors of a model are specified can have great influence on the results of the analysis and on the conclusions drawn.