Posts Tagged ‘Random Factors’

Confusing Statistical Terms #1: The Many Names of Independent Variables

Monday, November 24th, 2008

Statistical models, such as general linear models (linear regression, ANOVA, mixed models) and generalized linear models (logistic, Poisson, proportional hazard regression, etc.), all have the same general form.  On the left side of the equation is one or more response variables, Y.  On the right hand side is one or more predictor variables, X, and their coefficients, B.  The variables on the right hand side can have many forms and are called by many names.
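
As a rough sketch of that general form (the error term and the link function g are notation added here for illustration, not symbols used in this post), a general linear model and a generalized linear model such as logistic or Poisson regression can be written as:

```latex
% General linear model: response = predictors times coefficients, plus error
Y = XB + \varepsilon

% Generalized linear model: a link function g connects the mean of Y
% to the same right-hand side
g\left(\mathrm{E}[Y]\right) = XB
```

In both cases the right-hand side, XB, is where all of the variables described below appear.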

There are subtle distinctions in the meanings of these names, but they are often used interchangeably.  Even worse, statistical software packages use different names for similar concepts, even among their own procedures.  This quest for accuracy often creates confusion.  (It’s hard enough without switching the words!)

Here are some common terms that all refer to a variable in a model that is proposed to affect or predict another variable.

  • Independent Variable: It implies causality:  the independent variable affects the dependent variable.  Used predominantly in ANOVA, but often in regression as well.  It can be either continuous or categorical.
  • Predictor Variable:  It does not imply causality.  A predictor variable is simply useful for predicting the value of the response variable.  Used predominantly in regression.  Predictor variables can be continuous or categorical.
  • Predictor:  Same as Predictor Variable.
  • Covariate:  A continuous predictor variable.  Used in both ANCOVA (analysis of covariance) and regression.  Some people use this to refer to all predictor variables in regression, but it really means continuous predictors.  Adding a covariate to ANOVA (analysis of variance) turns it into ANCOVA.
  • Factor:  A categorical predictor variable.  It may or may not indicate a cause/effect relationship with the response variable (this depends on the study design, not the analysis).  Independent variables in ANOVA are almost always called factors.  In regression, they are often referred to as indicator variables, categorical predictors, or dummy variables.  They are all the same thing in this context.
  • Grouping Variable: Same as a factor.  Used in SPSS in the independent samples t-test.
  • Fixed factor:  A categorical independent variable whose particular categories are themselves of interest, often because they were chosen by the experimenter.  Examples include experimental treatments or demographic categories, such as sex and race.  If you’re not doing a mixed model (and you should know if you are), all your factors are fixed factors.  For a more thorough explanation of fixed and random factors, see Specifying Fixed and Random Factors in Mixed or Multi-Level Models.
  • Dummy variable:  A categorical variable that has been dummy coded.  Dummy coding (also called indicator coding) is usually used in regression models, but not ANOVA.  A dummy variable can have only two values: 0 and 1.  When a categorical variable has more than two values, it is recoded into multiple dummy variables.
  • Indicator variable: See dummy variable.
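
To make the dummy coding described above concrete, here is a minimal sketch in plain Python. The function name `dummy_code`, the `soil` variable, and its categories are all hypothetical, made up for illustration. It assumes the common convention of leaving out one reference category, so a variable with k categories becomes k − 1 dummy variables.

```python
def dummy_code(values, reference=None):
    """Recode a categorical variable into 0/1 dummy variables.

    The reference category gets no column of its own: it is the case
    where every dummy variable equals 0.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: default to the first level alphabetically
    if reference not in levels:
        raise ValueError(f"reference {reference!r} is not one of {levels}")

    # One 0/1 column per non-reference category.
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Hypothetical three-category predictor.
soil = ["clay", "loam", "sand", "loam", "clay"]
dummies = dummy_code(soil, reference="clay")

for name, column in dummies.items():
    print(name, column)
```

The three soil categories become two dummy variables, `loam` and `sand`. A `clay` observation is identified by both being 0.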



Specifying Fixed and Random Factors in Mixed Models

Wednesday, September 24th, 2008

Since SAS introduced Proc Mixed about fifteen years ago, S-Plus, Stata and SPSS have implemented procedures to analyze mixed models, greatly broadening the options available to researchers. These programs require the fixed and random factors of the model to be specified correctly in order to produce accurate analyses. The definitions in many texts do not help with the decision to specify a factor as fixed or random, since textbook examples are often artificial and hard to apply. Furthermore, the same factor can often be considered fixed or random, depending on the objective. This newsletter outlines a different way to think about fixed and random factors.

Consider an experiment that examines beetle damage on cucumbers. The experiment is replicated at five farms and on four fields at each farm. There are two varieties of cucumbers, and beetle damage is assessed on each of 50 plants at the end of the season. The researcher is interested in comparing differences in how much damage the two varieties sustain. The experiment then has the following factors: VARIETY, FARM, and FIELD.

Fixed factors can be thought of in terms of differences. The effect of a categorical fixed factor is defined by differences from the overall mean, and the effect of a continuous fixed factor is defined by its slope: how the mean of the dependent variable differs across values of the factor. The output for fixed factors provides estimates for mean-differences or slopes. Conclusions regarding fixed factors are particular to the values of these factors. For example, if one variety of cucumber is found to suffer significantly less damage than the other, this says nothing about cucumber varieties that were not tested.

Random factors, on the other hand, are defined by a distribution and not by differences. The values of a random factor are assumed to be chosen from a population with a normal distribution with a certain variance. The output for a random factor is an estimate of this variance and not a set of differences from a mean. Conclusions regarding random factors should be expressed in terms of variance. For example, we may find that the variability among fields makes up a certain percentage of the overall variability in beetle damage.
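
As a rough illustration of expressing a random factor in terms of variance, the sketch below estimates the between-field and within-field variance for a balanced one-way layout using the classical ANOVA (method-of-moments) estimator. This is a swapped-in, simplified technique: mixed-model procedures typically use likelihood-based estimation instead, and the damage scores, the function name `variance_components`, and the three-field layout are invented for illustration only.

```python
from statistics import mean


def variance_components(groups):
    """ANOVA (method-of-moments) estimates for a balanced one-way layout.

    Returns (between_group_variance, within_group_variance).
    """
    k = len(groups)      # number of levels of the random factor
    n = len(groups[0])   # observations per level
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes a balanced design")

    group_means = [mean(g) for g in groups]
    grand_mean = mean(group_means)  # equals the overall mean when balanced

    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (k * (n - 1))

    # E[MS_between] = var_within + n * var_between, so solve for var_between.
    # Truncate at zero because a variance cannot be negative.
    var_between = max((ms_between - ms_within) / n, 0.0)
    return var_between, ms_within


# Made-up damage scores: three fields, four plants each (illustration only).
fields = [
    [10, 12, 11, 13],
    [20, 22, 21, 23],
    [15, 17, 16, 18],
]

var_field, var_residual = variance_components(fields)
share = var_field / (var_field + var_residual)
print(f"field variance:    {var_field:.2f}")
print(f"residual variance: {var_residual:.2f}")
print(f"share of total variability due to field: {share:.1%}")
```

The quantity `share` is the kind of conclusion described above: the proportion of the overall variability attributed to the random factor, rather than a comparison of specific field means.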

Situations that indicate fixed factors:

  1. The factor is the primary treatment that the researcher wants to compare. In our example, VARIETY is definitely fixed as the researcher wants to compare the mean beetle damage on the two varieties.
  2. The factor is a secondary covariate that might be confounded with the treatment, and the researcher wants to control for differences in this covariate. If these farms were specifically chosen for some feature they had, such as specific soil types or topographies that may affect beetle damage, and if the researcher would like to compare the farms as representatives of those soil types, then FARM should be fixed.
  3. The factor has only two values. Even if everything else indicates that a factor should be random, if it has only two values, its variance cannot be estimated reliably, and it should be fixed.

Situations that indicate random factors:

  1. The researcher is interested in quantifying how much of the overall variation to attribute to this factor. If the researcher were interested in how much of the variation in beetle damage was attributable to the farm at which the damage took place, FARM would be random.
  2. The researcher is not interested in knowing which means differ, but wants to account for the variation in this factor. If the farms were chosen at random rather than for a specific feature, and the researcher simply expects their soil types to vary in a way that is representative of the variation across all farms, FARM should be random.
  3. The researcher would like to generalize the conclusions about this factor to the whole population. There is nothing about comparing these specific fields that is of interest to the researcher. Rather, the researcher wants to generalize the results of this experiment to all fields, so FIELD is random.
  4. Any interaction with a random factor is also random.
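
Putting these rules together, one possible specification for the cucumber example can be sketched as follows. This is only an illustration under stated assumptions: VARIETY is fixed, FARM and FIELD (nested within farm) are random, both varieties are grown in every field, and the subscripts and variance symbols are notation introduced here rather than taken from the newsletter.

```latex
% i = variety, j = farm, k = field within farm, l = plant
\begin{align*}
y_{ijkl} &= \mu + \alpha_i + f_j + g_{k(j)} + (\alpha f)_{ij} + e_{ijkl} \\
\alpha_i &: \text{fixed effect of VARIETY (a difference from the overall mean } \mu\text{)} \\
f_j &\sim N(0, \sigma^2_{\text{farm}}) \\
g_{k(j)} &\sim N(0, \sigma^2_{\text{field}}) \\
(\alpha f)_{ij} &\sim N(0, \sigma^2_{\text{variety} \times \text{farm}}) \\
e_{ijkl} &\sim N(0, \sigma^2_{\text{residual}})
\end{align*}
```

The VARIETY term is reported as a mean difference, while each random term, including the variety-by-farm interaction, is reported as a variance.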

How the factors of a model are specified can have great influence on the results of the analysis and on the conclusions drawn.