Finding statistical models for analyzing your data


Foreword

Just to be clear: deciding which statistical model to use or which statistical test to perform is NOT something to do once you have gathered your data; it is part of the design of the experiment itself! That being said, it is unfortunately well known that the quest for the right model/test often starts at the last moment... when results need to be interpreted.

The content of this page assumes that you are in the design phase of your study. Nonetheless, you will also find it useful if you are looking for help with analysing an existing data set.

This key will help you determine which model or family of models you may use to analyze your data. To use this tool, you need to know whether your data is clustered or not, which distribution describes it, and which type(s) of variables are present. You will find more info about variables, data clustering, distribution and hypothesis formulation further below.

Univariate analysis refers to the simplest form of statistical analysis where a single response variable is considered. In this type of analysis, you investigate whether that single response variable is affected by one or several predictor variables.

Now choose between a Gaussian (normal) and a non-normal distribution.

See the section "Determine the Data Distribution" below for more info on data distribution.

Formulate a Hypothesis

The first and possibly most important question when planning an experiment with subsequent data analysis is "what is the working hypothesis?". Often, the experimenter formulates the hypothesis that there exists a difference, a relationship or a correlation between two or more groups based on at least one variable. It is this hypothesis (called H1) which will determine the design of the experiment, the population(s) to study, the groups or treatments, the variables, etc. If the hypothesis is unclear or not formulated correctly, chances are great that the experimental design (and data collection) will be sub-optimal, or even inadequate, and that parts of the study (at best) will be discarded. Hence our strong advice: clearly state your hypothesis! Once you have done so, you "logically" have the null hypothesis H0, i.e. the hypothesis that the difference, relationship or correlation that you "hope for" does not exist. It is this null hypothesis H0 that most statistical tests take as a starting point.

Example 1:

  • H1: the average size of Benthosema glaciale is different in Masfjorden and Lustrafjorden
  • H0: the average size of Benthosema glaciale is NOT different in Masfjorden and Lustrafjorden

Example 2:

  • H1: the reproductive success of the bean beetle Callosobruchus maculatus increases with temperature.
  • H0: the reproductive success of the bean beetle Callosobruchus maculatus does NOT increase with temperature.

 
Note that, in this case, the experimenter investigates the impact of temperature on the reproductive success of the bean beetle. The formulation of H1 is particular since it states that the effect is limited to an increase, not a difference in either direction (decrease OR increase). Thus, the results support H1 only if they show an increase; if they show either a decrease or an absence of effect, the experimenter cannot reject H0.
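The distinction between a two-sided H1 (as in Example 1) and a one-sided H1 (as in Example 2) can be illustrated with a minimal Python sketch. It is not a recommended analysis for these studies: it uses an exact permutation test on the difference in means, chosen here only because it needs nothing beyond the standard library. The function name permutation_pvalue, the variable names and all data values are invented for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_pvalue(group_a, group_b, alternative="two-sided"):
    """Exact permutation test on the difference in means (group_b - group_a).

    alternative="two-sided": H1 states that the means differ.
    alternative="greater":   H1 states that the mean of group_b is larger.
    """
    if alternative not in ("two-sided", "greater"):
        raise ValueError("alternative must be 'two-sided' or 'greater'")
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))
    observed = mean(group_b) - mean(group_a)
    tol = 1e-12  # guards against floating-point ties
    extreme = 0
    total = 0
    # Enumerate every way of relabelling the pooled observations.
    for idx_a in combinations(indices, len(group_a)):
        chosen = set(idx_a)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen]
        diff = mean(b) - mean(a)
        if alternative == "two-sided":
            extreme += abs(diff) >= abs(observed) - tol
        else:
            extreme += diff >= observed - tol
        total += 1
    return extreme / total

# Hypothetical offspring counts per female (invented values).
low_temp = [12, 15, 11, 14, 13]
high_temp = [18, 17, 20, 16, 19]

p_two_sided = permutation_pvalue(low_temp, high_temp, "two-sided")
p_one_sided = permutation_pvalue(low_temp, high_temp, "greater")
# Same data with the groups swapped: the effect goes against the one-sided H1.
p_wrong_direction = permutation_pvalue(high_temp, low_temp, "greater")

print(p_two_sided, p_one_sided, p_wrong_direction)
```

With a one-sided H1, an effect in the opposite direction yields a large p-value, so H0 is not rejected, which is the situation described above for a decrease in reproductive success.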

Know Your Variables

It is essential to understand what type of variable(s) you are dealing with, and how you will use them, before starting the analysis. Variable types are numerous: predictor, response, continuous, ordinal, dependent, independent, etc., and some of these terms are synonymous.

Data Clustering/Dependency

Before starting the analysis, you must determine whether there may exist a form of dependency or clustering in your data, meaning that the values of some observations are dependent due to, for example, experimental design or sampling conditions. Here, the question is whether there is a clustering factor that may be responsible for coordinated variation of one or more measured variables in several observations. Remember that this is of great importance when choosing a model for your analysis, since some models assume that observations are independent, while others accept dependency.

A nested design is a typical example of clustering. Let's consider a study where the number of eggs laid by 10 female hawks is measured at three different locations during a single season. Each hawk is found at a specific location and it is plausible that most, if not all, hawks from a single location are affected by local conditions (food availability, human presence, etc.) which may have an effect on the reproductive capacity of those birds. Thus each location is a cluster and observations within each cluster must be considered as dependent.
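One common way to gauge how strongly observations within a cluster resemble each other is the intraclass correlation (ICC), which is not discussed on this page and is shown here only as an illustration. The sketch below uses the one-way ANOVA estimator for a balanced design and only the Python standard library. The egg counts, the number of hawks per location and the function name intraclass_correlation are all invented.

```python
from statistics import mean

# Hypothetical egg counts (invented values), grouped by location.
eggs_by_location = {
    "location_1": [2, 3, 2, 3],
    "location_2": [4, 5, 4, 5],
    "location_3": [1, 2, 1, 2],
}

def intraclass_correlation(clusters):
    """One-way ANOVA estimate of the ICC for a balanced design."""
    groups = list(clusters.values())
    k = len(groups[0])  # observations per cluster
    if any(len(g) != k for g in groups):
        raise ValueError("this sketch assumes equal cluster sizes")
    n_groups = len(groups)
    grand_mean = mean(v for g in groups for v in g)
    # Variation of the cluster means around the grand mean.
    ss_between = k * sum((mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own cluster mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_groups * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

icc = intraclass_correlation(eggs_by_location)
print(round(icc, 3))
```

A value close to 1 indicates that most of the variation lies between locations, so hawks from the same location cannot reasonably be treated as independent observations; a value close to 0 indicates little clustering.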

Repeated measurement is also a typical example of clustering. Let's consider a study involving several rats whose body weight is measured at regular intervals. Since individual rats might have different growth rates for various reasons (dominant behavior, health conditions, etc.), all measurements specific to one individual are dependent and the individual itself is the cluster.
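The structure of such repeated measurements can be sketched in Python as follows. The weights, the measurement schedule and the helper least_squares_slope are invented for illustration. Reducing each rat to a single growth rate is only one simple way of making the individual the unit of analysis, not necessarily the approach to prefer.

```python
# Hypothetical body weights in grams (invented values), one list per rat,
# all measured at the same weekly intervals.
weeks = [0, 1, 2, 3]
weights_by_rat = {
    "rat_A": [200, 210, 220, 230],
    "rat_B": [200, 205, 210, 215],
    "rat_C": [190, 205, 220, 235],
}

def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line of y against x."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    denominator = sum((xi - x_mean) ** 2 for xi in x)
    return numerator / denominator

# One growth rate (grams per week) per rat: each cluster yields one value.
growth_rates = {
    rat: least_squares_slope(weeks, weights)
    for rat, weights in weights_by_rat.items()
}

n_measurements = sum(len(w) for w in weights_by_rat.values())
n_independent_units = len(weights_by_rat)
print(growth_rates, n_measurements, n_independent_units)
```

The data set contains twelve measurements but only three independent units (the rats), which is why a model that assumes independent observations would be inappropriate here.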

Determine the Data Distribution

Another essential feature of your data that is crucial when choosing a statistical model or test is its distribution. Often, you will see that the data follows a normal (Gaussian) distribution. Non-normal distributions include the Poisson and binomial distributions. Here is more info about them.