Finding statistical models for analyzing your data

Let us be clear from the start: choosing a statistical model or a statistical test is NOT something to do once you have gathered your data; it is part of the design of your experiment. That being said, it is unfortunately common for the search for the right model or test to start at the last moment, when results need to be interpreted.

The content of this page is based on the assumption that you are in the design phase of your study. Nonetheless, you will also find it useful if you are looking for help with analyzing an existing data set.

Univariate Analysis

Multivariate Analysis

Gaussian/Normal Distribution

Non-Normal Distribution

The first and possibly most important question when planning an experiment with subsequent data analysis is "what is the working hypothesis?". Often, the experimenter formulates the hypothesis that there exists a difference, a relationship or a correlation between two or more groups based on at least one variable. It is this hypothesis (called H₁) which will determine the design of the experiment, the population(s) to study, the groups or treatments, the variables, etc. If the hypothesis is unclear or incorrectly formulated, chances are great that the experimental design (and data collection) will be sub-optimal, or even inadequate, and that parts of the study (at best) will have to be discarded. Hence our strong advice: clearly state your hypothesis! Once you have done so, you "logically" have the null hypothesis H₀, i.e. the hypothesis that the difference, relationship or correlation that you "hope for" does not exist. It is this null hypothesis H₀ that most statistical tests take as a starting point.

Example 2:

H₁: the reproductive success of the bean beetle Callosobruchus maculatus increases with temperature.
H₀: the reproductive success of the bean beetle Callosobruchus maculatus does NOT increase with temperature.

Note that, in this case, the experimenter investigates the impact of temperature on the reproductive success of the bean beetle. The formulation of H₁ is particular since it states that the effect is limited to an increase, not merely a difference (which could be a decrease OR an increase). This makes it a one-sided (directional) hypothesis. Thus, the experimenter cannot reject H₀ in favor of H₁ if the results show either a decrease or an absence of effect.
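To illustrate what a directional H₁ means in practice, here is a minimal sketch of a one-sided permutation test on the slope of reproductive success against temperature. The data values, the function names (`slope`, `one_sided_permutation_p`) and the choice of a permutation test are all assumptions made for illustration; they are not taken from any actual bean beetle study. The shuffling simulates "no association", which is the boundary case of H₀.

```python
import random

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def one_sided_permutation_p(x, y, n_perm=10_000, seed=0):
    """One-sided p-value for H1 'slope > 0'.

    Only slopes at least as large as the observed one count as evidence,
    so a strong decrease can never lead to rejecting H0.
    """
    rng = random.Random(seed)
    observed = slope(x, y)
    y_shuffled = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)  # break any link between x and y
        if slope(x, y_shuffled) >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_perm + 1)

# Hypothetical data: temperature (degrees C) and eggs laid per female.
temperature = [20, 22, 24, 26, 28, 30, 32, 34]
eggs = [31, 35, 34, 40, 43, 41, 48, 50]

p_increase = one_sided_permutation_p(temperature, eggs)
# Same values in reverse order: success now DEcreases with temperature.
p_if_decrease = one_sided_permutation_p(temperature, eggs[::-1])

print(f"p (increasing data): {p_increase:.4f}")
print(f"p (decreasing data): {p_if_decrease:.4f}")
```

With the increasing data the p-value is small, whereas the reversed data give a p-value close to 1: under a one-sided H₁, a decrease is treated exactly like an absence of effect.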

Categorical vs. Continuous Variables

Categorical Variables

Continuous Variables

"Tricky" Cases

Predictor vs. Response Variables

Before starting the analysis, you must determine whether there is some form of dependency or clustering in your data, meaning that the values of some observations depend on each other because of, for example, the experimental design or the sampling conditions. Here, the question is whether there is a clustering factor that may be responsible for coordinated variation of one or more measured variables across several observations. This is of great importance when choosing a model for your analysis, since some models assume that observations are independent, while others can accommodate dependency.

A nested design is a typical example of clustering. Let's consider a study where the number of eggs laid by 10 female hawks is measured at three different locations during a single season. Each hawk is found at one specific location, and it is plausible that most, if not all, hawks from a given location are affected by local conditions (food availability, human presence, etc.) which may have an effect on the reproductive capacity of those birds. Thus each location is a cluster, and observations within each cluster must be considered dependent.
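One way to get a feel for how strongly observations cluster is the intraclass correlation (ICC), the share of the total variance that is attributable to differences between clusters. The sketch below computes the one-way ANOVA estimate of the ICC for equal-sized clusters. The egg counts, the location labels and the function name `intraclass_correlation` are invented for illustration only.

```python
from statistics import mean

def intraclass_correlation(groups):
    """One-way ANOVA estimate of the ICC for equal-sized clusters."""
    k = len(groups[0])  # observations per cluster
    if any(len(g) != k for g in groups):
        raise ValueError("this sketch assumes equal-sized clusters")
    n_groups = len(groups)
    grand_mean = mean(v for g in groups for v in g)
    group_means = [mean(g) for g in groups]
    # Variation of cluster means around the grand mean.
    ss_between = k * sum((m - grand_mean) ** 2 for m in group_means)
    # Variation of observations around their own cluster mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_groups * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical number of eggs laid per female hawk, grouped by location.
eggs_by_location = {
    "location_A": [2, 3, 2, 3, 2],
    "location_B": [4, 4, 5, 4, 5],
    "location_C": [3, 3, 2, 3, 4],
}

icc = intraclass_correlation(list(eggs_by_location.values()))
print(f"ICC = {icc:.2f}")
```

An ICC near 0 suggests that location explains little of the variation, while a value approaching 1 indicates that hawks from the same location resemble each other strongly, so a model that assumes independent observations would be a poor choice.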

Normal/Gaussian Distribution

Poisson Distribution

Binomial Distribution

When biology adds up, at last…
