Whereas one-way ANOVA allows for comparison of three and more group means based on the different levels of a single factor, factorial design allows for comparison of groups based on several independent variables and their various levels. Thus, comparing data (such as tree size, egg size…) split in groups according to two variables (for example location and species) may be done via a two-way ANOVA; if data are split in groups according to three independent variables (for example location, species and date of measurement), then a three-way ANOVA may be used… Further below, we’ll essentially talk about two-way ANOVA essentially, but you’ll quickly understand how the functions may be adapted to more factors.
Using multi-way ANOVA, we get to assess main effects, in other words the impact of single, independent factors on something we are interested in (plant size, egg number, individual weight…). We also get to see whether there exist interactions between factors, in other words whether the effect of a specific factor is influenced by (one of the) other factors.
As for one-way ANOVA, the functions to use are
aov(). Again, use
aov() only in case of balanced design (equal number of observations in the groups). In this function, the first argument to mention is the dependent variable (what has been measured) followed by
~ and the independent variables of the dataset. IF the independent variables are separated by a
* (for example
factor1*factor2), both interactions and main effects will be tested; IF independent variables are separated by a
+ (for example
factor1+factor2), only main effects will be tested.
Let’s see that in an example close to the one that we looked at in One-way ANOVA. Here, we check whether the average size of blue ground beetles (Carabus intricatus) differs depending on their location. We now introduce a new factor: measurements were performed in 2005 and in 2015 at the same 3 forests A, B and C. In each location, we measured the size (in millimeters) of 10 individuals (making it a balanced design). In total, 60 individuals were measured. The two factors are
Here is the code for creating the dataframe:
size <- c(25,22,28,24,26,24,22,21,23,25,26,30,25,24,21,27,28,23,25,24,20,22,24,23,22,24,20,19,21,22,24,27,26,24,25,27,22,28,25,24,27,29,26,27,25,27,28,24,24,26,21,23,25,20,25,23,25,19,22,21) location <- as.factor(c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10), rep("ForestA",10), rep("ForestB",10), rep("ForestC",10))) year <- as.factor(c(rep("2005",30), rep("2015",30))) my.dataframe <- data.frame(size,location,year) my.dataframe
The resulting dataframe looks like this (click to enlarge):
The assumptions are globally the same as for one-way ANOVA:
- independence of observations (each individual is represented by 1 entry/measurement ONLY)
- normality of distribution (to be tested for each group, for example with the Shapiro-Wilk test)
- homogeneity of variance (to be tested with, for example, Levene’s test).
We start by visualizing the data with boxplots. This time, since we have several factors, we’ll may split charts according to these factors. The following “simple” code takes care of the problem: it first gathers all data for each of the locations regardless of the factor
year, and then for each year regardless of
location (basically it makes plots based on single factors).
par(mfrow=c(1,2)) plot(size ~ location + year, data=my.dataframe)
If you prefer a more classic boxplot with all 6 groups/combinations, go for this code:
boxplot(size ~ location*year, data=my.dataframe)
Before going further, we need to check for normality of distribution with the Shapiro-Wilk test. Since we have 6 groups to check based on two parameters in our dataframe, the syntax is a bit more complicated than usual. For each group , we need to restrict the test to the data matching both a specific
location and a specific
year. We write it like this:
shapiro.test(my.dataframe$size[location=="ForestA" &amp; year=="2005"]) shapiro.test(my.dataframe$size[location=="ForestB" &amp; year=="2005"]) shapiro.test(my.dataframe$size[location=="ForestC" &amp; year=="2005"]) shapiro.test(my.dataframe$size[location=="ForestA" &amp; year=="2015"]) shapiro.test(my.dataframe$size[location=="ForestB" &amp; year=="2015"]) shapiro.test(my.dataframe$size[location=="ForestC" &amp; year=="2015"])
None of the tests shows a p-value under 0.05. The null hypothesis stating that the data in the groups are normally distributed is thus accepted.
Let’s go on with Levene’s test for homogeneity of variance. Do not forget to activate the package car prior to the test otherwise “Error: could not find function “leveneTest” will be the only answer that you will get:
library(car) leveneTest(size~location*year, data=my.dataframe, center=mean)
The p-value is largely greater than 0.05. We thus assume that variances are equal.
Let’s go for the ANOVA test!
We first have a look at the option using
results.interaction <- aov(size ~ location*year) summary(results.interaction)
The table above shows the F value and p-value for each of the main effects,
year, as well as for the interaction
location:year. It also conveniently indicates with stars where significance levels are reached. For instance,
size is significantly affected by the factor
location, and the p-value is very small (p<0.001, ***).
Let’s have a look at the option using
results.interaction.lm <- lm(size ~ location*year) anova(results.interaction.lm)
Again you notice that these results are very similar to those provided by
aov() and the conclusion about rejecting the null hypothesis is the same.
But this does not tell us anything about the groups which means are significantly different…
Indeed, this test tells you about main effects and interactions, but does not tell you which groups are significantly different from others. For that, we will need post hoc tests such as Tukey HSD.