One-way ANOVA is a parametric test designed to compare the means of three or more groups. The null hypothesis states that the means of all groups to be tested are equal. As usual, the test returns a p-value, and you decide whether or not to reject the null hypothesis depending on this p-value.
The test relies on the following assumptions:
- independence of observations (each individual is represented by 1 entry/measurement ONLY)
- normality of distribution (to be tested for each group, for example with the Shapiro-Wilk test)
- homogeneity of variance (to be tested with, for example, Levene’s test).
In R, the first option is to use lm() followed by anova(). This option fits a linear model and will work in virtually all cases. A second option involves the function aov(); note, however, that aov() is designed for balanced designs (where the number of observations is the same for all groups), and its results can be hard to interpret when the design is unbalanced.
Let’s take an example. Here, let’s say that we want to check whether the average size of blue ground beetles (Carabus intricatus) differs depending on their location. We consider 3 different locations, for example 3 forests beautifully named A, B and C. In each location, we measure the size (in millimeters) of 10 individuals.
In Excel, the table containing the data would look like this:
To create the corresponding dataframe in R, use the following code:
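# sizes in mm: 10 individuals per forest, in the order A, B, C
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
# grouping factor matching the order of the measurements
location <- as.factor(c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10)))
my.dataframe <- data.frame(size,location)
my.dataframe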
and the resulting dataframe is:
It is always nice and useful to get an overview of the whole dataset, so let’s plot the data:
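A minimal sketch of such a plot, assuming base R graphics and a boxplot with the individual points overlaid (the axis labels and the jittered points are illustrative choices, not necessarily those of the original figure):

```r
# Recreate the example data so that this block runs on its own
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- factor(rep(c("ForestA", "ForestB", "ForestC"), each = 10))
my.dataframe <- data.frame(size, location)

# One box per location; boxplot() also returns the statistics it draws
bp <- boxplot(size ~ location, data = my.dataframe,
              xlab = "Location", ylab = "Size (mm)")

# Overlay the individual measurements, jittered to reduce overlap
stripchart(size ~ location, data = my.dataframe,
           vertical = TRUE, method = "jitter", pch = 16, add = TRUE)
```

A boxplot is convenient here because it shows the median and spread of each group side by side, which gives a first visual impression of both the group differences and the variances.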
Now we need to check the assumptions of normality of distribution and homogeneity of variance. We thus run the Shapiro-Wilk test on each group and then Levene’s test (for which you will need to load/activate the package car via the command library(car)):
library(car)  # provides leveneTest()
shapiro.test(my.dataframe$size[location=="ForestA"])
shapiro.test(my.dataframe$size[location=="ForestB"])
shapiro.test(my.dataframe$size[location=="ForestC"])
leveneTest(size~location, data=my.dataframe, center=mean)
So, each of the 3 groups (ForestA, ForestB and ForestC) can be assumed to come from a normal distribution since the p-value of the Shapiro-Wilk test is greater than 0.05 in each case; additionally, the variances are not significantly different according to Levene’s test (p-value greater than 0.05).
Note: if you are a bit confused about the way data/groups are retrieved for running the Shapiro-Wilk test, here is a quick explanation. Let’s consider the group ForestA: we need to tell the function to retrieve all size data located in the object my.dataframe (hence my.dataframe$size), but we need to restrict the selection to data matching the criterion ForestA only (hence [location=="ForestA"], where location is the factor created earlier; my.dataframe$location works equally well). Putting everything together, we write my.dataframe$size[location=="ForestA"].
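The subsetting described in this note can be sketched as follows. The block also shows one possible way to run the Shapiro-Wilk test on all groups in a single call with tapply(); the object names size.A and p.values are hypothetical and only used for illustration.

```r
# Recreate the example data so that this block runs on its own
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- factor(rep(c("ForestA", "ForestB", "ForestC"), each = 10))
my.dataframe <- data.frame(size, location)

# Keep only the sizes whose location is ForestA, then test that group
size.A <- my.dataframe$size[my.dataframe$location == "ForestA"]
shapiro.test(size.A)

# Same test applied to every group; keep only the p-values
p.values <- tapply(my.dataframe$size, my.dataframe$location,
                   function(x) shapiro.test(x)$p.value)
p.values
```

Referring to my.dataframe$location rather than the free-standing vector location keeps the code working even if the original vectors have been removed from the workspace.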
Let’s see how to run the ANOVA
We consider the first option using lm(). The syntax is lm(variable ~ groups, data=dataframe) where variable is the vector that contains the response variable, groups is the vector that contains the grouping variable or factor (which categorizes the observations) and dataframe is the name of the dataframe that contains the data. We first fit a linear model with lm() and store the results in the object results.lm, then print the ANOVA table using anova():
results.lm <- lm(size~location, data=my.dataframe)
anova(results.lm)
This output provides you with the F-value (7.1101) and the corresponding p-value (0.003307). Since the p-value is below 0.05, the null hypothesis stating that the means of the groups are equal is rejected.
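To see where the F-value comes from, here is a rough sketch that recomputes it by hand from the between-group and within-group sums of squares. The variable names (ss.between, ss.within, f.value, p.value) are hypothetical; the result should agree with the ANOVA table printed by anova().

```r
# Recreate the example data so that this block runs on its own
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- factor(rep(c("ForestA", "ForestB", "ForestC"), each = 10))
my.dataframe <- data.frame(size, location)

k <- nlevels(my.dataframe$location)  # number of groups
n <- nrow(my.dataframe)              # total number of observations

grand.mean  <- mean(my.dataframe$size)
group.means <- tapply(my.dataframe$size, my.dataframe$location, mean)
group.sizes <- tapply(my.dataframe$size, my.dataframe$location, length)

# Between-group SS: spread of the group means around the grand mean
ss.between <- sum(group.sizes * (group.means - grand.mean)^2)

# Within-group SS: spread of each observation around its own group mean
# (ave() returns the group mean for every row)
ss.within <- sum((my.dataframe$size -
                  ave(my.dataframe$size, my.dataframe$location))^2)

# F is the ratio of the two mean squares
f.value <- (ss.between / (k - 1)) / (ss.within / (n - k))
p.value <- pf(f.value, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
c(F = f.value, p = p.value)
```

A large F means that the group means vary more than would be expected from the variation inside the groups alone.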
The second option is to run the ANOVA on the dataframe with aov(). The syntax is very similar to that of lm(). Here, we store the results in the object results, then we print the ANOVA table using summary():
results <- aov(size~location, data=my.dataframe)
summary(results)
This output gives the value of the F statistic (here F=7.11) and the p-value (0.00331). These match the results obtained with lm() up to rounding, which is expected since aov() fits the model through lm() internally. Here, the ANOVA tells us that the null hypothesis is to be rejected and that at least one group mean differs significantly from the others, nothing more.
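As a quick check, the F-values from the two approaches can be extracted and compared directly. This is only a sketch: the names f.lm and f.aov are hypothetical, and the column name "F value" is the one used in the tables returned by anova() and summary().

```r
# Recreate the example data so that this block runs on its own
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- factor(rep(c("ForestA", "ForestB", "ForestC"), each = 10))
my.dataframe <- data.frame(size, location)

# F-value from the lm() + anova() route
f.lm <- anova(lm(size ~ location, data = my.dataframe))[["F value"]][1]

# F-value from the aov() + summary() route
# (summary() of an aov fit is a list whose first element is the table)
f.aov <- summary(aov(size ~ location, data = my.dataframe))[[1]][["F value"]][1]

# TRUE if both routes agree up to numerical tolerance
all.equal(f.lm, f.aov)
```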
But this does not tell us which groups have significantly different means…
Indeed, the ANOVA needs to be followed by another test if we want to check which of the groups are different from the others. For that we’ll need a post-hoc test, possibly a pairwise t-test or a Tukey HSD.
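A minimal sketch of both post-hoc options, using functions from base R. The Bonferroni correction in pairwise.t.test() is an assumption made for illustration; other adjustment methods are available through the p.adjust.method argument.

```r
# Recreate the example data so that this block runs on its own
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- factor(rep(c("ForestA", "ForestB", "ForestC"), each = 10))
my.dataframe <- data.frame(size, location)

# Tukey HSD works on a model fitted with aov()
results <- aov(size ~ location, data = my.dataframe)
tukey <- TukeyHSD(results)
tukey  # one row per pair: difference, confidence interval, adjusted p-value

# Pairwise t-tests with p-values adjusted for multiple comparisons
pw <- pairwise.t.test(my.dataframe$size, my.dataframe$location,
                      p.adjust.method = "bonferroni")
pw
```

In the Tukey output, each row is named after the pair being compared (for example ForestB-ForestA), and a pair is considered significantly different when its adjusted p-value is below the chosen threshold.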
What to do if the assumption of normality is not met?
In this case you may simply apply the non-parametric Kruskal-Wallis test.
The syntax is the following:
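A minimal sketch of the call, assuming the same my.dataframe as in the example above; kruskal.test() is part of base R and accepts the same formula notation as lm() and aov(). The object name kw is hypothetical.

```r
# Recreate the example data so that this block runs on its own
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- factor(rep(c("ForestA", "ForestB", "ForestC"), each = 10))
my.dataframe <- data.frame(size, location)

# Rank-based alternative to one-way ANOVA
kw <- kruskal.test(size ~ location, data = my.dataframe)
kw  # prints the chi-squared statistic, the degrees of freedom and the p-value
```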
and the output looks like this:
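Again, the test shows that the null hypothesis may be rejected: at least one group differs from the others. Note that the Kruskal-Wallis test is rank-based, so strictly speaking it compares the distributions of the groups rather than their means.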