One-way ANOVA is a parametric test designed to **compare the means of three or more groups.** The null hypothesis states that the means of all groups to be tested are equal. The test returns a p-value, and based on this p-value you decide whether or not to reject the null hypothesis.

The assumptions are:

- **independence** of observations (each individual is represented by one entry/measurement ONLY)
- **normality of distribution** (to be tested for each group, for example with the Shapiro-Wilk test)
- **homogeneity of variance** (to be tested with, for example, Levene’s test)

The function to use in R is `lm()` followed by `anova()`. This option fits a linear model and will work in virtually all cases. A second option involves the function `aov()`; however, note that this option is restricted to *balanced designs* (where groups have equal numbers of entries, i.e. the number of observations is the same for all groups).

Let’s take an example. Here, let’s say that we want to check whether the average size of blue ground beetles (*Carabus intricatus*) differs depending on their location. We consider 3 different locations, for example 3 forests beautifully named A, B and C. In each location, we measure the size (in millimeters) of 10 individuals.


To create the corresponding dataframe in R, use the following code:

```r
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- as.factor(c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10)))
my.dataframe <- data.frame(size, location)
my.dataframe
```

The resulting dataframe contains 30 rows (one per individual) with the two columns `size` and `location`.

It is always nice and useful to get an overview of the whole dataset, so let’s plot the data:

```r
plot(size~location, data=my.dataframe)
```

Now we need to check the assumptions of **normality of distribution and homogeneity of variance**. We thus run the Shapiro-Wilk test on each group, and then Levene’s test (for which you will need to load the package `car` via the command `library(car)`).

```r
library(car)
shapiro.test(my.dataframe$size[location=="ForestA"])
shapiro.test(my.dataframe$size[location=="ForestB"])
shapiro.test(my.dataframe$size[location=="ForestC"])
leveneTest(size~location, data=my.dataframe, center=mean)
```

So, each of the 3 groups (`ForestA`, `ForestB` and `ForestC`) is assumed to come from a normal distribution since the p-value of the Shapiro-Wilk test is greater than 0.05; additionally, the variances are not significantly different according to Levene’s test (p-value greater than 0.05).

**Note:** if you are a bit confused about the way data/groups are retrieved for running the Shapiro-Wilk test, here is a quick explanation. Let’s consider the group ForestA: we need to tell the function to retrieve all `size` data located in the object `my.dataframe` (hence `my.dataframe$size`), but we need to restrict the selection to the data matching the criterion ForestA only (hence `[location=="ForestA"]`). Putting everything together, we write `my.dataframe$size[location=="ForestA"]` inside `shapiro.test()`.

**Let’s see how to run the ANOVA**

We consider the first option using `lm()`. The syntax is `lm(variable ~ groups, data=dataframe)`, where `variable` is the vector that contains the response variable, `groups` is the vector that contains the grouping variable or factor (which categorizes the observations) and `dataframe` is the name of the dataframe that contains the data. We first fit a linear model with `lm()`, store the results in the object `results.lm`, and then print them out using `anova()`:

```r
results.lm <- lm(size~location, data=my.dataframe)
anova(results.lm)
```

This output provides the F-value (7.1101) and the corresponding p-value (0.003307). Since the p-value is below 0.05, the hypothesis stating that the means of the groups are equal is to be rejected.
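To see where this F-value comes from, it can be recomputed by hand from the between-group and within-group sums of squares. A minimal sketch, reusing the data from above:

```r
# Recreate the data from above
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- as.factor(c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10)))

k <- nlevels(location)   # number of groups: 3
n <- length(size)        # total number of observations: 30
group.means <- tapply(size, location, mean)
grand.mean  <- mean(size)

# Between-group sum of squares (10 observations per group)
ss.between <- sum(10 * (group.means - grand.mean)^2)
# Within-group sum of squares (deviations from each group's own mean)
ss.within  <- sum((size - group.means[location])^2)

# F = mean square between / mean square within
F.value <- (ss.between / (k - 1)) / (ss.within / (n - k))
F.value  # about 7.11, matching the anova() output
```

The degrees of freedom are k − 1 = 2 for the between-group term and n − k = 27 for the within-group term, exactly the values shown in the ANOVA table.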

The second option runs the ANOVA on the dataframe with `aov()`. The syntax is very similar to `lm()`. Here, we store the results in the object `results`, then we print a summary of `results` using `summary(results)`:

```r
results <- aov(size~location, data=my.dataframe)
summary(results)
```

This output gives the F statistic (here F=7.11) and the p-value (0.00331); you will quickly notice that these are very close to the results obtained with `lm()`, at least in this example. Here, the ANOVA tells us that the null hypothesis is to be rejected and that there is a significant difference between some of the groups, nothing more.

**But this does not tell us which group means are significantly different…**

Indeed, the ANOVA needs to be followed by another test if we want to check which of the groups are different from the others. For that we’ll need a *post-hoc* test, possibly a pairwise t-test or a Tukey HSD.
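As a sketch of such a post-hoc step (reusing the objects built earlier in this post): Tukey’s HSD can be applied to an `aov()` fit via `TukeyHSD()`, and `pairwise.t.test()` runs all pairwise comparisons with adjusted p-values.

```r
# Rebuild the dataframe from above so this example is self-contained
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- as.factor(c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10)))
my.dataframe <- data.frame(size, location)

# Tukey HSD compares every pair of groups, starting from an aov() fit
results <- aov(size ~ location, data = my.dataframe)
TukeyHSD(results)

# Alternative: pairwise t-tests with Holm-adjusted p-values (the default)
pairwise.t.test(my.dataframe$size, my.dataframe$location)
```

With 3 groups, both approaches report the three pairwise comparisons (B vs A, C vs A, C vs B) along with their adjusted p-values.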

**What to do if the assumption of normality is not met?**

In this case you may simply apply the non-parametric Kruskal-Wallis test.

The syntax is the following:

```r
kruskal.test(size~location, data=my.dataframe)
```


Again, the test shows that the null hypothesis may be rejected: there are differences between the groups.
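If you then want to know which groups differ in the non-parametric case, one possible follow-up (a sketch, with Holm-adjusted p-values) is pairwise Wilcoxon rank-sum tests:

```r
# Same data as above
size <- c(25,22,28,24,26,24,22,21,23,25,
          26,30,25,24,21,27,28,23,25,24,
          20,22,24,23,22,24,20,19,21,22)
location <- as.factor(c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10)))

# Pairwise Wilcoxon rank-sum tests; because the data contain ties, R
# falls back to a normal approximation and emits warnings, which is expected
pairwise.wilcox.test(size, location, p.adjust.method = "holm")
```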