The dplyr function summarize() (which may also be written summarise()) creates a table containing the result(s) of the summary function(s) you choose to apply to a data frame. The summary functions may be:
- mean(): which returns the mean of a variable,
- sd(): which returns the standard deviation of a variable,
- median(): which returns the median of a variable,
- min(): which returns the minimum value of a variable,
- max(): which returns the maximum value of a variable,
- var(): which returns the variance of a variable,
- sum(): which returns the sum of a variable,
- etc.
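What these functions have in common is that each one reduces a whole vector of values to a single number, which is what summarise() needs. As a minimal sketch, here they are applied directly to a small vector; the vector x is hypothetical and used only for illustration.

```r
# Hypothetical measurements, not taken from any data set in this text
x <- c(2, 4, 4, 6, 9)

mean(x)    # arithmetic mean
sd(x)      # sample standard deviation (n - 1 denominator)
median(x)  # middle value of the sorted vector
min(x)     # smallest value
max(x)     # largest value
var(x)     # sample variance, equal to sd(x)^2
sum(x)     # total of all values
```

Each call returns exactly one value, however long x is.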
To apply one or more of these summary functions to a data frame, indicate in summarise() which function(s) you want to apply and to which variable of the data frame. The syntax is:
summarise(dataframe, function1(variable), function2(variable), ...)
Alternatively, using pipes, the syntax is:
dataframe %>% summarise(function1(variable), function2(variable), ...)
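As a sketch of this syntax in practice, the example below assumes that dplyr is installed and uses a small hypothetical data frame called plants (its name, columns and values are made up for illustration). It also names each summary (for example mean_height = ...), which is optional but tends to give more readable column headers than the default ones.

```r
library(dplyr)

# Hypothetical data frame, not from the original text
plants <- data.frame(
  height = c(10.2, 11.5, 9.8, 12.1),
  weight = c(3.1, 3.4, 2.9, 3.8)
)

# One summary per argument; the name on the left becomes the column header
result <- plants %>%
  summarise(
    mean_height = mean(height),
    sd_height   = sd(height),
    max_weight  = max(weight)
  )

result
```

The result is a table with a single row and one column per summary function.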
Let’s use the data frame Orange as an example. The top of the data frame looks like this:
head(Orange)
To calculate the mean and the standard deviation of the variable circumference, we write either
summarise(Orange, mean(circumference), sd(circumference))
OR
Orange %>% summarise(mean(circumference), sd(circumference))
which both result in:
This example does not make much sense in terms of biology: we have averaged circumference across different trees while pooling measurements taken at 7 different time points. Instead, we could calculate the mean and standard deviation of circumference for each time point described in age by using group_by on the variable age (read more about group_by here).
To calculate the group means and standard deviations of the variable circumference, we write:
Orange %>% group_by(age) %>% summarise(mean(circumference), sd(circumference))
which results in:
Each row in the result table now shows the mean and standard deviation for one of the 7 distinct values of age, which are listed in the first column.
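A slightly extended sketch of the grouped summary is shown below, assuming dplyr is installed. The column names n_trees, mean_circ and sd_circ are illustrative choices, not part of the original example. Adding n() reports how many measurements contribute to each group, and the as.data.frame() call is only a precaution that drops the extra classes the Orange data set carries.

```r
library(dplyr)

# Orange ships with R; as.data.frame() strips its additional classes
by_age <- as.data.frame(Orange) %>%
  group_by(age) %>%
  summarise(
    n_trees   = n(),                  # number of rows in each age group
    mean_circ = mean(circumference),  # group mean
    sd_circ   = sd(circumference)     # group standard deviation
  )

by_age
```

Checking the group size alongside the mean and standard deviation is a simple way to confirm that every time point is based on the same number of trees.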