group_by() is a function that groups the cases (rows) of a table according to the distinct values (levels) of a chosen categorical variable. When used alone, it returns a new table in which those groups are registered; the data themselves are unchanged, so the output looks very similar to the original data frame. When used in combination with a second function in a pipe (read about pipes here), group_by() splits the data frame by group, applies the second function to each group, and finally reassembles the results into a new table.
Let’s use the data frame Orange as an example. The top of the data frame looks like this:
head(Orange)
Here, for example, we group Orange by age and store the result in the object Orange_grouped_by_age:
library(dplyr)  # provides group_by() and the pipe %>%
Orange_grouped_by_age <- Orange %>% group_by(age)
As you can see above, the data look unchanged, but R reports that there are 7 groups for the variable age (yellow box).
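The grouping can also be checked programmatically rather than read off the printed output. The following is a minimal sketch, assuming the dplyr package is installed; group_vars() and n_groups() are dplyr helpers that the original text does not mention.

```r
library(dplyr)

Orange_grouped_by_age <- Orange %>% group_by(age)

# Name(s) of the variable(s) the table is grouped by
group_vars(Orange_grouped_by_age)   # "age"

# Number of groups, i.e. the number of distinct values of age
n_groups(Orange_grouped_by_age)     # 7
```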
If we then want to calculate the mean of circumference for each value of age, we can do so by applying summarise(mean(circumference)) directly to Orange_grouped_by_age:
Orange_grouped_by_age %>% summarise(mean(circumference))
We thus obtain a new table whose 7 rows show the mean of circumference for each value of age.
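To make the split, apply and reassemble steps more concrete, the sketch below repeats the grouped summary with an explicitly named result column and compares it with the same calculation done by hand in base R. The names mean_by_age, mean_circumference and by_hand are chosen here for illustration only, and dplyr is assumed to be installed.

```r
library(dplyr)

# Grouped summary, naming the result column instead of using the default name
mean_by_age <- Orange %>%
  group_by(age) %>%
  summarise(mean_circumference = mean(circumference))

# The same steps by hand: split circumference by age, then take each mean
by_hand <- sapply(split(Orange$circumference, Orange$age), mean)

# Compare the two results (both are ordered by increasing age)
all.equal(mean_by_age$mean_circumference, unname(by_hand))
```

Naming the summary column makes it easier to refer to in later steps than the default column name mean(circumference).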
For comparison, this is what the same code does when applied to Orange (the original data frame without grouping):
Orange %>% summarise(mean(circumference))
Note that grouping is reversible: you can ungroup the data in a table with the function ungroup(). In our example, simply type:
ungroup(Orange_grouped_by_age)
As you can see, the line that used to show the groups is now gone.
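Keep in mind that ungroup() returns a new table and does not modify Orange_grouped_by_age itself, so the result has to be stored if it is needed later. A minimal sketch, assuming dplyr is installed; the object name Orange_ungrouped is hypothetical, and is_grouped_df() is a dplyr helper not used in the original text.

```r
library(dplyr)

Orange_grouped_by_age <- Orange %>% group_by(age)

# Store the ungrouped result in a new object
Orange_ungrouped <- ungroup(Orange_grouped_by_age)

is_grouped_df(Orange_grouped_by_age)  # TRUE: the original object is still grouped
is_grouped_df(Orange_ungrouped)       # FALSE: the new object carries no groups
```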