Using pipes in dplyr


Using pipes is a great way to make your life in R simpler by chaining functions. It helps you apply a sequence of functions to a data set to obtain a final result without having to take care of intermediate results (the results of each single function/operation in the sequence). But what does “chaining” mean?
 
Let’s take the analogy of a trip from A to F. This trip has four stops or waypoints before reaching the final destination: the first one at B, then C, then D, then E. To describe this trip, you may list each of the legs:

  • A->B,
  • B->C,
  • C->D,
  • D->E,
  • and E->F.

This is a bit tedious though. There are simpler ways to describe the trip, for example in the form of a chain:

  • A->F via B, C, D and E,
  • or A->B->C->D->E->F.

These two chains imply that you have actually stopped at each intermediate waypoint and then started your journey again to reach the next waypoint on the list. Note that the sequence of waypoints is implied in the syntax of the chain.
 
Chaining in R (and dplyr) consists of listing the functions to be applied to a data set, in a sequence that is implied by the syntax, without having to take care of intermediate results (the result of each function in the list). Using pipes, or piping, in R (and dplyr) consists of writing such chains with the operator %>%.
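
To make the idea concrete, here is a minimal sketch of what the operator does: it passes the value on its left as the first argument of the function on its right. The vector v and its values are made up for illustration, and the sketch assumes that the dplyr package is installed (loading it makes %>% available).

```r
library(dplyr)  # loading dplyr makes the %>% operator available

# Illustrative values only (not taken from the text)
v <- c(1, 4, 9, 16)

# v %>% sqrt() is another way of writing sqrt(v)
piped  <- v %>% sqrt()
nested <- sqrt(v)
identical(piped, nested)

# Extra arguments stay inside the parentheses:
# ... %>% round(1) is another way of writing round(..., 1)
identical(v %>% sqrt() %>% round(1), round(sqrt(v), 1))
```

Both calls to identical() compare a piped expression with its nested equivalent, so they are a quick way to check that the two notations describe the same computation.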
 
Here is a very simple, if not very meaningful, example. We start with a vector x made of 100 random values created with rnorm(). We first calculate the average of x using mean() and store it in y, then we calculate the square root of y and store it in z, and finally we round z to one decimal place. Traditionally we would have written:

x <- rnorm(100, mean = 50, sd = 5)
y <- mean(x)
z <- sqrt(y)
round(z, 1)

Of course we could have written this in the form of nested functions:

round(sqrt(mean(x)), 1)


and obtained the same result, but this looks rather messy and quite unreadable, especially if the sequence of functions is much longer. In addition, the sequence of events appears inverted: reading from left to right we meet round() first, then sqrt() and finally mean(), whereas the functions are actually applied in the opposite order.

Using pipes, this would have been written the following way:

library(dplyr)
x %>% mean() %>% sqrt() %>% round(1)


Compared with the nested version, this form is more readable and the sequence of events is more logical: first the mean, then the square root and finally the rounding. Back to the trip analogy, we could say that we started our journey from x, then went to mean(), then to sqrt(), and then to round(). The symbol %>% can thus be read as “then”.
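
As a quick check, the three ways of writing the calculation can be compared side by side. This is only a sketch: the call to set.seed() and the object names step_by_step, nested and piped are additions made here so that the example is reproducible and the results can be compared. It also assumes that dplyr is installed.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, chosen only to make the example reproducible
x <- rnorm(100, mean = 50, sd = 5)

# Step by step, with intermediate objects
y <- mean(x)
z <- sqrt(y)
step_by_step <- round(z, 1)

# Nested calls
nested <- round(sqrt(mean(x)), 1)

# Piped
piped <- x %>% mean() %>% sqrt() %>% round(1)

# Check that all three approaches return the same value
identical(step_by_step, nested) && identical(nested, piped)
```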

Throughout this chapter, we will study several of the most useful functions in dplyr. In most cases, we will apply some form of piping, sometimes using group_by() as the first element in the chain. We will go through the use of group_by() here.
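
As a small preview of what such a chain may look like, the sketch below groups a data set and then summarises each group. The choice of the built-in mtcars data set, of the columns cyl and mpg, and of the names mpg_by_cyl and mean_mpg is an assumption made for illustration only. It is not necessarily the example used later in the chapter.

```r
library(dplyr)

# mtcars ships with R; cyl = number of cylinders, mpg = miles per gallon
mpg_by_cyl <- mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))

mpg_by_cyl  # one row per distinct value of cyl
```

Read aloud, the chain says: take mtcars, then group it by cyl, then compute the mean of mpg within each group.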

Finally, in a similar way to ggplot2, we may write pipes over several lines to make the code clearer. Instead of writing:

x %>% mean() %>% sqrt() %>% round(1)

one would write:

x %>%
  mean() %>%
  sqrt() %>%
  round(1)
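
One practical detail when breaking a chain over several lines is where the line break goes: the operator %>% has to sit at the end of a line, not at the start of the next one. The sketch below illustrates this. The seed and the object name result are additions made here for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, chosen only to make the example reproducible
x <- rnorm(100, mean = 50, sd = 5)

# Works: each line ends with %>%, so R knows that the expression continues
result <- x %>%
  mean() %>%
  sqrt() %>%
  round(1)

# Does not work: R treats the first line (x) as a complete expression,
# and the next line then starts with %>%, which is a syntax error
# x
#   %>% mean()
```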