Entire data frames may be put together either beside each other (thus increasing the number of variables) or below each other (thus increasing the number of cases) into a single, large table. Here we focus on combining data frames beside each other.

One of the functions that can do such an operation is `bind_cols()`. Note that `bind_cols()` may be applied ONLY to data frames with equal length (number of cases). If the data frames are different in length, you will have to use another function such as `left_join()` or `right_join()` (among others), which may add `NA` wherever necessary.

To illustrate how `bind_cols()` works, we will use the data frames `Orange` and `Orange2` as examples. `Orange2` is a data frame similar to `Orange`, with the difference that the values of the variable `circumference` have been multiplied by 5 using the following line of code:

Orange2 <- Orange %>% mutate_at(vars(circumference), list(~.*5))

We thus have the following two data frames `Orange` and `Orange2`:

Note that the data frames have the same three variables: `Tree`, `age`, and `circumference`. We can combine the two data frames with the following code:

bind_cols(Orange, Orange2)

which gives us a table with 35 rows (like the original data frames), but 6 columns instead of 3.

As you may see here, `bind_cols()` does not try to recognize identical variables, but rather copies all the variables of `Orange2` to the right of the variables of `Orange`, and then adds a digit next to the name of a variable which has been encountered before, so that every column name stays unique (the exact form of the new names depends on the `dplyr` version).

`bind_cols()` does not rename variables whose names appear only once. Here is what happens when the data frames have (at least) one variable which is not common to both. Let’s use `Orange3`, which is similar to `Orange`, with the difference that the variable `circumference` has been renamed to `circumferenceNEW` using the following code:

Orange3 <- Orange %>% rename(circumferenceNEW = circumference)
head(Orange3)

bind_cols(Orange, Orange3)

We end up with a table made of 35 observations and 6 variables, but neither `circumference` nor `circumferenceNEW` has been renamed. Still, `Tree` and `age` appear twice.
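A quick way to confirm what `bind_cols()` has produced is to inspect the dimensions and the column names of the result. The sketch below assumes that `dplyr` is installed and rebuilds `Orange2` with `mutate()`, which should be equivalent to the `mutate_at()` call used earlier; the object name `combined` is only illustrative.

```r
library(dplyr)

# Rebuild Orange2: same as Orange, with circumference multiplied by 5
Orange2 <- Orange %>% mutate(circumference = circumference * 5)

# Place the two data frames side by side (rows are paired by position only)
combined <- bind_cols(Orange, Orange2)

dim(combined)    # number of rows and number of columns
names(combined)  # duplicated names have been made unique
```

Since rows are paired purely by position, it is worth checking that both tables are sorted in the same way before binding them.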

Entire data frames may be put together either beside each other (thus increasing the number of variables) or below each other (thus increasing the number of cases) into a single, large table. Here we focus on combining data frames below each other.

One of the functions that can do such an operation is `bind_rows()`. To illustrate how `bind_rows()` works, we will use the data frames `Orange` and `Orange2` as examples. `Orange2` is a data frame similar to `Orange`, with the difference that the values of the variable `circumference` have been multiplied by 5 using the following line of code:

Orange2 <- Orange %>% mutate_at(vars(circumference), list(~.*5))

We thus have the following two data frames `Orange` and `Orange2`:

Note that the data frames have the same three variables: `Tree`, `age`, and `circumference`. We can combine the two data frames with the following code:

bind_rows(Orange, Orange2)

which gives us a long table with 70 (2*35) rows, the first 35 rows being those from `Orange` and the last 35 rows being from `Orange2`:

The example above is rather simple since the variables in the data frames are the same. What if things were different?

Here is what happens when the data frames have (at least) one variable which is not common to both. Let’s use `Orange3`, which is similar to `Orange`, with the difference that the variable `circumference` has been renamed to `circumferenceNEW` using the following code:

Orange3 <- Orange %>% rename(circumferenceNEW = circumference)
head(Orange3)

bind_rows(Orange, Orange3)

We end up with a 70-row table that has one more column than the original data frames, and that contains both `circumference` and `circumferenceNEW`. On top of that, the value `NA` has been placed wherever an observation did not have a value for one of these two variables.
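When stacking tables, it is often useful to remember which table each row came from. As a sketch (assuming `dplyr` is installed), `bind_rows()` accepts an `.id` argument that adds such a column; the labels `original` and `renamed`, the column name `source` and the object name `stacked` are made up for this illustration.

```r
library(dplyr)

Orange3 <- Orange %>% rename(circumferenceNEW = circumference)

# Name the inputs and store those names in a new column called "source"
stacked <- bind_rows(original = Orange, renamed = Orange3, .id = "source")

# Count rows and missing values for each source table
stacked %>%
  group_by(source) %>%
  summarise(rows = n(),
            missing_circumference = sum(is.na(circumference)),
            missing_circumferenceNEW = sum(is.na(circumferenceNEW)))
```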

There is a family of four functions that allows for joining variables from two tables X and Y while matching values to the rows they correspond to. In other words, the functions check whether there are common rows and columns, before putting the data together. Depending on which of these functions you plan to use, you will be able to restrict the output table to only the observations that were found in X, only the observations that are common to both X and Y, only the observations that were found in Y, or retain all observations. These are called *mutating joins*, and the four functions are:

- `left_join()`: join data while matching values from Y to X, keeping only the observations found in X,
- `right_join()`: join data while matching values from X to Y, keeping only the observations found in Y,
- `inner_join()`: join data, keeping only the observations which are common to both tables,
- `full_join()`: join data, keeping all the observations.

The icons found in RStudio’s cheat sheet *Data Transformation with dplyr* are of great visual help when trying to figure out the output of these “join” functions. We will keep them here as reference. These icons show what happens to 2 input tables X and Y (represented below) when handled by the 4 functions.

We will illustrate the use and purpose of the functions in practice using `OrangeX` and `OrangeY`, which are modified fragments of the data frame `Orange`. These fragments are generated using the following code:

OrangeX <- Orange %>% filter(Tree == "1" | Tree == "2" | Tree == "3")
OrangeY <- Orange %>%
  filter(Tree == "3" | Tree == "4" | Tree == "5") %>%
  mutate(double_circumference = circumference*2) %>%
  select(Tree, age, double_circumference)

and they look like this:

NB: as you may see here above, `OrangeX` and `OrangeY` have two variables in common: `Tree` and `age`; in addition, there are 7 common rows where `Tree` equals 3.

`left_join()` returns a table with all the observations from the table X, and keeps all the variables found in X and Y. It keeps observations from Y only if they are found in X. The observations in X with no match in Y will have `NA` values in the columns that come from Y. If there are multiple matches between X and Y, all combinations of the matches are returned.

Let’s see that with `OrangeX` and `OrangeY`:

left_join(OrangeX, OrangeY)

Here `left_join()` has restricted the output table to the 21 observations that are found in `OrangeX`, and has added the column `double_circumference` found in `OrangeY`. `left_join()` has also found that there are 7 observations that are common to `OrangeX` and `OrangeY`, where `Tree` equals 3 and for which values in `age` match too. Therefore, it displays these 7 common observations with values for both `circumference` and `double_circumference`. Note that the other 14 observations now have the value `NA` in `double_circumference` since none of them was found in `OrangeY`.
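By default, the join functions match on all the variables that the two tables have in common and print a message naming them. A minimal sketch, assuming `dplyr` is installed: the `by` argument states the matching variables explicitly, which makes the code easier to read and silences the message. The object name `joined` is illustrative.

```r
library(dplyr)

# Rebuild the two fragments of Orange used in this section
OrangeX <- Orange %>% filter(Tree == "1" | Tree == "2" | Tree == "3")
OrangeY <- Orange %>%
  filter(Tree == "3" | Tree == "4" | Tree == "5") %>%
  mutate(double_circumference = circumference * 2) %>%
  select(Tree, age, double_circumference)

# Match rows on Tree and age explicitly
joined <- left_join(OrangeX, OrangeY, by = c("Tree", "age"))

nrow(joined)                             # one row per observation of OrangeX
sum(is.na(joined$double_circumference))  # rows of OrangeX without a match in OrangeY
```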

`right_join()` returns a table with all the observations from the table Y, and keeps all the variables found in X and Y. It keeps observations from X only if they are found in Y. The observations in Y with no match in X will have `NA` values in the columns that come from X. If there are multiple matches between X and Y, all combinations of the matches are returned.

Let’s see that with `OrangeX` and `OrangeY`:

right_join(OrangeX, OrangeY)

Here `right_join()` has restricted the output table to the 21 observations that are found in `OrangeY`, and has added the column `circumference` found in `OrangeX`. `right_join()` has also found that there are 7 observations that are common to `OrangeX` and `OrangeY`, where `Tree` equals 3 and for which values in `age` match too. Therefore, it displays these 7 common observations with values for both `circumference` and `double_circumference`. Note that the other 14 observations now have the value `NA` in `circumference` since none of them was found in `OrangeX`.

`inner_join()` returns a table with only the observations found BOTH in X and Y, and keeps all the variables found in X and Y. If there are multiple matches between X and Y, all combinations of the matches are returned.

Let’s see that with `OrangeX` and `OrangeY`:

inner_join(OrangeX, OrangeY)

Here `inner_join()` has restricted the output table to the 7 observations that are found BOTH in `OrangeX` and `OrangeY`, and has kept all the variables found in `OrangeX` and `OrangeY`. It now displays these 7 common observations with values for both `circumference` and `double_circumference`.

`full_join()` returns a table with all observations and all variables from both X and Y, regardless of potential matches. Where an observation has no match in the other table, `NA` fills in the missing values.

Let’s see that with `OrangeX` and `OrangeY`:

full_join(OrangeX, OrangeY)

Here `full_join()` has opened the output table to all 21 observations that are found in `OrangeX` and all 21 observations found in `OrangeY`. It has detected 7 common observations where `Tree` equals 3 and for which values in `age` match too, and has thus merged them. `full_join()` has kept all the variables found in `OrangeX` and `OrangeY`. It now displays the resulting 35 observations (21 + 21 - 7) with columns for both `circumference` and `double_circumference`, and with `NA` whenever a value is missing from one of the tables.
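To compare the four mutating joins at a glance, we can count the rows that each of them returns. This is only a sketch (assuming `dplyr` is installed) that rebuilds `OrangeX` and `OrangeY` as in this section; `join_rows` is an illustrative name.

```r
library(dplyr)

OrangeX <- Orange %>% filter(Tree == "1" | Tree == "2" | Tree == "3")
OrangeY <- Orange %>%
  filter(Tree == "3" | Tree == "4" | Tree == "5") %>%
  mutate(double_circumference = circumference * 2) %>%
  select(Tree, age, double_circumference)

# Number of rows returned by each join, matching on Tree and age
join_rows <- c(
  left  = nrow(left_join(OrangeX, OrangeY, by = c("Tree", "age"))),
  right = nrow(right_join(OrangeX, OrangeY, by = c("Tree", "age"))),
  inner = nrow(inner_join(OrangeX, OrangeY, by = c("Tree", "age"))),
  full  = nrow(full_join(OrangeX, OrangeY, by = c("Tree", "age")))
)
join_rows
```

With these data the counts should be 21, 21, 7 and 35, in line with the outputs described in this section.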

The function `summarize()` (which may also be written `summarise()`) creates a table in which you will find the result(s) of the summary function(s) you have chosen to apply to a data frame. The summary functions may be:

- `mean()`: which returns the mean of a variable,
- `sd()`: which returns the standard deviation of a variable,
- `median()`: which returns the median of a variable,
- `min()`: which returns the minimum value of a variable,
- `max()`: which returns the maximum value of a variable,
- `var()`: which returns the variance of a variable,
- `sum()`: which returns the sum of a variable,
- etc.

To apply one or more of these summary functions to a data frame, you just have to indicate in `summarise()` which function(s) you want to apply and on which variable of the data frame. The syntax is:

summarise(dataframe, function1(variable), function2(variable), ...)

Alternatively, using pipes, the syntax is:

dataframe %>% summarise(function1(variable), function2(variable), ...)

Let’s use the data frame `Orange` as an example. The top of the data frame looks like this:

head(Orange)

To calculate the mean and the standard deviation of the variable `circumference`, we write either

summarise(Orange, mean(circumference), sd(circumference))

OR

Orange %>% summarise(mean(circumference), sd(circumference))

which both result in:

This example actually does not make much sense in terms of biology. Indeed, we have calculated the average circumference across different trees while pooling measurements performed at 7 different time points… Instead we could calculate the average circumference and standard deviation for each time point described in `age` by using `group_by()` on the variable `age` (read more about `group_by()` here).

To calculate the group means and standard deviations of the variable `circumference`, we write:

Orange %>% group_by(age) %>% summarise(mean(circumference), sd(circumference))

which results in:

Each line in the result table now shows the mean and standard deviation for each of the 7 distinct values of `age` listed in the first column.
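The columns of such a summary table are easier to reuse when they are given explicit names. The following sketch (assuming `dplyr` is installed) names each summary and adds the number of observations per group with `n()`; the column names and the object name `age_summary` are illustrative choices.

```r
library(dplyr)

age_summary <- Orange %>%
  group_by(age) %>%
  summarise(mean_circumference = mean(circumference),  # group mean
            sd_circumference   = sd(circumference),    # group standard deviation
            n_trees            = n())                  # observations per group

age_summary
```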

`group_by()` is a function that groups the cases (rows) of the table according to the different values (levels) of a chosen categorical variable. When used alone, it transforms a data frame into a new table where the rows are registered as grouped by the values of the chosen variable. The output table is then very similar to the original data frame. When used in combination with a second function in a pipe (read about pipes here), `group_by()` splits the data frame by group, applies the second function to each of the corresponding groups, and finally reassembles the data into a new table.

Let’s use the data frame `Orange` as an example. The top of the data frame looks like this:

head(Orange)

Here, for example, we group `Orange` by `age` and store the result in the object `Orange_grouped_by_age`:

Orange_grouped_by_age <- Orange %>% group_by(age)

As you see above here, the data look unchanged, but R reports that there are 7 groups for the variable `age` (see the `Groups:` line at the top of the output).

If we then decide to calculate the mean of `circumference` for each value of `age`, we may do so by applying `summarise(mean(circumference))` directly on `Orange_grouped_by_age`:

Orange_grouped_by_age %>% summarise(mean(circumference))

We thus obtain a new table where the 7 rows show the mean of `circumference` for each value of `age`.

For comparison, this is what the same code does when applied to `Orange` (the original data frame without grouping):

Orange %>% summarise(mean(circumference))

Note that grouping is reversible, and that you may ungroup data in a table by using the function `ungroup()`. In our example, simply type:

ungroup(Orange_grouped_by_age)

As you may see, the line that used to show the groups is now gone.
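Besides reading the printed output, the grouping status of a table can be checked programmatically. The sketch below assumes `dplyr` is installed and uses its helpers `group_vars()`, `n_groups()` and `is_grouped_df()`; the object name `Orange_ungrouped` is illustrative.

```r
library(dplyr)

Orange_grouped_by_age <- Orange %>% group_by(age)

group_vars(Orange_grouped_by_age)  # name(s) of the grouping variable(s)
n_groups(Orange_grouped_by_age)    # number of groups

# Remove the grouping and check the status of both tables
Orange_ungrouped <- ungroup(Orange_grouped_by_age)
is_grouped_df(Orange_grouped_by_age)
is_grouped_df(Orange_ungrouped)
```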

`count()` does exactly what it says: it counts the number of cases! Applied directly to a data frame, `count()` will provide you with the number `n` of cases. Applied to a table which has been pre-grouped with `group_by()` (read more about `group_by()` here) or in a pipe in combination with `group_by()`, it will give you the number of cases `n` for each group.

Let’s illustrate this with the data frame `Orange`:

Orange %>% count()

The table shows the single value 35 which matches the number of observations in the original data frame.

Let’s now see what happens when we apply it in combination with `group_by()`:

Orange %>% group_by(age) %>% count()

The result table indeed shows the number of observations for each value of the variable `age`.

And the same happens when we apply `count()` to a pre-grouped table such as `Orange_grouped_by_age`:

Orange_grouped_by_age <- Orange %>% group_by(age)
Orange_grouped_by_age %>% count()

Finally, note that we may apply `count()` to a table where a variable is already organised in groups while specifying another variable between the parentheses of `count()`. In this case, the resulting table will show the number of cases for each of the combinations of variables. Let’s illustrate this again with the already-grouped table `Orange_grouped_by_age` and let’s apply `count(Tree)`:

Orange_grouped_by_age <- Orange %>% group_by(age)
Orange_grouped_by_age %>% count(Tree)

And indeed, the table shows that there is only 1 observation per `age` and per `Tree`.
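As a shortcut, the grouping variable(s) may also be given directly to `count()`, without a separate `group_by()` step. A minimal sketch, assuming `dplyr` is installed (the object names are illustrative):

```r
library(dplyr)

# Number of observations for each value of age
counts_by_age <- Orange %>% count(age)

# Number of observations for each combination of age and Tree
counts_by_age_and_tree <- Orange %>% count(age, Tree)

counts_by_age
counts_by_age_and_tree
```

The counts are the same as with `group_by()` followed by `count()`; one difference is that the table returned by `count(age)` on an ungrouped data frame is itself ungrouped.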

`dplyr` has a handful of functions that allow for cleaning a data set by selecting a specific subset of observations. These are the functions we will look at here:

- `filter()`: extract rows that meet logical criteria
- `slice()`: extract rows by position
- `top_n()`: extract the rows containing the n highest/lowest values for a given variable
- `top_frac()`: extract a fraction of a data set where rows contain the highest/lowest values for a variable
- `sample_n()`: extract n randomly-selected rows
- `sample_frac()`: extract a fraction of randomly-selected rows
- `distinct()`: keep only unique rows (remove duplicates)

Let’s use the data frame `Orange` as an example. The top of the data frame looks like this:

head(Orange)

`filter()` allows you to retrieve from a data frame the rows (observations) which match logical criteria. Logical criteria are operations for which the result is TRUE or FALSE and which contain logical operators such as `>`, `<`, `>=`, `<=`, `==`, `!=`, etc. For instance, to retrieve the observations that concern tree #4, we may write:

Orange %>% filter(Tree == 4 )

which results in:

And to retrieve the observations that were performed since age 1004:

Orange %>% filter(age >= 1004)

We may even combine logical criteria:

Orange %>% filter(age >= 1004 & Tree == 4)
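Criteria can also be combined with `|` (or), and when the same variable is compared with several values, the base R operator `%in%` is a more compact alternative. The sketch below (assuming `dplyr` is installed; the object names are illustrative) shows that both forms select the same rows.

```r
library(dplyr)

# Trees 1, 2 or 3, written with "or" conditions
with_or <- Orange %>% filter(Tree == "1" | Tree == "2" | Tree == "3")

# The same selection written with %in%
with_in <- Orange %>% filter(Tree %in% c("1", "2", "3"))

nrow(with_or)
nrow(with_in)
```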

`slice()` allows you to retrieve from a data frame the rows (observations) which have a given position in this data frame.

Here is how to retrieve the rows 8, 9, 10 and 11, regardless of their content:

Orange %>% slice(8:11)

If `slice()` is only provided with a single value, it will pick up the corresponding row in the data frame:

Orange %>% slice(9)

If `slice()` is provided with negative values, it will return the data frame without the corresponding rows:

Orange %>% slice(-1:-7)

`top_n()` allows for retrieving the top or bottom *n* values in a data frame according to a given variable. If `top_n()` is only given a positive value *n* but no variable, it picks up the top *n* observations while considering the last variable in the data frame (and will indicate it in red):

Orange %>% top_n(10)

As expected, the console returns the 10 observations for which `circumference` (the last variable) is highest, and indicates “Selecting by circumference”.

When `top_n()` is given a negative value *n*, it retrieves the bottom *n* values:

Orange %>% top_n(-10)

When `top_n()` is provided with a variable, it will extract the *n* observations with the highest/lowest values in that variable:

Orange %>% top_n(-10, circumference)

Note that `top_n()` may provide you with more rows than the *n* you expected. If there are ties (i.e. several rows with equal values for the given variable), the function will give you all the observations matching the criteria. In the following example, we expect only 3 rows to be extracted by `top_n()`:

Orange %>% top_n(3, age)

However, the result table comes with 5 observations, for all of which `age` equals 1582, the highest value in the variable.
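In `dplyr` 1.0.0 and later, `top_n()` is superseded by `slice_max()` and `slice_min()`, which make the handling of ties explicit through the argument `with_ties`. The following sketch assumes such a version is installed; the object names are illustrative.

```r
library(dplyr)

# Default behaviour: ties are kept, so more than n rows may be returned
with_ties <- Orange %>% slice_max(age, n = 3)

# with_ties = FALSE returns exactly n rows
without_ties <- Orange %>% slice_max(age, n = 3, with_ties = FALSE)

nrow(with_ties)
nrow(without_ties)
```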

`top_frac()` works in a similar way to `top_n()`. It ranks the rows of the data frame by a given variable, and retrieves the rows with the top or bottom values. This time, unlike `top_n()` which retrieves a given number of rows, `top_frac()` retrieves a given percentage/fraction of the number of rows. In the following example, we ask for the top 20% of the rows ranked by the variable `circumference`:

Orange %>% top_frac(.2, circumference)

As expected, we get 7 out of 35 observations.

Again, if no variable is mentioned in `top_frac()`, the last variable is considered. If a negative value is given, the bottom rows will be selected. And if there are ties, all the observations with equal values are provided:

Orange %>% top_frac(-.2, age)

Here for instance, we got the 10 bottom rows when we only expected 7.

`sample_n()` extracts a number *n* of rows which have been randomly picked from the data frame. Here is an example:

Orange %>% sample_n(5)

Since the selection is random, running the same code twice will most likely provide you with two different samples:

When randomly retrieving a large number of observations from a data frame, you may be given the same observation twice. `sample_n()` avoids this by using *by default* the argument `replace = FALSE` (i.e. you do not need to write the argument to make sure that all the picked observations are different). However, if, for some reason, you want to accept duplicates in your sample, you must add the argument `replace = TRUE`.
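If you need the same random sample every time the code is run (for instance in a report), one common approach is to fix the seed of R's random number generator with `set.seed()` just before sampling. This is a sketch assuming `dplyr` is installed; the seed value 123 and the object names are arbitrary.

```r
library(dplyr)

# Fixing the seed makes the random selection reproducible
set.seed(123)
first_draw <- Orange %>% sample_n(5)

set.seed(123)
second_draw <- Orange %>% sample_n(5)

# Both draws contain the same rows
all.equal(first_draw, second_draw)

# With replace = TRUE, the sample may even be larger than the data frame
resampled <- Orange %>% sample_n(50, replace = TRUE)
nrow(resampled)
```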

`sample_frac()` extracts a given fraction/percentage of the rows, which are randomly picked from the data frame. Here is an example:

Orange %>% sample_frac(.5)

Again, since the selection is random, there is very little chance that running the same code twice will provide you with two identical samples.

When randomly retrieving a large fraction of a data frame, you may be given the same observation twice. `sample_frac()` avoids this by using *by default* the argument `replace = FALSE` (i.e. you do not need to write the argument to make sure that all the picked observations are different). However, if, for some reason, you want to accept duplicates in your sample, you must add the argument `replace = TRUE`.

`distinct()` is a function that checks the data frame for duplicate values for a (combination of) given variable(s) and thus returns only unique observations. In the following example, we will check `Orange` and retrieve only unique observations based on the variable `circumference`:

Orange %>% distinct(circumference)

Out of the 35 original observations, `distinct()` has retrieved 30 rows, and thus discarded 5 rows where non-unique values were found in `circumference`. Note that the present table only shows the value in the column `circumference`, not the whole row. To keep the whole row and thus display all the corresponding variables, add the argument `.keep_all = TRUE` to your code:

Orange %>% distinct(circumference, .keep_all = TRUE)

Both `Tree` and `age` now appear next to `circumference` in the output table.

Doing the same for `Tree` instead of `circumference` reduces the output even more:

Orange %>% distinct(Tree, .keep_all = TRUE)

Only 5 observations have been kept by `distinct()`, and you can see that these were not randomly selected, but were the first rows from the top (based on `age`). You should thus be careful when using `distinct()` to “clean” your data frame as you may discard rows with potentially interesting content based on the fact that they had duplicate values in another column…

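`distinct()` always keeps the first row it encounters for each value, so to keep another set of rows you need to restrict (or reorder) the data frame first. Note that adding `age = 1372` inside `distinct()` does not select rows: it overwrites `age` with the constant 1372 while still keeping the first row of each tree. Here, we instead use `filter()` to keep the observations for which `age` equals 1372, instead of those containing 118, before applying `distinct()`:

Orange %>% filter(age == 1372) %>% distinct(Tree, .keep_all = TRUE)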
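Because `distinct()` keeps the first row it meets for each value, another way to control which row is kept is to sort the data frame beforehand. The sketch below (assuming `dplyr` is installed; `latest_per_tree` is an illustrative name) sorts by decreasing `age` so that the most recent measurement of each tree is retained.

```r
library(dplyr)

latest_per_tree <- Orange %>%
  arrange(desc(age)) %>%             # most recent measurements first
  distinct(Tree, .keep_all = TRUE)   # keep the first row found for each tree

latest_per_tree
```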

The `dplyr` function `arrange()` allows for reordering the rows of data frames and tables based on the content of one or more variables. The function is quite simple and sorts in ascending order by default.

Here is an example where the data frame is sorted by the variable `age`:

Orange %>% arrange(age)

To sort the data frame in *descending order*, we add the helper function `desc()`:

Orange %>% arrange(desc(age))

It is possible to sort a data frame according to more than one variable, in which case the sorting process follows the order of the variables between the parentheses of `arrange()`. In addition, it is possible to decide for each variable whether the order will be ascending or descending:

Orange %>% arrange(desc(age), circumference)