Make New Variables

dplyr and tidyverse (a larger collection of packages which includes dplyr) make it possible to add columns (and thus variables) to an existing data frame. You may add a new column either from new data (usually a vector, another table or data frame, or a simple series of numbers) or by computing it from one or several columns of the existing data frame (e.g. the sum or mean of several variables located on the same row).

Here, we will review three ways to add new columns from new/external data. You may read more about how to compute new columns based on the content of one or several columns from the existing data frame HERE.

NB: some of these functions belong to dplyr, while others belong to other tidyverse packages (add_column(), for instance, comes from tibble). Should any of the following functions return an error message such as “could not find function …”, simply load tidyverse using:

library(tidyverse)

The three approaches are:

add_column(): add a new column based on a series of numbers, the content of a vector, etc.
bind_cols(): add new columns based on the content of another data frame or table.
mutating join: add new columns based on the content of another data frame or table.

We also introduce a function which is useful when creating new variables:

rename(): rename a variable.

Add a new column based on a series of numbers, the content of a vector, etc. with add_column()

In its simplest form, add_column() creates a new column at the end of the existing data frame, and adds the variable name and the data that you provide between the parentheses.

Let’s use the data frame Orange as an example. The top of the data frame looks like this:

head(Orange)

Here we want to add a column called NEW and for which the data are the integers ranging from 1 to 35 (i.e. the series 1:35):

Orange %>% add_column(NEW = 1:35)

If we want to place the new column somewhere other than at the end of the table, we indicate its position with .before = or .after =:

Orange %>% add_column(NEW = 1:35, .after = "age")

Finally, it is possible to use a vector containing the data to add to the data frame. Here, we use a vector called vector that contains the series 1:35:

vector <- 1:35
Orange %>% add_column(NEW = vector)
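As a minimal sketch of the positioning arguments, the following places a new column before age instead of after it and then checks the resulting column order. The column name ID and the object name with_id are illustrative choices, not names used elsewhere on this page.

```r
library(tibble)  # add_column() is provided by tibble, which tidyverse loads

# Orange ships with R; the new column must match its number of rows
with_id <- add_column(Orange, ID = seq_len(nrow(Orange)), .before = "age")

# Inspect the column order: ID now sits between Tree and age
names(with_id)
```

Using seq_len(nrow(Orange)) rather than a hard-coded 1:35 keeps the example valid if the data frame changes length.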

Add new columns based on the content of another data frame or table with bind_cols()

The function bind_cols() adds the columns of the data frames indicated between parentheses next to each other, starting from the left. We will see how this works using Orange and a second data frame called df for which the code is:

df <- data.frame(x = 1:35, y = sample(1:1000, 35))
df

Note that df must have as many rows as Orange (see further down what happens otherwise).

To bind them together, we write:

bind_cols(Orange, df)

In the resulting table, the left columns are those from Orange while the right columns are those from df.

You may bind together only a selection of variables from each data frame. To do so, supply the position of the variable in its respective data frame between []:

bind_cols(Orange[1:2], df[2])

Here, the first two variables of Orange have been bound to the second variable (column) of df.
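Selecting by position can be fragile if the column order changes. As a hedged alternative, the sketch below selects the same columns by name and confirms that both approaches yield the same variables. The call to set.seed() and the object names by_position and by_name are additions made here for illustration only.

```r
library(dplyr)

set.seed(1)  # makes the random column y reproducible
df <- data.frame(x = 1:35, y = sample(1:1000, 35))

# Selection by position, as in the text
by_position <- bind_cols(Orange[1:2], df[2])

# The same selection written with column names
by_name <- bind_cols(Orange[c("Tree", "age")], df["y"])

names(by_name)
identical(names(by_position), names(by_name))
```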

The examples above were based on two data frames of equal length (same number of rows). What happens when the lengths are not equal? Let’s try with a shorter data frame df2 for which the code is:

df2 <- data.frame(x = 1:15, y = sample(1:1000, 15))
df2

We bind them with bind_cols():

bind_cols(df, df2)

and obtain an error instead of a combined table. Indeed, binding data frames side by side requires that they have the same number of rows.
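One way to handle this situation in a script is to compare the row counts before binding, or to catch the failure with tryCatch(). The following is only a sketch: the data frames use fixed values rather than sample(), and the object names combined and result are hypothetical.

```r
library(dplyr)

df  <- data.frame(x = 1:35, y = 35:1)
df2 <- data.frame(x = 1:15, y = 15:1)

# Option 1: check the number of rows before attempting to bind
if (nrow(df) == nrow(df2)) {
  combined <- bind_cols(df, df2)
} else {
  combined <- NULL
  message("Row counts differ: ", nrow(df), " vs ", nrow(df2))
}

# Option 2: attempt the bind and capture the error it raises
result <- tryCatch(
  bind_cols(df, df2),
  error = function(e) "bind_cols() failed"
)
result
```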

Renaming variables with rename()

You may rename columns by using rename(). For each column, write first the new name of the variable followed by = and the original name:

bind_cols(Orange, df) %>%
  rename(Var1 = Tree, Var2 = age, Var3 = circumference, Var4 = x, Var5 = y)
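Not every column has to be renamed. As a small sketch, the following renames only two variables of Orange and leaves the third untouched. The new names tree_id and age_days are illustrative.

```r
library(dplyr)

# new_name = old_name; columns that are not mentioned keep their names
renamed <- Orange %>%
  rename(tree_id = Tree, age_days = age)

names(renamed)
```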

Add new columns based on the content of another data frame or table by using a mutating join

There is a family of four functions that allow for joining variables from two tables X and Y while matching values to the rows they correspond to. In other words, the functions check whether there are common rows and columns before putting the data together. Using some of these functions may thus not only add columns, but also add rows. This family of functions is described in detail HERE.
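To give a first impression, here is a minimal sketch using two of dplyr's mutating joins, left_join() and full_join(). The two small tables measurements and sites, and all of their values, are invented for this illustration.

```r
library(dplyr)

# Hypothetical tables sharing the key column "tree"
measurements <- data.frame(
  tree = c("A", "A", "B", "C"),
  circumference = c(30, 58, 33, 30)
)
sites <- data.frame(
  tree = c("A", "B", "D"),
  site = c("north", "south", "east")
)

# left_join() keeps every row of the first table and adds the column site;
# tree C has no match in sites, so its site is NA
kept_all <- left_join(measurements, sites, by = "tree")

# full_join() keeps the keys of both tables, so a row for tree D is added
all_keys <- full_join(measurements, sites, by = "tree")

kept_all
all_keys
```

Unlike bind_cols(), the joins match rows by the values in the key column rather than by position, so the two tables do not need the same number of rows.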
