7. Building a dataframe from vectors/series


It is fairly easy to build a dataframe from scratch in R. All you need is a set of vectors containing your data and a good recipe! And very often, a good recipe is a simple one…

First you have to decide how many columns this dataframe will have, and what their content will be. Then, you need to create a vector for each column and give it a name. You must be clear about which of these vectors are factors and define them as such with the function factor(). Finally, you store these vectors in the new dataframe using the function data.frame().

NB: this recipe has been used several times in the ANOVA chapter to create the dataframes used as examples.

Let’s take an example: we want to build a dataframe storing data about 8 mice, 4 from a batch called “batchA” and 4 from “batchB”. There are 2 males and 2 females in each batch. The response data stored in the dataset is their weight, measured at both the beginning and the end of an experiment. Finally, these mice have “names” and these names are “mouse1”, “mouse2”, …, “mouse8”.

 

Step 1: define the names of the variables

Let’s make the column names clear from the beginning. We’ll have 5 columns:

  • mouse.ID: the names of the individuals (“mouse1”, “mouse2”, …, “mouse8”),
  • batch: the batch from which these individuals come (“batchA”, “batchB”),
  • gender: the gender of the individual (“male”, “female”)
  • weight.beginning: the weight of the individual as recorded at the beginning of the experiment,
  • weight.end: the weight of the individual as recorded at the end of the experiment.

Out of these 5 columns, 2 correspond to response variables: weight.beginning and weight.end. 2 columns correspond to factors: batch and gender. The remaining column will simply contain the ID of the individual. Finally, we decide that the order of the columns is as follows: mouse.ID, batch, gender, weight.beginning, weight.end.

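This is what the dataset looks like when laid out as a table (for example in MS Excel), with the data:

mouse.ID   batch    gender   weight.beginning   weight.end
mouse1     batchA   male     276                360
mouse2     batchA   male     285                364
mouse3     batchA   female   228                264
mouse4     batchA   female   232                270
mouse5     batchB   male     283                370
mouse6     batchB   male     289                361
mouse7     batchB   female   220                268
mouse8     batchB   female   223                269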

 

Step 2: prepare the vectors

The vector mouse.ID contains the entries with “mouse” directly followed by a number ranging from 1 to 8. This can be simply coded in the following way:

[code language=”r”]
mouse.ID <- sprintf("mouse%d", 1:8)
mouse.ID
[/code]
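As a small sketch of an alternative, the same labels can also be built with paste0(), which glues its arguments together without a separator. The name mouse.ID.alt is only used here for illustration and is not part of the recipe.

```r
# Build the labels "mouse1" ... "mouse8" with sprintf(), as in the text
mouse.ID <- sprintf("mouse%d", 1:8)

# Alternative (illustrative name): paste0() concatenates without a separator
mouse.ID.alt <- paste0("mouse", 1:8)

# Check that both approaches give the same character vector
identical(mouse.ID, mouse.ID.alt)
```

Either form works; sprintf() becomes handier when the number needs padding, for instance with the format "mouse%02d".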


 

The vector batch contains the entries “batchA” and “batchB”. Four instances of each can be generated with rep(), which replicates its argument a given number of times:

[code language=”r”]
batch <- c(rep("batchA",4), rep("batchB", 4))
batch
[/code]
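The two calls to rep() can probably be collapsed into one by using its each argument, which repeats every element of the input vector in turn. The sketch below compares both forms; batch.alt is an illustrative name.

```r
# Two separate rep() calls joined with c(), as in the text
batch <- c(rep("batchA", 4), rep("batchB", 4))

# Alternative (illustrative name): repeat each element 4 times in one call
batch.alt <- rep(c("batchA", "batchB"), each = 4)

# Check that both vectors are the same
identical(batch, batch.alt)
```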


 

The vector gender, containing the entries “male” and “female”, is also easily coded with rep(). Here, we ask R to put two “male” and two “female”, and to repeat that once more:

[code language=”r”]
gender <- rep(c(rep("male",2), rep("female",2)),2)
gender
[/code]
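The nested rep() calls can also be written as a single call that combines the each and times arguments: each is applied first (two of every element), then the whole pattern is repeated times times. This is only a sketch; gender.alt is an illustrative name.

```r
# Nested rep() calls, as in the text
gender <- rep(c(rep("male", 2), rep("female", 2)), 2)

# Alternative (illustrative name): 2 of each entry, and the pattern repeated twice
gender.alt <- rep(c("male", "female"), each = 2, times = 2)

# Check that both vectors are the same
identical(gender, gender.alt)

# Count how many times each entry occurs
table(gender.alt)
```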


 

The vectors weight.beginning and weight.end containing the two response variables may be entered as simple numerical vectors, one for each variable:

[code language=”r”]
weight.beginning <- c(276, 285, 228, 232, 283, 289, 220, 223)
weight.end <- c(360, 364, 264, 270, 370, 361, 268, 269)
weight.beginning
weight.end
[/code]


At this point, we have all the vectors that we need. There is only one thing left to check before making the dataframe: that all vectors have the same number of items! If one of them has a length that does not fit, data.frame() will stop with an error message (or, if the shorter length divides the longer one evenly, silently recycle the shorter vector, which is rarely what you want). To check this, use length(), which returns the number of items in the indicated vector:

[code language=”r”]
length(mouse.ID)
length(batch)
length(gender)
length(weight.beginning)
length(weight.end)
[/code]


All are equal to 8. We are good to go.
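With many vectors, the length check can be done in one go. The sketch below collects all lengths with sapply() and then shows what happens when a vector does not fit; the names all.lengths, too.short and result are hypothetical and only serve the illustration.

```r
# The five vectors from the text
mouse.ID <- sprintf("mouse%d", 1:8)
batch <- c(rep("batchA", 4), rep("batchB", 4))
gender <- rep(c(rep("male", 2), rep("female", 2)), 2)
weight.beginning <- c(276, 285, 228, 232, 283, 289, 220, 223)
weight.end <- c(360, 364, 264, 270, 370, 361, 268, 269)

# Collect the lengths of all five vectors in one named vector
all.lengths <- sapply(
  list(mouse.ID = mouse.ID, batch = batch, gender = gender,
       weight.beginning = weight.beginning, weight.end = weight.end),
  length
)
all.lengths

# TRUE when every vector has the same number of items
length(unique(all.lengths)) == 1

# A hypothetical vector with 3 items does not fit into 8 rows:
# data.frame() stops with an error, which try() captures here
too.short <- c(1, 2, 3)
result <- try(data.frame(mouse.ID, too.short), silent = TRUE)
inherits(result, "try-error")
```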

 

Step 3: set the factors

This is not a mandatory step per se but it is good practice to do it anyway since it may avoid problems.

Both batch and gender are categorical variables and should as such be recorded in the dataframe as factors. As of now, they are just character vectors: class(batch) and class(gender) both return “character”.

We thus have to use factor() to convert them to factors, and we use class() to check that they have been converted properly:

[code language=”r”]
batch <- factor(batch)
gender <- factor(gender)
class(batch)
class(gender)
[/code]
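By default, factor() sorts the levels alphabetically, so “female” ends up before “male”. If a specific order matters (for plots or for the reference level of a model), the levels argument lets you set it. The sketch below illustrates this; gender.default and gender.ordered are illustrative names, and putting “male” first is an arbitrary choice.

```r
# The gender vector from the text, still a character vector at this point
gender <- rep(c(rep("male", 2), rep("female", 2)), 2)
class(gender)

# Default behaviour: levels are sorted alphabetically
gender.default <- factor(gender)
levels(gender.default)

# Explicit level order (arbitrary choice for illustration): "male" first
gender.ordered <- factor(gender, levels = c("male", "female"))
levels(gender.ordered)
```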


 

Step 4: build the dataframe

Now it is time to build the dataframe. We’ll use the function data.frame() to assemble all vectors as columns automatically and store the whole dataset under the name my.experiment.df (it is good practice, but not mandatory, to write .df at the end of the name to remember that it is a dataframe):

[code language=”r”]
my.experiment.df <- data.frame(mouse.ID, batch, gender, weight.beginning, weight.end)
my.experiment.df
[/code]


… and that’s it!
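A slightly more explicit variant of the same call, given here as a sketch, passes the columns as name = value pairs so that the column names do not depend on the names of the vectors. It also sets stringsAsFactors = FALSE, which keeps mouse.ID as a character column regardless of the R version (older versions converted strings to factors by default).

```r
# The vectors from the previous steps
mouse.ID <- sprintf("mouse%d", 1:8)
batch <- factor(c(rep("batchA", 4), rep("batchB", 4)))
gender <- factor(rep(c(rep("male", 2), rep("female", 2)), 2))
weight.beginning <- c(276, 285, 228, 232, 283, 289, 220, 223)
weight.end <- c(360, 364, 264, 270, 370, 361, 268, 269)

# name = value pairs set the column names explicitly;
# stringsAsFactors = FALSE leaves character vectors as they are
my.experiment.df <- data.frame(
  mouse.ID = mouse.ID,
  batch = batch,
  gender = gender,
  weight.beginning = weight.beginning,
  weight.end = weight.end,
  stringsAsFactors = FALSE
)
my.experiment.df
```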

 

Step 5: check that everything is ok with a few simple functions

First, check that your dataframe my.experiment.df is indeed registered in R as a dataframe with class():

[code language=”r”]
class(my.experiment.df)
[/code]

Then check its structure with str():

[code language=”r”]
str(my.experiment.df)
[/code]


Note that the output tells you a lot here:

  • the number of observations: 8
  • the number of variables: 5
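  • the first 3 variables are factors with 8, 2 and 2 levels respectively (in R 4.0.0 and later, data.frame() no longer converts strings to factors by default, so mouse.ID is listed as a character variable instead)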
  • the last 2 variables are numerical.
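A few other base R functions give a quick overview of a dataframe. The sketch below rebuilds the example dataframe so that it runs on its own, then applies dim(), names(), sapply() with class, head() and summary(); column.classes is an illustrative name.

```r
# Rebuild the example dataframe so that this block is self-contained
my.experiment.df <- data.frame(
  mouse.ID = sprintf("mouse%d", 1:8),
  batch = factor(c(rep("batchA", 4), rep("batchB", 4))),
  gender = factor(rep(c(rep("male", 2), rep("female", 2)), 2)),
  weight.beginning = c(276, 285, 228, 232, 283, 289, 220, 223),
  weight.end = c(360, 364, 264, 270, 370, 361, 268, 269)
)

# Number of rows and columns
dim(my.experiment.df)

# Column names
names(my.experiment.df)

# Class of every column, returned as a named character vector
column.classes <- sapply(my.experiment.df, class)
column.classes

# First 3 rows only
head(my.experiment.df, 3)

# Per-column summary: counts for factors, quartiles and mean for numbers
summary(my.experiment.df)
```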

 

Ooops… need to add a column/variable?

The function cbind() allows you to add one more column to the dataframe. Create a vector for this column first, then bind it to the existing dataframe and store the result in a new dataframe.

[code language=”r”]
weight.at.death <- c(389, 424, 302, 298, 405, 415, 300, 272)
my.updated.experiment.df <- cbind(my.experiment.df, weight.at.death)
my.updated.experiment.df
[/code]
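Another common way to add a column, sketched here as an alternative to cbind(), is to assign the vector to a new column name with the $ operator. The example works on a copy and uses a reduced two-column version of the dataframe to stay short.

```r
# Reduced version of the example dataframe (two columns only, for brevity)
my.experiment.df <- data.frame(
  mouse.ID = sprintf("mouse%d", 1:8),
  weight.end = c(360, 364, 264, 270, 370, 361, 268, 269)
)

# The new variable, as in the text
weight.at.death <- c(389, 424, 302, 298, 405, 415, 300, 272)

# Work on a copy so that the original dataframe stays untouched
my.updated.experiment.df <- my.experiment.df

# Assigning to a column name that does not exist yet creates the column
my.updated.experiment.df$weight.at.death <- weight.at.death
my.updated.experiment.df
```

As with cbind(), the new vector must have as many items as the dataframe has rows.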


 
Ooops… need to add a row/observation?

The same is actually possible with rows. The function is rbind(). However, the process is a bit more involved because the row to be added must contain exactly the same variable names, with compatible types, as the target dataframe. You must then follow this procedure:

  1. create a new dataframe (called new.row.df) that contains all the necessary values, and give its columns the names of the variables of the original dataframe my.updated.experiment.df. Here names() retrieves the names from the original dataframe and colnames() assigns them to the new one;
  2. add the new row to the existing dataframe using rbind().

Here is the code for phase 1:

[code language=”r”]
new.row.df <- data.frame("mouse9", "batchB", "male", 278, 350, 378)
colnames(new.row.df) <- names(my.updated.experiment.df)
new.row.df
[/code]

… and for phase 2:

[code language=”r”]
my.reupdated.experiment.df <- rbind(my.updated.experiment.df, new.row.df)
my.reupdated.experiment.df
[/code]


And indeed, the last row contains the new observation, with all variables.
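The renaming step can be skipped by naming the columns directly when creating the one-row dataframe. The sketch below rebuilds the six-column dataframe so that it runs on its own, then binds a named row to it; it assumes that the names are typed exactly as in the original dataframe.

```r
# Rebuild the updated dataframe so that this block is self-contained
my.updated.experiment.df <- data.frame(
  mouse.ID = sprintf("mouse%d", 1:8),
  batch = factor(rep(c("batchA", "batchB"), each = 4)),
  gender = factor(rep(c("male", "female"), each = 2, times = 2)),
  weight.beginning = c(276, 285, 228, 232, 283, 289, 220, 223),
  weight.end = c(360, 364, 264, 270, 370, 361, 268, 269),
  weight.at.death = c(389, 424, 302, 298, 405, 415, 300, 272),
  stringsAsFactors = FALSE
)

# One-row dataframe with the column names given directly
new.row.df <- data.frame(
  mouse.ID = "mouse9", batch = "batchB", gender = "male",
  weight.beginning = 278, weight.end = 350, weight.at.death = 378,
  stringsAsFactors = FALSE
)

# rbind() matches the columns of both dataframes by name
my.reupdated.experiment.df <- rbind(my.updated.experiment.df, new.row.df)

# Number of rows after the addition, and the last row
nrow(my.reupdated.experiment.df)
tail(my.reupdated.experiment.df, 1)
```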
