# 7. Building a dataframe from vectors/series

It is fairly easy to build a dataframe from scratch in R. All you need is a set of vectors (or series) containing your data and a good recipe! And very often, a good recipe is a simple one…

First you have to decide how many columns this dataframe will have, and what their content will be. Then, you need to create a vector for each column and give it a name. You must be clear about which of these vectors are factors and define them as such with the function `factor()`. Finally, you store these vectors in the new dataframe using the function `data.frame()`.

NB: this recipe has been employed several times in the ANOVA chapter to create the dataframes used as examples.

Let’s take an example: we want to build a dataframe storing data about 8 mice, 4 from a batch called “batchA” and 4 from “batchB”. There are 2 males and 2 females in each batch. The response data stored in the dataset are their weights, measured at both the beginning and the end of an experiment. Finally, these mice have “names” and these names are “mouse1”, “mouse2”, …, “mouse8”.

Step 1: define the names of the variables

Let’s make the column names clear from the beginning. We’ll have 5 columns:

• `mouse.ID`: the names of the individuals (“mouse1”, “mouse2”, …, “mouse8”),
• `batch`: the batch from which these individuals come (“batchA”, “batchB”),
• `gender`: the gender of the individual (“male”, “female”),
• `weight.beginning`: the weight of the individual as recorded at the beginning of the experiment,
• `weight.end`: the weight of the individual as recorded at the end of the experiment.

Out of these 5 columns, 2 correspond to response variables: `weight.beginning` and `weight.end`. 2 columns correspond to factors: `batch` and `gender`. The remaining column will simply contain the ID of the individual. Finally, we decide that the order of the columns is as follows: `mouse.ID`, `batch`, `gender`, `weight.beginning`, `weight.end`.

This is what the dataset looks like in MS Excel, with the data (screenshot not reproduced here).

Step 2: prepare the vectors

The vector `mouse.ID` contains the entries with “mouse” directly followed by a number ranging from 1 to 8. This can be simply coded in the following way:

[code language=”r”]
mouse.ID <- sprintf("mouse%d", 1:8)
mouse.ID
[/code]
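As an alternative sketch, the same labels can be built with `paste0()`, which glues its arguments together without a separator. The name `mouse.ID.alt` is made up for this illustration only.

```r
# sprintf() version, as used in this recipe
mouse.ID <- sprintf("mouse%d", 1:8)

# paste0() concatenates "mouse" with each number, without any separator
mouse.ID.alt <- paste0("mouse", 1:8)

# both approaches give the same character vector
identical(mouse.ID, mouse.ID.alt)  # TRUE
```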

The vector `batch` contains the entries “batchA” and “batchB”. 4 instances of each can be coded with `rep()`, which replicates strings:

[code language=”r”]
batch <- c(rep("batchA",4), rep("batchB", 4))
batch
[/code]

The vector `gender` containing the entries “male” and “female” is easily coded with the function `rep()` which allows replication of strings. Here, we ask R to put two “male” and two “female”, and to repeat that once more:

[code language=”r”]
gender <- rep(c(rep("male",2), rep("female",2)),2)
gender
[/code]
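The nested calls to `rep()` can also be written more compactly with its `each` and `times` arguments. The sketch below is only an alternative to the code above; the names `batch.alt` and `gender.alt` are hypothetical and used here for comparison.

```r
# original, nested versions
batch <- c(rep("batchA", 4), rep("batchB", 4))
gender <- rep(c(rep("male", 2), rep("female", 2)), 2)

# each = repeats every element in turn; times = repeats the whole result
batch.alt <- rep(c("batchA", "batchB"), each = 4)
gender.alt <- rep(c("male", "female"), each = 2, times = 2)

# the compact versions match the nested ones
identical(batch, batch.alt)    # TRUE
identical(gender, gender.alt)  # TRUE
```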

The vectors `weight.beginning` and `weight.end` containing the two response variables may be entered as simple numerical vectors, one for each variable:

[code language=”r”]
weight.beginning <- c(276, 285, 228, 232, 283, 289, 220, 223)
weight.end <- c(360, 364, 264, 270, 370, 361, 268, 269)
weight.beginning
weight.end
[/code]

At this point, we have all the vectors that we need. There is only one thing left to check before making the dataframe: that all vectors have the same number of items! If they do not, `data.frame()` will either stop with an error or, when the shorter length divides evenly into the longest one, silently recycle the shorter vector. To check this, use `length()`, which returns the number of items in the indicated vector:

[code language=”r”]
length(mouse.ID)
length(batch)
length(gender)
length(weight.beginning)
length(weight.end)
[/code]

All are equal to 8. We are good to go.
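With many vectors, the same check can be done in one go. The following is a minimal sketch, assuming the vectors from Step 2 (redefined here so that the block runs on its own); `all.lengths` and `problem` are names introduced only for this illustration.

```r
mouse.ID <- sprintf("mouse%d", 1:8)
batch <- c(rep("batchA", 4), rep("batchB", 4))
gender <- rep(c(rep("male", 2), rep("female", 2)), 2)
weight.beginning <- c(276, 285, 228, 232, 283, 289, 220, 223)
weight.end <- c(360, 364, 264, 270, 370, 361, 268, 269)

# lengths() returns the length of every element of a list
all.lengths <- lengths(list(mouse.ID = mouse.ID, batch = batch,
                            gender = gender,
                            weight.beginning = weight.beginning,
                            weight.end = weight.end))
all.lengths

# TRUE only if all vectors have the same number of items
length(unique(all.lengths)) == 1

# what happens with incompatible lengths (8 vs 3): data.frame() stops,
# and tryCatch() captures the error message instead of halting the script
problem <- tryCatch(data.frame(mouse.ID, too.short = 1:3),
                    error = function(e) conditionMessage(e))
problem
```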

Step 3: set the factors

This is not a mandatory step per se but it is good practice to do it anyway since it may avoid problems.

Both `batch` and `gender` are categorical variables and may as such be recorded in the dataframe as factors. As of now, they are just character vectors, as `class()` confirms.
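A minimal sketch of that check, repeating the Step 2 definitions so that it runs on its own:

```r
batch <- c(rep("batchA", 4), rep("batchB", 4))
gender <- rep(c(rep("male", 2), rep("female", 2)), 2)

# before any conversion, both are plain character vectors
class(batch)   # "character"
class(gender)  # "character"
```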

We thus have to use `factor()` to convert them to factors, and we use `class()` to check that they have been converted properly:

[code language=”r”]
batch <- factor(batch)
gender <- factor(gender)
class(batch)
class(gender)
[/code]
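Once a vector is a factor, `levels()` shows its categories. By default R sorts the levels alphabetically; if another order is preferred (for plots or model output, for instance), it can be set with the `levels` argument of `factor()`. This is only a sketch, and `gender.reordered` is a hypothetical name.

```r
gender <- factor(rep(c(rep("male", 2), rep("female", 2)), 2))

# default: levels sorted alphabetically
levels(gender)  # "female" "male"

# same data, but with an explicit order of the levels
gender.reordered <- factor(gender, levels = c("male", "female"))
levels(gender.reordered)  # "male" "female"

# the values themselves are unchanged
all(as.character(gender) == as.character(gender.reordered))  # TRUE
```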

Step 4: build the dataframe

Now it is time to build the dataframe. We’ll use the function `data.frame()` to assemble all vectors as columns automatically and store the whole dataset under the name `my.experiment.df` (it is good practice, but not mandatory, to write `.df` at the end of the name to remember that it is a dataframe):

[code language=”r”]
my.experiment.df <- data.frame(mouse.ID, batch, gender, weight.beginning, weight.end)
my.experiment.df
[/code]

… and that’s it!

Step 5: check that everything is ok with a few simple functions

First, check that your dataframe `my.experiment.df` is indeed registered in R as a dataframe with `class()`:

[code language=”r”]
class(my.experiment.df)
[/code]

Then check its structure with `str()`:

[code language=”r”]
str(my.experiment.df)
[/code]

Note that the output tells you a lot here:

• the number of observations: 8
• the number of variables: 5
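• `batch` and `gender` are factors with 2 levels each; `mouse.ID` is also reported as a factor (with 8 levels) in R versions before 4.0.0, but as a character vector from R 4.0.0 onwards, since `data.frame()` no longer converts strings to factors by default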
• the last 2 variables are numerical.
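A few other base R functions are handy for this kind of check. The sketch below rebuilds the same dataframe in a compact way (so that it runs on its own) and then inspects it with `dim()`, `names()`, `head()` and `summary()`.

```r
my.experiment.df <- data.frame(
  mouse.ID = sprintf("mouse%d", 1:8),
  batch = factor(rep(c("batchA", "batchB"), each = 4)),
  gender = factor(rep(c("male", "female"), each = 2, times = 2)),
  weight.beginning = c(276, 285, 228, 232, 283, 289, 220, 223),
  weight.end = c(360, 364, 264, 270, 370, 361, 268, 269)
)

dim(my.experiment.df)      # number of rows, then number of columns: 8 5
names(my.experiment.df)    # column names, in order
head(my.experiment.df, 3)  # first 3 rows only
summary(my.experiment.df)  # per-column summary (counts for factors, quartiles for numbers)
```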

Ooops… need to add a column/variable?

The function `cbind()` allows you to add one more column to the dataframe. Create a vector for this column first, then bind it to the existing dataframe and store the result in a new dataframe.

[code language=”r”]
weight.at.death <- c(389, 424, 302, 298, 405, 415, 300, 272)
my.updated.experiment.df <- cbind(my.experiment.df, weight.at.death)
my.updated.experiment.df
[/code]
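Another common way to add a column is to assign to it with `$`. The sketch below works on a copy so that the original dataframe stays untouched; the dataframe is rebuilt compactly here so that the block runs on its own. As with `cbind()`, the new vector must have one value per row.

```r
my.experiment.df <- data.frame(
  mouse.ID = sprintf("mouse%d", 1:8),
  batch = factor(rep(c("batchA", "batchB"), each = 4)),
  gender = factor(rep(c("male", "female"), each = 2, times = 2)),
  weight.beginning = c(276, 285, 228, 232, 283, 289, 220, 223),
  weight.end = c(360, 364, 264, 270, 370, 361, 268, 269)
)

# copy the dataframe, then create the new column by assignment
my.updated.experiment.df <- my.experiment.df
my.updated.experiment.df$weight.at.death <- c(389, 424, 302, 298, 405, 415, 300, 272)

names(my.updated.experiment.df)  # the new column comes last
ncol(my.experiment.df)           # the original still has 5 columns
```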

Ooops… need to add a row/observation?

The same is actually possible with rows. The function is `rbind()`. However, the process is a bit more involved because the row to be added must contain exactly the same variable names, and compatible types, as those in the target dataframe. You must then follow this procedure:

1. create a new dataframe (called `new.row.df`) that contains all the necessary variables, and give its variables the names of the variables of the original dataframe `my.updated.experiment.df`: `names()` retrieves the names from the original dataframe and `colnames()`, used on the left of `<-`, assigns them to the new one,
2. add the new row to the existing dataframe using `rbind()`.

Here is the code for phase 1:

[code language=”r”]
new.row.df <- data.frame("mouse9", "batchB", "male", 278, 350, 378)
colnames(new.row.df) <- names(my.updated.experiment.df)
new.row.df
[/code]

… and for phase 2:

[code language=”r”]
my.reupdated.experiment.df <- rbind(my.updated.experiment.df, new.row.df)
my.reupdated.experiment.df
[/code]

And indeed, the last row contains the new observation, with all variables.
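The renaming step can be avoided by naming the columns directly when the new row is created. The following is a minimal sketch of that variant, with the 6-column dataframe rebuilt compactly so that the block runs on its own; `unnamed.row.df` and `attempt` are hypothetical names used to show what happens when the names do not match.

```r
my.updated.experiment.df <- data.frame(
  mouse.ID = sprintf("mouse%d", 1:8),
  batch = factor(rep(c("batchA", "batchB"), each = 4)),
  gender = factor(rep(c("male", "female"), each = 2, times = 2)),
  weight.beginning = c(276, 285, 228, 232, 283, 289, 220, 223),
  weight.end = c(360, 364, 264, 270, 370, 361, 268, 269),
  weight.at.death = c(389, 424, 302, 298, 405, 415, 300, 272)
)

# name each value after the column it belongs to
new.row.df <- data.frame(mouse.ID = "mouse9", batch = "batchB", gender = "male",
                         weight.beginning = 278, weight.end = 350,
                         weight.at.death = 378)

my.reupdated.experiment.df <- rbind(my.updated.experiment.df, new.row.df)
nrow(my.reupdated.experiment.df)  # 9
tail(my.reupdated.experiment.df, 1)

# without matching names, rbind() refuses to combine the two dataframes;
# tryCatch() captures the error message instead of halting the script
unnamed.row.df <- data.frame("mouse9", "batchB", "male", 278, 350, 378)
attempt <- tryCatch(rbind(my.updated.experiment.df, unnamed.row.df),
                    error = function(e) conditionMessage(e))
attempt
```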
