`dplyr`

has a handful of functions that allow for cleaning a data set by selecting a specific subset of observations. Here are the functions we will look at here:

`filter()`

: extract rows that meet logical criteria`slice()`

: extract rows by position`top_n`

: extract the rows containing the n highest/lowest values for a given variable`top_frac`

: extract a fraction of a data set where rows contain the highest/lowest values for a variable`sample_n`

: extract n randomly-selected rows`sample_frac`

: extract a fraction of randomly-selected rows`distinct()`

: keep only unique rows (remove duplicates)

Let’s use the data frame `Orange`

as an example. The top of the data frame looks like this:

head(Orange)

#### Extracting rows by criteria with filter()

`filter()`

allows you to retrieve from a data frame the rows (observations) which match logical criteria. Logical criteria are operations for which the result is TRUE or FALSE and which contain logical operators such as `>`

, `<`

, `>=`

, `>=`

, `==`

, `!=`

, etc. For instance, to retrieve the observations that concern tree #4, we may write:

Orange %>% filter(Tree == 4 )

which results in:

And to retrieve the observations that were performed since age 1004:

Orange %>% filter(age >= 1004)

We may even combine logical criteria:

Orange %>% filter(age >= 1004 & Tree == 4)

#### Extracting rows by position with slice()

`slice()`

allows you to retrieve from a data frame the rows (observations) which have a given position in this data frame.

Here is how to retrieve the rows 8, 9, 10 and 11, regardless of their content:

Orange %>% slice(8:11)

If `slice()`

is only provided with a single value, it will pick up the corresponding rows in the data frame:

Orange %>% slice(9)

If `slice()`

is provided with negative values, it will pick up the data frame but discard the corresponding rows:

Orange %>% slice(-1:-7)

#### Extracting a given number of rows containing the top/bottom values for a variable with top_n()

`top_n()`

allows for retrieving the top or bottom *n* values in a data frame according to a given variable. If `top_n()`

is only given a positive value *n* but no variable, it picks up the top *n* observations while considering the last variable in the data frame (and will indicated it in red):

Orange %>% top_n(10)

As expected, the console returns the 10 observations for which `circumference`

(the last variable) is highest, and indicates “Selecting by circumference”.

When `top_n()`

is given a negative value *n*, it retrieves the bottom *n* values:

Orange %>% top_n(-10)

When `top_n()`

is provided with a variable, it will extract the observations for which the *n* values are highest/lowest in that variable:

Orange %>% top_n(-10, circumference)

Note that `top_n()`

may provide you with more than *n* rows than expected. If there are ties (i.e. several rows with equal values for the given variable), the function will give you all the observations matching the criteria. In the following example, we expect only 3 rows to be extracted by `top_n()`

:

Orange %>% top_n(3, age)

However the result tables comes with 5 observations for all of which `age`

equals 1582, the highest value in the variable.

#### Extracting a fraction of a data set containing the top/bottom values for a variable with top_frac()

`top_frac()`

works in a similar way to `top_n()`

. It sorts the data frame by a given variable, and retrieves the rows with the top or bottom values. This time, unlike `top_n()`

which retrieves a given number of rows, `top_frac()`

retrieves a given percentage/fraction of the number of rows. In the following example, we ask for the top 20% of the rows sorted by the variable `circumference`

:

Orange %>% top_frac(.2, circumference)

As expected we get 7 out 35 observations .

Again, if no variable is mentioned in `top_frac()`

, the last variable is considered. If a negative value is given, the bottom rows will be selected. And if there are ties, all the observations with equal values are provided:

Orange %>% top_frac(-.2, age)

Here for instance, we got the 10 bottom rows when we only expected 7.

#### Extracting a given number of randomly selected rows with sample_n()

`sample_n`

extracts a number *n* of rows which have been randomly picked from the data frame. Here is an example:

Orange %>% sample_n(5)

Since the selection is random, using twice the same code provides you with two different samples:

When randomly retrieving a large number of observations from a data frame, you may be given twice the same observation. `sample_n()`

avoids this by using *by default* the argument `replace = FALSE`

(i.e. you do not need to write the argument to make sure that all the picked observations are different). However, if, for some reason, you want to accept duplicates in your sample, you must add (`replace = TRUE`

).

#### Extracting a given fraction of randomly selected rows with sample_frac()

`sample_frac`

extracts a given fraction/percentage *n* of rows which have been randomly picked from the data frame. Here is an example:

Orange %>% sample_frac(.5)

Again, since the selection is random, there is very little chance that using twice the same code will provide you with two identical samples.

When randomly retrieving a large fraction of a data frame, you may be given twice the same observation. `sample_frac()`

avoids this by using *by default* the argument `replace = FALSE`

(i.e. you do not need to write the argument to make sure that all the picked observations are different). However, if, for some reason, you want to accept duplicates in your sample, you must add (`replace = TRUE`

).

#### Removing all duplicate rows with distinct()

`distinct()`

is a function that checks the data frame for duplicate values for a (combination of) given variable(s) and thus returns only unique observations. In the following example, we will check `Orange`

and retrieve only unique observation based on the variable `circumference`

:

Orange %>% distinct(circumference)

Out of the 35 original observations, `distinct()`

has retrieved 30 rows, and thus discarded 5 rows where non-unique values were found in `circumference`

. Note that the present table only shows the value in the column `circumference`

, not the whole row. To keep the whole row and thus display all the corresponding variables, add the argument `.keep_all = TRUE`

to your code:

Orange %>% distinct(circumference, .keep_all = TRUE)

Both `Tree`

and `age`

now appear next to `circumference`

in the output table.

Doing the same for `Tree`

instead of `circumference`

reduces even more the output:

Orange %>% distinct(Tree, .keep_all = TRUE)

Only 5 observations have been kept by `distinct()`

, and you can see that these were not randomly selected, but were the first rows from the top (based on `age`

). You may thus be careful when using `distinct()`

to “clean” your data frame as you may discard rows with potentially interesting content based on the fact that they had duplicate values in another column…

You may combine variables and logical operators to “force” `distinct()`

to select another set of duplicate values when necessary. Here, we reuse the previous code, but add `age = 1372`

to keep only the duplicates for which `age`

equals 1372, instead of those containing 118:

Orange %>% distinct(Tree, age = 1372, .keep_all = TRUE)