Manipulating Data using dplyr: part 1

Overview

In this session, we will learn how to manipulate data and in particular, to select/modify/create columns, filter/order rows and calculate summaries. Knowing how to manipulate data is an essential skill that anyone working with data needs to acquire. We will focus on using the package dplyr, which is part of tidyverse, like ggplot2, and is probably the most commonly used package when it comes to this sort of data manipulation. I'm saying this sort of data manipulation, because we are not going to talk about more advanced manipulations like merging datasets or going from wide to long format.

This session is split into two parts, and you'll have a series of exercises and questions along the way and at the end.

In the first part - this workbook - we will mainly focus on learning how to use 5 of the 6 core functions of the package dplyr. These functions will help you perform all the data manipulations mentioned in the first sentence, that is:
- select columns - with select()
- filter rows - with filter()
- reorder rows - with arrange()
- calculate summaries - with summarise()
- create/modify columns - with mutate()
In the second part, we will learn how to efficiently combine all these functions and a couple of new ones, to perform all the complex sequences of manipulations that you will need in your work. In particular, we will learn to
- perform a sequence of manipulation by saving and reusing the output of functions
- use the function group_by() which will allow manipulations like summaries or filters by groups of observations
- use the pipe %>%, an operator similar to the + in ggplot, that makes it very easy to perform cascades of manipulations on a data frame

Let's start with a video where I give you more information about data manipulation, the package dplyr and its core functions.

loading dplyr

dplyr is an additional package that needs to be installed and then loaded.

We haven't learnt how to properly do that yet, but in this workbook dplyr is already pre-loaded, so we're good to go and learn about the core functions of dplyr!

If you do wish to learn more about how to install packages and use Rstudio you can check out another module here

the dataset

The data we are using in this session is an extract of a survey conducted in Uganda from farmers identified as growing beans.

The dataset contains an extract of 50 responses to 23 of the survey questions, and has been imported to R as a data frame called BeanSurvey.

A summary of the columns in the dataset is below.

Column	Description
ID	Farmer ID
VILLAGE	Village name
HHTYPE	Household composition
GENDERHH	Gender of Household Head
AGEHH	Age of Household Head
OCCUHH	Occupation of Household Head
ADULTS	Number of Adults within the household
CHILDREN	Number of Children (<18) within the household
MATOKE	Do they grow matoke?
MAIZE	Do they grow maize?
BEANS	Do they grow beans?
BANANA	Do they grow banana?
CASSAVA	Do they grow cassava?
COFFEE	Do they grow coffee?
LANDAREA	Land area of farm (acres)
LABOR	Labor usage
INTERCROP	Intercrops with beans
DECISIONS	Household decision responsibility
SELLBEANS	Do they grow beans for sale?
BEANSPLANTED_LR	Quantity of beans planted in long rain season
BEANSPLANTED_SR	Quantity of beans planted in short rain season
BEANSHARVESTED_LR	Quantity of beans harvested in long rain season
BEANSHARVESTED_SR	Quantity of beans harvested in short rain season

Spend some time looking through the exploring the full dataset embedded below, to familiarise yourself with the columns and the type of data stored within each column. You may need to refer back to this data at times during this tutorial. Remember that R is case sensitive, so you will always have to refer to the variables in this dataset exactly as they are written in the data. There is a column in this data called "GENDERHH" but there is no column in this data called "GenderHH".

BeanSurvey

select()

The select() function allows you to retrieve columns from a dataset. It's not the most important function of dplyr, but it is probably one of the simplest. Inside the function you first indicate the dataset and then the columns you want to keep

Let's first retrieve the column AGEHH.

select(BeanSurvey, AGEHH)

You will see that all the functions of dplyr are based on the same basic syntax where the first argument is a data frame. Here, our data frame is called 'BeanSurvey', so the first thing we write inside the brackets of the function is 'BeanSurvey'. Then we indicated the name of the column to retrieve, that is 'AGEHH'.

To retrieve more than one column, simply list all the columns you want, separated with commas.

select(BeanSurvey, VILLAGE, GENDERHH, AGEHH, LANDAREA)

Don't forget that R is case sensitive by the way. If you spell the name of a column wrongly, it will not work! Can you spot what is wrong in the command below? Make the necessary corrections to make it work, and check the solution if you struggle.

select(BeanSurvey, VILLAGE, GENDERH, LANDAREA)

# 'village' should be written with capital letters!
# If you correct it, it will still not work, because the column GENDERH doesn't exist.
# it should be 'GENDERHH' (with two 'H' instead of just one)
# Here is the correct command:
select(BeanSurvey, VILLAGE, GENDERHH, LANDAREA)

select() follows your instructions, so if you change the order of the columns in your command, the order in the output will also change.

select(BeanSurvey, GENDERHH, LANDAREA, VILLAGE, ID)

The first column of our dataset, ID, is now showed last!

When you want to retrieve lots of columns, it quickly becomes painful to list them all individually. Instead, you can use a colon : which asks R to retrieve all the columns that are positioned between the columns you indicate to the left and right of the colon. You could translate the colon by to and MATOKE:COFFEE by "all the columns from MATOKE to COFFEE" . So in the following command, we are asking R to retrieve all the columns from MATOKE to COFFEE:

select(BeanSurvey, MATOKE:COFFEE)

And you can combine both methods to try and get all the columns you want in the least amount of effort.

select(BeanSurvey, OCCUHH, MATOKE:COFFEE, INTERCROP)

Note that if you struggle finding the names of your columns, colnames() is a quick way to list them all:

colnames(BeanSurvey)

Question: Shorten the following command by using the "colon" operator whenever it's possible

select(BeanSurvey, VILLAGE, HHTYPE, GENDERHH, AGEHH, MATOKE, MAIZE, BEANS, BANANA, CASSAVA, COFFEE, LANDAREA, INTERCROP)

select(BeanSurvey, VILLAGE:AGEHH, MATOKE:LANDAREA, INTERCROP)

select also has some helper functions that we can use inside the select function, to choose columns based on criteria. Some useful ones to know here would be starts_with, ends_with and contains. For example if we wanted to select just the bean yield variables and the BEANS variable

select(BeanSurvey, starts_with("BEANS"))

filter()

When you want to retrieve specific rows rather than columns, you use the function filter(). It is a function that you will use very often, for example, to filter out observations that appear to be of bad quality or that are not relevant for your analysis.

The way it works is similar to select(): we write the dataset first, and then we indicate the rows that we want to retrieve. Except that our rows don't have names, so we use conditions on some of our columns instead. In the command below, the expression ADULTS>3 tells R that we want all the rows for which the column ADULTS has a value greater than 3. That is, we are asking R to keep all the households that have more than 3 adults

filter(BeanSurvey, ADULTS>3)

As you can see, R returns us all the columns, but the number of rows is now 5 instead of 50.

In R, the syntax to check if a value is greater or lower than another value is intuitive. You use the symbols > and < like we just did. However, to check if a value is EQUAL TO another value, you need to use ==, not =. That's because a single equal has another use:

A single equals sign is a statement. When you write x=y, it sets x to be equal y.
A double equals sign is a question. When you write x==y, you're asking R Is x equal to y?. If the response is yes, the condition is verified.

So to retrieve all the households for which the household head is of gender female, we use the double equals sign.

filter(BeanSurvey, GENDERHH=="female")

In the command above R will check all the rows of the dataset BeanSurvey and returns the ones where the column GENDERHH takes the value "female". Note the quotes around female. Whenever you write a string of text that is not an object or a column, you need to encapsulate it between quotes. Otherwise R will try to interpret it as an object or the name of a column.

Also note that in R, you will not always get an error when you make a mistake. For example, if you write "female" with a capital F, you will just get no result:

filter(BeanSurvey, GENDERHH=="Female")

That's because even though R does not give you the answer you want, the command you wrote is totally valid. You're asking R to retrieve all the rows where Gender takes the value "Female". There are none, because R is case sensitive and female is always written with a lower case f in our dataset. If you were to write "GENDER" instead of "GENDERHH" on the left of the double equals sign though, you would get an error, because there's no column named "GENDER", so R cannot check the condition.

Using multiple conditions

You can also use multiple conditions and additional functions to filter rows. Here are the main logical symbols that you can use when building conditions in R:

== means EQUALS
!= means DIFFERENT
< means LESS THAN
> means GREATER THAN
<= means LESS THAN OR EQUAL TO
>= means GREATER THAN OR EQUAL TO

& means AND
| means OR
! means NOT

If you're not familiar with the use of logical operations, have a look at the first 3 minutes of this video: R Tutorial - Logical Operators and Vectors in R

Let's use multiple conditions in a couple of examples. One thing you should do to check the quality of your data is to inspect it for inconsistency. For example, it would make no sense to have households that 'grow beans for sale', but 'don't grow beans'! That is we can't have at the same time the value of the column SELLBEANS be "yes" and the value of the column BEANS be "no". Lets' check that no row is satisfying this absurd double condition with filter():

filter(BeanSurvey, SELLBEANS=="Yes" & BEANS=="No")

In the command above, we separated the two condition with &, which is the symbol used in R for 'AND', because we want to retrieve the rows where SELLBEANS=="Yes" AND BEANS=="No". R did not retrieve any row, which is good. Don't forget that R is case sensitive though. In our data, the value "Yes" is written with an upper case "Y" and a lower case "es", and the value "No" is written with an uppercase "N" and a lowercase "o". It would be very easy to misspell these two values in our command, which would make us reach the same conclusion that there is no inconsistency in our data, but without having tested it properly!

filter can also be useful to explore a specific subset of the population. For example, I saw that the head of household of most households is either a farmer or a fisherman. I am a bit curious about the households who are not in this situation. We can retrieve them using the following command:

filter(BeanSurvey, !(OCCUHH=="Farmer" | OCCUHH=="Fisherman"))

The command above looks a bit complicated, but it is actually quite simple. Let's break it into steps.

The symbol | represents "OR", so the following command would retrieve all the households for which the head of household is a farmer OR a fisherman:

filter(BeanSurvey, OCCUHH=="Farmer" | OCCUHH=="Fisherman")

Brackets are used to group expressions, in order to force R to perform operations in a specific order, exactly as in basic mathematical operations. Encapsulating the whole expression within brackets doesn't change the output of our command:

filter(BeanSurvey, (OCCUHH=="Farmer" | OCCUHH=="Fisherman"))

Finally we use the symbol ! that represents "NOT", to only keep the rows where the head of household is NOT either a farmer or a fisherman.

filter(BeanSurvey, !(OCCUHH=="Farmer" | OCCUHH=="Fisherman"))

The brackets are important here, because in boolean logic "NOT" takes priority over "OR", so using the expression !OCCUHH=="Farmer" | OCCUHH=="Fisherman" we would retrieve the rows where the head of household is not a farmer, OR the head of household is a fisherman. This type of priority is similar to the priority of multiplication over addition or subtraction. In maths, the result of the calculation 3 x 2 + 2 is 8. To have the addition 2 + 2 performed before the multiplication 3 x 2, we need to add brackets: 3 x (2 + 2). It is the same with NOT and OR.

Also note that we could have retrieved our households in another way, using "&" and "!=" (AND and IS DIFFERENT) instead of "|" and "!" (OR and NOT):

filter(BeanSurvey, OCCUHH!="Farmer" & OCCUHH!="Fisherman")

Indeed, the command above is asking R for the rows where

the occupation of the head of household IS DIFFERENT from farmer AND the occupation of the head of household IS DIFFERENT from fisherman.

There are often many different ways to get to the same result in R. We believe that the commands we show you in these workbooks are good ones to work efficiently and have a readable code, but if for some reason you want to use different commands, you can do so!

using functions within a condition statement

In a condition statement, we can also use functions like max() or min() to help us with the filtering. Let's retrieve the households whose land area of farm is the largest:

filter(BeanSurvey, LANDAREA==max(LANDAREA))

In the command above, R first calculates the result of max(LANDAREA) in the dataframe BeanSurvey, which is 10:

max(BeanSurvey$LANDAREA)

And it then performs the corresponding filtering:

filter(BeanSurvey, LANDAREA==10)

As you see, we didn't get one single result here. That's because filter retrieves all the rows that satisfy the specified condition, and we have several households whose farm land area is 10 acres.

Question: write the command that would show the household that has the highest quantity of beans planted in long rain season. Hint: there is a missing value in the column

filter(BeanSurvey, BEANSPLANTED_LR==max(BEANSPLANTED_LR, na.rm=TRUE))
# if we omit the argument 'na.rm=TRUE' in the function max(), we end up with no household.
# that's because the missing value makes max(BEANSPLANTED_LR) be unknown/missing as well.

arrange()

Let's relax a little and have a brief look at the arrange() function, which is used to order the rows of a data frame according to the values of some columns. The syntax is very similar to that of select(), as after we enter the dataset, we simply indicate the column(s) by which we want to order the dataset. Let's order our dataset by land area of farm:

arrange(BeanSurvey, LANDAREA)

Depending on the size and resolution of your screen, you may need to click the little arrow at the top right to reach the column LANDAREA and check that the data has indeed been ordered by that variable. You will see that the households with smallest land area are at the top. That's because by default, arrange() orders your dataset by increasing values of the indicated column. If we want to order from highest to lowest value instead, we need to place our column inside the function desc() - "desc" for descending

arrange(BeanSurvey, desc(LANDAREA))

Note that we can order a dataset by multiple columns and these columns don't necessary have to be numeric. Let's order our dataset by age group, gender of the head of household, and decreasing order of the number of children respectively:

arrange(BeanSurvey, AGEHH, GENDERHH, desc(CHILDREN))

The households where the head of household are between 20 and 30 years old are now at the top, the one with a female head of household being first. And for the male headed households, the higher the number of children, the higher they place in the table.

As this section is a bit shorter than the other ones, let me briefly introduce you to the function slice(), which is quite useful to quickly retrieve specific rows by position. The syntax is similar to all the core functions of dplyr. The data is the first argument, and then we indicate the rows that we want. Since 1:5 is a shorthand for c(1,2,3,4,5), we can retrieve the first 5 rows of our dataset with the following command:

slice(BeanSurvey, 1:5)

slice() is often useful after arrange(), to retrieve the few rows that have the highest or lowest values for the column used to order the rows. Combining manipulations like that is what we'll learn in the second part of the workbook. Stay tuned!

Question: Use the function arrange() to order the rows of the BeanSurvey dataset by decreasing order of the number of adults in the household

arrange(BeanSurvey, desc(ADULTS))

summarise()

The next function we will learn about is summarise(), which calculates summaries of variables within our dataset. As with all the other dplyr functions, the first argument is the name of our data. The second argument provides a summary function used on a variable from the data.

For example, to know the total area of all the household farms in the BeanSurvey data frame, we can use the summary function sum():

summarise(BeanSurvey, sum(LANDAREA))

This seems equivalent to doing:

sum(BeanSurvey$LANDAREA)

Except that with summarise(), the output is a data frame, which is extremely useful if we want to use our summaries later, or combine our command with other manipulations. We should also get used to give a proper name to the output column containing our summary statistics to make it easy to reuse it in later steps. To do so, we place the name we would like for our summary followed by a single equals sign just in front of our calculation.

summarise(BeanSurvey, totalArea=sum(LANDAREA))

Note that we use the single equals sign, not the double equals sign, because we are assigning our summary statistics sum(LANDAREA) to a column called totalArea.

Remark: You may sometimes see people use summarize() instead of summarise() like I do in the video. They are identical. The people who made dplyr are from New Zealand, where they use the British spelling summarise(). But they are very nice people and decided to allow for the American spelling as well.

We can ask R to give us more than one summary, by listing the calculations we want to perform, separated with commas.

summarise(BeanSurvey, households=n(), mean_area=mean(LANDAREA), sd_area=sd(LANDAREA))

Here we used the functions mean() and sd() to calculate the mean, and standard deviation of the column LANDAREA. We also used the function n(), which simply counts the number of rows - that is, here, the number of households. Note that with n() we don’t put a name of variable inside the brackets, since the number of rows does not depend on the variables.

So far, we've only used numeric variables in our calculations, but we can also use factors or categorical variables. For example, it is often useful to count the number of occurrences of a certain value using conditions, a bit like we did in the filter section. Let's try to count the number of female headed and male headed households:

summarise(BeanSurvey, female_HH=sum(GENDERHH=="female"), male_HH= sum(GENDERHH=="male"))

Argh, we obtain the value 'NA'! We've discussed the issue of R not being able to calculate some summary statistics like a mean, sum or maximum when there's a missing data in the variable, unless we add the argument na.rm=TRUE. We've not discussed how to check the presence of missing values in the first place though. One good way to do that would be to use filter() together with the base-R function is.na().

Let's see what is.na() does when we apply it to our gender variable:

is.na(BeanSurvey$GENDERHH)

It gives us a vector that take the value TRUE when GENDERHH is missing and false otherwise.

We can then use it inside filter() to keep only the rows where is.na(GENDERHH) is equal to TRUE:

filter(BeanSurvey, is.na(GENDERHH)==TRUE)

Here we go, this household in Kimbugu was the one where the gender of the head of household was missing! Note that we didn't place any quote around TRUE, which may be confusing for some of you, as we told you a few times that you should always have quote around strings of text that are not existing objects or variables within R. Well that's because within R, TRUE and FALSE are recognized as values rather than strings of text. Don't worry, we'll talk more about it a bit later.

So now that we confirmed that there is a missing value in the variable GENDERHH, we can tell R to skip it using the argument na.rm=TRUE within the sum() function:

summarise(BeanSurvey, female_HH=sum(GENDERHH=="female", na.rm=TRUE), male_HH= sum(GENDERHH=="male", na.rm=TRUE))

We have our summaries!

Question: Use summarise to calculate the number of households growing beans, maize and coffee. Give sensible names to these numbers

summarise(BeanSurvey, grow_beans = sum(BEANS=="Yes"), grow_maize = sum(MAIZE=="Yes"), grow_coffee = sum(COFFEE=="Yes"))

summarise() is especially useful when it is used in combination with the function group_by(), as the latter allow us to calculate summaries for groups of observations rather than for the whole dataset. We will learn how to do that a bit later in the second part of this workbook.

mutate()

The next function to look at is mutate(). It is used to modify existing columns or create new columns. The syntax is as follows: We indicate our dataset first, as always, and then we provide the calculations that we want to perform for our new columns.

For example, it would probably be useful to create a new column representing the size of each household, by adding the number of children and the number of adults:

mutate(BeanSurvey, hh_size = ADULTS + CHILDREN)

Like for the summarise function, we could omit naming our new column, but then R would just use the calculation itself as the column name. That's what happened in the video. We don't always set a good example!

You can see that the new column hh_size appears at the very end of the dataset (use the little black arrow at the top right to reach it). We could use select() to move it more towards the beginning of our columns if we knew how to combine multiple manipulations. We will see how to do that shortly.

Similarly to summarise(), you can use functions like sum(), min(), max() , or calculations over conditions to help you create your new columns. Let's create a variable that tells us whether or not a household has children:

mutate(BeanSurvey, has_children = CHILDREN > 0)

In the command above, when the column CHILDREN is greater than 0, the expression CHILDREN > 0 returns TRUE. Otherwise, it returns FALSE. If you look just below the name of our new variable (click the black arrow to get to the right end of the dataset), you will see a set of weird grey characters. It indicates the type of the variable. Here the set of characters is <lgl> and "lgl" is a shorthand for 'logical'. That's because we just created a variable of type 'logical'. The only values that a logical variable can take are 'TRUE' and 'FALSE', which are special values used by R to determine if a condition is true or false. We had already created a logical variable a few commands ago when we checked if there was any missing value in the variable GENDERHH using is.na(GENDERHH), though we did not save it. With mutate, we keep it in our dataset!

Note that you have a similar set of grey characters below every column name, and <chr>, <int>, <dbl> indicate that the corresponding variables are of type 'character', 'integer' and 'double' respectively, the two latter being different types of numeric variables. The operations and functions that you can apply to a variable depends on their type, so this information is very useful, especially to understand why you get errors or things don't go as expected. For example, whenever you try to make a calculation with a character variable, R will give you an error.

creating categorical variables using ifelse()

So far, we've created a numeric column (hh_size was of type integer) and a column of type logical. Creating a proper categorical variable is a bit less straightforward. Let's see one nice way to do so, using the base-R function ifelse(), which checks every row for a condition and returns a value that depends of this condition. ifelse() takes three arguments. A condition to verify, a value that is returned if the condition is verified, and a value returned if the condition is not verified. So the syntax is:

ifelse(CONDITION, VALUE IF CONDITION IS TRUE, VALUE IF CONDITION IS FALSE)

Let's try with a simple example. We'll create a simple column hh_occupation, that takes only two values: "Farmer" or "Other". As a reminder, here are the values that the column OCCUH takes in our dataset:

BeanSurvey$OCCUHH

To generate a variable that takes the value "Farmer" if the occupation of the head of household is farmer, and "other" otherwise, we can use the following command:

ifelse(BeanSurvey$OCCUHH == "Farmer", "Farmer", "Other")

And we can do the same, but inside the function mutate, to add this variable to our dataset. We need to remove "BeanSurvey$" from the command though, since the data is already indicated as the first argument of mutate.

mutate(BeanSurvey, hh_occupation = ifelse(OCCUHH== "Farmer", "Farmer", "Other"))

creating multiple columns and replacing columns

Of course, we can create multiple columns at once. We simply need to separate the associated calculations with commas.

mutate(BeanSurvey, hh_size = ADULTS + CHILDREN, has_children = CHILDREN > 0, hh_occupation = ifelse(OCCUHH== "Farmer", "Farmer", "Other"))

Also if you remember, I said that mutate can also "modify" a column. You achieve that by using an already existing column name as the name of your new column. Here we replace the variable CHILDREN rather than creating the variable has_children:

mutate(BeanSurvey, CHILDREN = CHILDREN > 0)

You need to be careful when modifying existing columns though. In general, as it's not too much of a pain to have lots of columns - since we have the function select() to pick the ones we want - we often prefer to create new columns rather than modifying existing ones.

Question: create a column that measures the yield of beans per acre of each household in long rain season. Give it a sensible name

mutate(BeanSurvey, yield_per_acre_LR = BEANSHARVESTED_LR/LANDAREA)

Such new column will be useful for analysing our dataset, but we still don't know how to associate the result of our manipulations with each other and other operations. Time for the second part of the dplyr session!

Appendix: Useful reference links

The official dplyr documentation: https://dplyr.tidyverse.org/

dplyr CheatSheet:https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf

Data Manipulation Tools - Rstudio video: dplyr -- Pt 3 Intro to the Grammar of Data Manipulation with R

Some documentation on subsetting r-objects using base-R: https://bookdown.org/rdpeng/rprogdatascience/subsetting-r-objects.html