Manipulating Data using dplyr: part 2

Combining manipulations

We've now learnt to use most of the core functions of dplyr. But their use is greatly limited by the fact that we still don't know how to combine them. You can review the first tutorial at any time.

As explained in the video, if we don't store the results of our commands, there is no way to re-use them. This is actually true for the great majority of R commands, not just those involving the core functions of dplyr. So to perform a sequence of manipulations, we need to either:

store the result of each manipulation as a data frame, to then make it the first argument of the next function.
combine all the manipulations we want to perform into one single command using the pipe operator.

You already know everything you need to perform a sequence of manipulations via the first option. For example, if we wanted to calculate a couple of summary statistics on the households living in Lwala village, we could do the following:

BeanSurvey_Lwala <- filter(BeanSurvey, VILLAGE=="Lwala")
summarise(BeanSurvey_Lwala, households=n(), mean_land=mean(LANDAREA), grow_beans= sum(BEANS=="Yes"))

First we use filter() to keep only the households of "Lwala". We store the result as a new object, called "BeanSurvey_Lwala". Nothing gets printed at this stage. Then we use the summarise() command, but with the newly created object as the first argument instead of the full BeanSurvey dataset.

Question: Change the command to get summaries for the households in Kimbugu rather than Lwala. Give a sensible name to the intermediate data frame.

BeanSurvey_Lwala <- filter(BeanSurvey, VILLAGE=="Lwala")
summarise(BeanSurvey_Lwala, households=n(), mean_land=mean(LANDAREA), grow_beans= sum(BEANS=="Yes"))

BeanSurvey_Kimbugu <- filter(BeanSurvey, VILLAGE=="Kimbugu")
summarise(BeanSurvey_Kimbugu, households=n(), mean_land=mean(LANDAREA), grow_beans= sum(BEANS=="Yes"))

But what if there were, say, 20 villages? Doing this for each village would be very time-consuming. Don't worry, there is a much better approach, using group_by(), the last core function of the dplyr package.

group_by()

group_by() tells R to separate a dataset into groups, based on the values of a column. All the subsequent operations performed on the resulting "grouped" dataset are applied to each group rather than to the whole dataset. For the syntax, we indicate the dataset first, as usual, and then we indicate the column whose values will define our groups. Let's group our dataset by village:

group_by(BeanSurvey, VILLAGE)

Let's see... 50 rows, 23 columns, original order of these rows and columns... Well, it looks like nothing happened to our dataset. But that's just because the grouping is invisible. We need to apply another function to see the effect of group_by(). Let's store our grouped data frame in an object called, say, "BeanSurvey_ByVillage", and let's use this object as the first argument of the function summarise():

BeanSurvey_ByVillage <- group_by(BeanSurvey, VILLAGE)
summarise(BeanSurvey_ByVillage, households=n(), mean_land=mean(LANDAREA), grow_beans= sum(BEANS=="Yes"))

Yes, now our summarise() command returns two rows instead of one: one row of summaries per village!
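Although the grouping does not change the rows or columns, it is stored with the data frame, and we can ask R about it. Here is a minimal sketch, assuming a small made-up data frame called toy_survey (hypothetical, standing in for BeanSurvey) and using the dplyr helper functions group_vars() and n_groups():

```r
library(dplyr)

# Hypothetical toy data standing in for BeanSurvey (values are made up)
toy_survey <- data.frame(
  VILLAGE  = c("Kimbugu", "Kimbugu", "Lwala", "Lwala"),
  LANDAREA = c(2, 4, 1, 3)
)

toy_by_village <- group_by(toy_survey, VILLAGE)

# group_vars() lists the grouping columns; an ungrouped data frame has none
group_vars(toy_survey)
group_vars(toy_by_village)

# n_groups() counts how many groups the data frame is split into
n_groups(toy_by_village)
```

This can be a handy check when a later command behaves unexpectedly and you are not sure whether your data is still grouped.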

The effect of group_by() on the result of summarise() is very intuitive. We obtain the calculated summaries for each of the groups defined by the function group_by(). At first, it might be slightly less obvious that group_by() is also very useful in combination with filter() or mutate(). But consider the case where we would like to retrieve, for each village, the information of the household that has harvested the highest quantity of beans in the long rain season. With filter(), we can easily get the household that has harvested the highest quantity of beans during the long rain season in the full dataset:

filter(BeanSurvey, BEANSHARVESTED_LR==max(BEANSHARVESTED_LR, na.rm=TRUE))

Note that there is one value that is missing in the column BEANSHARVESTED_LR, so we need to use na.rm=TRUE like for the function mean() earlier.
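To see why na.rm=TRUE matters here, consider this small sketch. The vector harvest is hypothetical, with values made up for illustration:

```r
# Hypothetical harvest values with one missing entry (made up for illustration)
harvest <- c(10, NA, 25)

# Without na.rm, the missing value makes the maximum unknown
max(harvest)

# With na.rm = TRUE, missing values are dropped before taking the maximum
max(harvest, na.rm = TRUE)

# The comparison is NA for the missing entry; filter() only keeps rows
# where the condition is TRUE, so such a row is dropped
harvest == max(harvest, na.rm = TRUE)
```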

But the highest production of beans during the long rain season comes from a household in Kimbugu. In Lwala, the highest production is smaller than that, so it is not captured by our filter() command.

Using group_by() first, and then filter(), restricts the scope of BEANSHARVESTED_LR==max(BEANSHARVESTED_LR, na.rm=TRUE) to each village, so filter() retrieves the household with the highest production in each village.

BeanSurvey_ByVillage <- group_by(BeanSurvey, VILLAGE)
filter(BeanSurvey_ByVillage, BEANSHARVESTED_LR==max(BEANSHARVESTED_LR, na.rm=TRUE))
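The same logic applies to mutate() on grouped data. As a minimal sketch, assuming a small made-up data frame called toy_survey (hypothetical, not the real BeanSurvey), we can compare each household's land area with the average of its own village:

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_survey <- data.frame(
  VILLAGE  = c("Kimbugu", "Kimbugu", "Lwala", "Lwala"),
  LANDAREA = c(2, 6, 1, 3)
)

toy_by_village <- group_by(toy_survey, VILLAGE)

# On grouped data, mean(LANDAREA) is computed within each village,
# so every household is compared with its own village's average
toy_compared <- mutate(toy_by_village,
                       village_mean = mean(LANDAREA),
                       above_mean   = LANDAREA > village_mean)
toy_compared
```

Unlike summarise(), mutate() keeps one row per household. The grouping only changes which rows are used to calculate the mean.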

Question: Calculate the number of households and the average land area for each type of household composition (column HHTYPE). What do we seem to see?

BeanSurvey_ByHHType <- group_by(BeanSurvey, HHTYPE)
summarise(BeanSurvey_ByHHType, households=n(), averageArea=mean(LANDAREA))
# There are two main household compositions in the dataset: Female headed, no husband (13 households), and Male headed one wife (27 households)
# The households of the second type seem to have the largest land areas on average

We have learnt about the very useful group_by() function and we have a decent way to combine two manipulations together. So far, so good. But what if we wanted to perform more than two manipulations? What if, for example, we wanted to keep in our dataset only those households who grow beans and then calculate, for each village, the average yield per acre in the long rain season? It doesn't sound too complicated, but we need to use four functions:

filter() to get rid of the households that don't grow beans
mutate() to calculate the yield per acre of each household
group_by() to group our data by village
summarise() to calculate the average yield per acre by village.

And after each manipulation, we need to save the result as a new data frame that will be used as the input for the next function.

Remark: at any point in your code you can undo the grouping in your data by using the ungroup() function. You do not have to write anything inside the brackets if you wish to get rid of all groupings. Alternatively, you can list specific grouping variables if you wish to undo only some of the groupings.

This would do the job:

BeanSurvey_filtered <- filter(BeanSurvey, BEANS=="Yes")
BeanSurvey_mutated <- mutate(BeanSurvey_filtered, yield_per_acre = BEANSHARVESTED_LR/LANDAREA)
BeanSurvey_grouped_by <- group_by(BeanSurvey_mutated, VILLAGE)
summarise(BeanSurvey_grouped_by, households=n(), avg_yield_per_acre=mean(yield_per_acre, na.rm=TRUE))

The code above starts to be quite messy, with lots of intermediate data frames that we are not really interested in. One thing you may suggest to simplify our set of commands is to have only one intermediate data frame, which we overwrite. Something like this:

temp_data <- filter(BeanSurvey, BEANS=="Yes")
temp_data <- mutate(temp_data, yield_per_acre = BEANSHARVESTED_LR/LANDAREA)
temp_data <- group_by(temp_data, VILLAGE)
summarise(temp_data, households=n(), avg_yield_per_acre=mean(yield_per_acre, na.rm=TRUE))

It looks slightly simpler, maybe, and it shows you that when creating objects with <-, it makes no difference whether the name of the object is new or not. If it is not new, R will just overwrite the old object.

But this way of overwriting objects over and over is definitely not good practice, as in some situations you may end up losing valuable data. We don't need to use such an approach though. We can make our command much cleaner and more readable if we use the pipe operator!

pipe %>%

The symbol used for the pipe operator in R is %>%, that is, a greater-than sign > surrounded by two percent signs %. This operator is extremely useful because it makes it possible to perform a sequence of data manipulations using dplyr functions, without having to create any intermediate data frame. This is due to the consistent syntax of these dplyr functions, and in particular, the fact that their first argument is always the data frame that we want to manipulate.

What the pipe operator does is tell R:

take what's on my left, and make it the first argument of the next function on my right (or below me)

So if in the command thing1 %>% thing2, thing1 is a data frame and thing2 is a dplyr function, the pipe operator will ask R to make the data frame the first argument of the dplyr function. And R will happily perform the corresponding manipulation on the data frame since it results in a valid command.

BeanSurvey %>% filter(BEANS=="Yes")

In the above command, the pipe operator asks R to take what's on its left - the data frame BeanSurvey - and to make it the first argument of what's on its right - the function filter(). The command is therefore equivalent to

filter(BeanSurvey, BEANS=="Yes")
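If you want to convince yourself that the two forms really do the same thing, here is a minimal sketch. It assumes a small made-up data frame called toy_survey (hypothetical, standing in for BeanSurvey) and compares the two results with the base R function identical():

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_survey <- data.frame(
  ID    = 1:4,
  BEANS = c("Yes", "No", "Yes", "Yes")
)

# Same manipulation, written with and without the pipe
piped  <- toy_survey %>% filter(BEANS == "Yes")
nested <- filter(toy_survey, BEANS == "Yes")

# identical() checks whether the two results are exactly the same object
identical(piped, nested)
```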

Instead of placing the function filter() to the right of the pipe, we can, and usually should, place it below the pipe, with a little indentation, similar to what you do with the + in ggplot2. It's good practice for readability, and it doesn't change anything. R will see the pipe and look for the next command. This command happens to be below the pipe rather than on its right.

BeanSurvey %>% 
  filter(BEANS=="Yes")

What is great with pipes is that the "what's on my left" can itself be a command, as long as the result of that command is a data frame. So we can redo the last commands of the previous section, using pipes.

Our commands were:

BeanSurvey_filtered <- filter(BeanSurvey, BEANS=="Yes")
BeanSurvey_mutated <- mutate(BeanSurvey_filtered, yield_per_acre = BEANSHARVESTED_LR/LANDAREA)
BeanSurvey_grouped_by <- group_by(BeanSurvey_mutated, VILLAGE)
summarise(BeanSurvey_grouped_by, households=n(), avg_yield_per_acre=mean(yield_per_acre, na.rm=TRUE))

Using pipes it becomes:

BeanSurvey %>% 
  filter(BEANS=="Yes") %>%
    mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA) %>%
      group_by(VILLAGE) %>%
        summarise(households=n(), avg_yield_per_acre=mean(yield_per_acre, na.rm=TRUE))

We start with the dataset BeanSurvey. The pipe next to it makes it the first argument of the function filter() that follows. The next pipe makes the resulting data frame the first argument of the function mutate(). The next pipe takes the result of all of this and makes it the first argument of the next function, which is group_by(). And the last pipe makes the resulting data frame the first argument of the function summarise(). Here we go. We have a neat command that doesn't require the creation of intermediate data frames! Note that when using pipes, the output from the previous line always takes the place of the ‘data’ argument. So when writing commands with pipes, we skip straight to the second argument.

And that's where things start to be very interesting. Because with pipes, it is not a pain anymore to perform a long sequence of manipulations. So we can really start to have fun!

Question: Find the pipe equivalent of the command below

filter(BeanSurvey, OCCUHH!="Farmer")

BeanSurvey %>%
  filter(OCCUHH!="Farmer")

More pipes!

Pipes are great, but they take some time to get used to. So let's practice and learn a few more things along the way.

Something I'm wondering is whether the household composition varies by village and gender of household head. I didn't mention it earlier, but we can group by more than one column. We just need to list the corresponding columns within the group_by() function, separated with commas:

BeanSurvey %>%
  group_by(VILLAGE, GENDERHH) %>%
    summarise(households=n(), avg_adults = mean(ADULTS), avg_child = mean(CHILDREN))

In the command above, I placed a pipe right after the data frame BeanSurvey to tell R that this data frame should be the first argument of the group_by function below. And so the two first lines are grouping the BeanSurvey dataset by village and gender of the head of household. Then I placed a second pipe right after the group_by() function so that the resulting grouped data frame becomes the first argument of the summarise function, where we calculate the number of households and average number of adults and children.

You probably noticed that we have one weird row with an NA value in the column GENDERHH. This row corresponds to the household whose value for GENDERHH is missing. When grouping the data by gender of head of household, R has created an extra group because it doesn't know in which group this household with missing GENDERHH should be placed. I suggest we just remove this household from this analysis by using filter() at the beginning of our command. Let's see...

This was the command we used earlier to keep only the households for which the head of household's gender was missing:

filter(BeanSurvey, is.na(GENDERHH)==TRUE)

So conversely, to keep the rows that are not missing, we can simply change the TRUE into FALSE:

filter(BeanSurvey, is.na(GENDERHH)==FALSE)
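You will often see the same condition written with the negation operator ! instead of ==FALSE. Here is a minimal sketch on a small made-up data frame called toy_survey (hypothetical, not the real BeanSurvey), suggesting that the two ways of writing it keep the same rows:

```r
library(dplyr)

# Hypothetical toy data with one missing gender (values are made up)
toy_survey <- data.frame(
  ID       = 1:4,
  GENDERHH = c("female", NA, "male", "female")
)

# Two ways of keeping the rows where GENDERHH is not missing
with_comparison <- filter(toy_survey, is.na(GENDERHH) == FALSE)
with_negation   <- filter(toy_survey, !is.na(GENDERHH))

# Both keep the three households whose gender is recorded
identical(with_comparison, with_negation)
```

Either form works. We stick with ==FALSE in this tutorial.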

Let's now use a pipe to group the resulting data frame by VILLAGE and GENDERHH:

filter(BeanSurvey, is.na(GENDERHH)==FALSE) %>%
  group_by(VILLAGE, GENDERHH)

No change to the output, but we knew that since the effect of group_by() is invisible unless we add some other function. In our case, we want to add a summarise() function:

filter(BeanSurvey, is.na(GENDERHH)==FALSE) %>%
  group_by(VILLAGE, GENDERHH) %>%
    summarise(households=n(), avg_adults = mean(ADULTS), avg_child = mean(CHILDREN))

Yes! We got it!

To make our command even neater let's use a pipe between our data frame and the filter() function:

BeanSurvey %>%
  filter(is.na(GENDERHH)==FALSE) %>%
    group_by(VILLAGE, GENDERHH) %>%
      summarise(households=n(), avg_adults = mean(ADULTS), avg_child = mean(CHILDREN))

Of course, if we want to store the result of our full command into a data frame object for later use, we can do so in the usual way:

summary_data <- BeanSurvey %>%
  filter(is.na(GENDERHH)==FALSE) %>%
    group_by(VILLAGE, GENDERHH) %>%
      summarise(households=n(), avg_adults = mean(ADULTS), avg_child = mean(CHILDREN))

Nothing gets printed, but summary_data is saved as a data frame object, so after such a command, we could look at it by calling it by name.
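One thing worth knowing when storing such a result: after grouping by two columns, summarise() only removes the last level of grouping, so the stored data frame is still grouped by the first column. Here is a minimal sketch, assuming a small made-up data frame called toy_survey (hypothetical, standing in for BeanSurvey), that shows this and uses ungroup() to remove what is left:

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_survey <- data.frame(
  VILLAGE  = c("Kimbugu", "Kimbugu", "Lwala", "Lwala"),
  GENDERHH = c("female", "male", "female", "male"),
  ADULTS   = c(2, 3, 1, 4)
)

toy_summary <- toy_survey %>%
  group_by(VILLAGE, GENDERHH) %>%
  summarise(households = n(), avg_adults = mean(ADULTS))

# summarise() removes only the last grouping column, so VILLAGE remains
group_vars(toy_summary)

# ungroup() with empty brackets removes all remaining groupings
toy_ungrouped <- ungroup(toy_summary)
group_vars(toy_ungrouped)
```

In recent versions of dplyr (1.0.0 and later), summarise() prints a message about the remaining grouping, and you can add the argument .groups = "drop" inside summarise() to get an ungrouped result directly.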

But we don't necessarily need to store our result to combine it with other functions. We can often directly pipe it into these other functions, even if these functions are not part of dplyr. That's because pipe is so popular that lots of the most recent packages provide functions that are compatible with pipes. For example, as said in the video, the first argument of the ggplot() function is a data frame, so ggplot() is compatible with pipes!

You can review our tutorials on ggplot here starting with part 1

Let's make a bar chart from the result of our last command:

BeanSurvey %>%
  filter(is.na(GENDERHH)==FALSE) %>%
    group_by(VILLAGE, GENDERHH) %>%
      summarise(households=n(), avg_adults = mean(ADULTS), avg_child = mean(CHILDREN)) %>%
        ggplot(aes(x=VILLAGE, y=avg_adults, fill=GENDERHH))+
          geom_col(position = "dodge")

geom_col is similar to geom_bar, except that instead of the height of the bars being calculated from the frequencies of the categories in the data, in geom_col this height is directly defined by the y aesthetic. We added the argument position = "dodge" to place the bars side by side rather than stacked, as the latter would not make much sense in this case.

It's not the best graph ever, but it is still pretty neat, no? And did you notice how we didn't indicate our usual first argument in the ggplot() function? That's because the pipe operator told R to use the result of the summarise() function as the data frame for the graph!

Also note that we always write commands like the one above sequentially, one step at a time. Each individual step was easy, and we can feel satisfied with the end product, but trying to get there in one move would have been very difficult!

Question: Produce a boxplot of the household sizes for each village using pipes.

BeanSurvey %>%
  mutate(household_size=ADULTS + CHILDREN) %>%
    filter(household_size<16) %>%
      ggplot(aes(x=VILLAGE, y=household_size)) +
        geom_boxplot()
# the filter() line was not needed, but I decided to remove the extreme household
# to show you how easy and intuitive it is to add extra steps to a command using pipes

Once you are finished with the following quiz, feel free to take a look at our prepared exercises.

Alternatively, you can move on to part 1 of our tutorials on statistical analysis, starting with hypothesis testing.

Quiz

Question 1

Question 2

Question 3

Question 4

group_by(BeanSurvey, AGEHH)
summarise(BeanSurvey_grouped, avg_landArea = mean(LANDAREA))

Question 5

BeanSurvey %>%
  mutate(household_size=ADULTS+CHILDREN) %>%
    filter(OCCUHH=="Farmer" & is.na(household_size)==FALSE) %>%
      group_by(household_size) %>%
        summarise(n=n())

Appendix: 'BeanSurvey' dataset

The data we are using in this session is an extract of a survey conducted in Uganda from farmers identified as growing beans.

The dataset contains an extract of 50 responses to 23 of the survey questions, and has been imported to R as a data frame called BeanSurvey.

A summary of the columns in the dataset is below.

Column	Description
ID	Farmer ID
VILLAGE	Village name
HHTYPE	Household composition
GENDERHH	Gender of Household Head
AGEHH	Age of Household Head
OCCUHH	Occupation of Household Head
ADULTS	Number of Adults within the household
CHILDREN	Number of Children (<18) within the household
MATOKE	Do they grow matoke?
MAIZE	Do they grow maize?
BEANS	Do they grow beans?
BANANA	Do they grow banana?
CASSAVA	Do they grow cassava?
COFFEE	Do they grow coffee?
LANDAREA	Land area of farm (acres)
LABOR	Labor usage
INTERCROP	Intercrops with beans
DECISIONS	Household decision responsibility
SELLBEANS	Do they grow beans for sale?
BEANSPLANTED_LR	Quantity of beans planted in long rain season
BEANSPLANTED_SR	Quantity of beans planted in short rain season
BEANSHARVESTED_LR	Quantity of beans harvested in long rain season
BEANSHARVESTED_SR	Quantity of beans harvested in short rain season

Spend some time exploring the full dataset embedded below, to familiarise yourself with the columns and the type of data stored within each column. You may need to refer back to this data at times during this tutorial. Remember that R is case sensitive, so you will always have to refer to the variables in this dataset exactly as they are written in the data. There is a column in this data called "GENDERHH" but there is no column in this data called "GenderHH".


Appendix: Useful reference links

The official dplyr documentation: https://dplyr.tidyverse.org/

dplyr CheatSheet: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf

Data Manipulation Tools - Rstudio video: dplyr -- Pt 3 Intro to the Grammar of Data Manipulation with R

Some documentation on subsetting r-objects using base-R: https://bookdown.org/rdpeng/rprogdatascience/subsetting-r-objects.html