Manipulating Data using dplyr: Exercises and Solutions

Exercises

Remember when completing these exercises that nobody can recall every single piece of R code off the top of their head! If you get stuck, look back over the notes for similar examples, then see if you can copy and modify that code to meet the exercise objectives.

All of the exercises use the BeanSurvey data we have been working with so far. You can see a description of the data in the appendix to remind yourself.

You can review the tutorials (Part 1 and Part 2) at any time.

Exercise 1

Exercise

Exercise 1. Retrieve the households that grow banana.


Solution

Exercise 1. Retrieve the households that grow banana.

The exercise asks us to keep only the households that grow bananas. Therefore, we need to use the filter function which, as you'll remember, keeps only the rows that meet a given condition.

Remember that we have to use the double equals ==, as this asks R "is x equal to y?", whereas a single equals sign tells R that "x IS EQUAL TO y!". In a filter condition we always use the double ==.

filter(BeanSurvey, BANANA=="Yes")

We could also have used a pipe:

BeanSurvey %>%
  filter(BANANA=="Yes")
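If you want to try this without the BeanSurvey data loaded, here is a minimal sketch on a made-up data frame. The name toy_survey and its values are invented for illustration and are not part of the real dataset.

```r
library(dplyr)

# A tiny made-up data frame (hypothetical values, not the real BeanSurvey data)
toy_survey <- data.frame(
  ID = 1:4,
  BANANA = c("Yes", "No", "Yes", "No")
)

# == asks "is BANANA equal to 'Yes'?" for every row and answers TRUE or FALSE
toy_survey$BANANA == "Yes"

# filter() keeps only the rows where that answer is TRUE
banana_growers <- toy_survey %>%
  filter(BANANA == "Yes")

banana_growers
```

Printing the TRUE/FALSE vector first can be a handy way to check a condition before using it inside filter.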


Exercise 2

Exercise

Exercise 2. Identify and correct the four mistakes that I made in the command below, which is meant to obtain the median farm land area of the households in the BeanSurvey dataset whose head is a farmer.

BeanSurvey %>%
  filter(BeanSurvey, OCCUHH="Farmer") %>%
   sumarise(median_landArea = median(LANDAREA)


Solution

Exercise 2. Identify and correct the four mistakes that I made in the command below, which is meant to obtain the median farm land area of the households in the BeanSurvey dataset whose head is a farmer.

EXERCISE:
BeanSurvey %>%
  filter(BeanSurvey, OCCUHH="Farmer") %>%
   sumarise(median_landArea = median(LANDAREA)

We have a few mistakes to correct here so let's go one by one.

In our filter we have supplied BeanSurvey as the data argument, but we have already started our code chunk by piping the data into filter. We have told R twice what our dataset is. Remember that when piping, the first (data) argument is filled in implicitly with the result of the previous step. So either BeanSurvey should be removed from filter, or the first line should be deleted.

As mentioned in the previous exercise, the condition in the filter function needs a double equals, not a single equals.

The function summarise was misspelled. While R will recognise both the American summarize and the British summarise spellings, the rest of the function name still needs to be spelt correctly.

Finally, the parenthesis of the summarise function was not closed. This is often the easiest mistake to make.

BeanSurvey %>%
  filter(OCCUHH=="Farmer") %>%
   summarize(median_landArea = median(LANDAREA))
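To see why BeanSurvey must not be repeated inside filter when piping, it can help to compare the piped and non-piped versions side by side. The sketch below uses a made-up data frame (toy_survey and its values are hypothetical), so the number it produces says nothing about the real data.

```r
library(dplyr)

# Made-up data for illustration only
toy_survey <- data.frame(
  OCCUHH = c("Farmer", "Teacher", "Farmer", "Farmer"),
  LANDAREA = c(2, 1, 4, 3)
)

# Without a pipe: the data frame is written out as the first argument
without_pipe <- summarise(
  filter(toy_survey, OCCUHH == "Farmer"),
  median_landArea = median(LANDAREA)
)

# With a pipe: the result of each step becomes the first argument of the next,
# so the data frame is named only once, at the start
with_pipe <- toy_survey %>%
  filter(OCCUHH == "Farmer") %>%
  summarise(median_landArea = median(LANDAREA))

# Both approaches should give the same one-row result
all.equal(without_pipe, with_pipe)
```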


Exercise 3

Exercise

Exercise 3. Which are the 4 households that planted the largest quantity of beans during the short rain season?


Solution

Exercise 3. Which are the 4 households that planted the largest quantity of beans during the short rain season?

To approach this question, we first need to think about how to work out which households planted the most beans. It makes sense to sort the data on this variable, so that the rows are ordered by how much each household planted. Therefore we need to use the arrange function.

BeanSurvey %>% 
  arrange(BEANSPLANTED_SR)

However, you may remember that arrange sorts in ascending order by default, so the households with the lowest numbers would be at the top. We can correct this by wrapping the variable in the desc function.

BeanSurvey %>% 
  arrange(desc(BEANSPLANTED_SR))

But we still have all the rows. While we can see which are the top 4, let's go one step further and keep only these first 4. We can do this using slice, which keeps whichever row numbers we supply, in this case 1:4.

BeanSurvey %>% 
  arrange(desc(BEANSPLANTED_SR)) %>%
    slice(1:4)
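As an aside, recent versions of dplyr (1.0.0 onwards) also provide slice_max, which combines the sorting and slicing steps. The sketch below compares the two approaches on a made-up data frame; toy_survey and its values are hypothetical, and slice_max is offered as an alternative rather than as the method this tutorial expects.

```r
library(dplyr)

# Made-up data for illustration only, including one missing value
toy_survey <- data.frame(
  ID = 1:6,
  BEANSPLANTED_SR = c(5, 20, NA, 12, 8, 30)
)

# The approach from the solution: sort in descending order, keep rows 1 to 4.
# arrange() places missing values at the end, even when desc() is used.
top4_arrange <- toy_survey %>%
  arrange(desc(BEANSPLANTED_SR)) %>%
  slice(1:4)

# Alternative: ask directly for the 4 rows with the largest values
top4_slice_max <- toy_survey %>%
  slice_max(BEANSPLANTED_SR, n = 4)

top4_arrange
top4_slice_max
```

One difference to be aware of: by default slice_max keeps ties (with_ties = TRUE), so it can return more than 4 rows when several households share the 4th largest value, whereas slice(1:4) always returns exactly 4.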


Exercise 4

Exercise

Exercise 4. What are the mean, median and standard deviation of the bean yield per acre that households harvested during the long rain season in each village?

Exercise 4b. How would you produce the same summary statistics, but by village AND by gender of the head of household rather than by village only?


Solutions

Exercise 4. What are the mean, median and standard deviation of the bean yield per acre that households harvested during the long rain season in each village?

The yield of beans per acre is not a column that we have in our dataset, so we need to create it. Remember that mutate is the function we use for that. So we start with first creating a new column, we'll call it yield_per_acre, and set this to be equal to our harvested quantity (BEANSHARVESTED_LR) divided by our farm area (LANDAREA).

BeanSurvey %>% 
  mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA)

We then want summaries for each village so first we need to use group_by to make sure our results are calculated within each village. We pipe this into summarise and specify each of the summary statistics mentioned in the question. They each have straightforward function names of mean, median and sd (standard deviation).

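The argument na.rm = TRUE is required because there are missing values in the variable BEANSHARVESTED_LR, and therefore also in yield_per_acre. Without it, each summary would return NA for any village that contains a missing value.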

BeanSurvey %>% 
  mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA) %>%
    group_by(VILLAGE) %>%
      summarise(mean= mean(yield_per_acre, na.rm=TRUE),
                median=median(yield_per_acre, na.rm=TRUE),
                standard_deviation=sd(yield_per_acre, na.rm=TRUE))
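The effect of na.rm is easiest to see on a small vector. The values below are made up purely for illustration.

```r
# A made-up vector of yields with one missing value
yields <- c(10, NA, 30)

# With the default na.rm = FALSE, one missing value makes the result missing
mean(yields)

# With na.rm = TRUE the missing value is dropped before calculating
mean_yield   <- mean(yields, na.rm = TRUE)
median_yield <- median(yields, na.rm = TRUE)
sd_yield     <- sd(yields, na.rm = TRUE)

mean_yield
median_yield
sd_yield
```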

Another way to address the missing values would be to filter out the households with missing values in the first place, using filter and is.na, for example at the very beginning:

BeanSurvey %>% 
  filter(is.na(BEANSHARVESTED_LR)==FALSE) %>%
    mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA) %>%
      group_by(VILLAGE) %>%
        summarise(mean= mean(yield_per_acre),
                  median=median(yield_per_acre),
                  standard_deviation=sd(yield_per_acre))

Exercise 4b. How would you produce the same summary statistics, but by village AND by gender of the head of household rather than by village only?

We can start by copying the code from Exercise 4 and adding the column GENDERHH to the group_by function.

BeanSurvey %>% 
  filter(is.na(BEANSHARVESTED_LR)==FALSE) %>%
    mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA) %>%
      group_by(VILLAGE, GENDERHH) %>%
        summarise(mean= mean(yield_per_acre),
                  median=median(yield_per_acre),
                  standard_deviation=sd(yield_per_acre))

However, we have 5 rows when we expected just 4.

It seems that there is one row where GENDERHH is missing. It is probably better to remove this from the analysis too, so that we have a neater output.

BeanSurvey %>% 
  filter(is.na(BEANSHARVESTED_LR)==FALSE & is.na(GENDERHH)==FALSE) %>%
    mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA) %>%
      group_by(VILLAGE, GENDERHH) %>%
        summarise(mean= mean(yield_per_acre),
                  median=median(yield_per_acre),
                  standard_deviation=sd(yield_per_acre))
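The extra row appears because group_by treats a missing value as a group of its own. Here is a minimal sketch of that behaviour on made-up data: the village names "A" and "B", the gender labels and the yields are all hypothetical. It also uses !is.na(x), which means the same as is.na(x)==FALSE, and the optional .groups = "drop" argument of summarise, which returns an ungrouped result.

```r
library(dplyr)

# Made-up data for illustration only: one household has no gender recorded
toy_survey <- data.frame(
  VILLAGE = c("A", "A", "B", "B", "B"),
  GENDERHH = c("female", "male", "female", "male", NA),
  yield_per_acre = c(10, 20, 30, 40, 50)
)

# The missing value forms its own group, so we get one more row than expected
with_missing <- toy_survey %>%
  group_by(VILLAGE, GENDERHH) %>%
  summarise(mean = mean(yield_per_acre), .groups = "drop")

# Filtering out the missing gender first gives one row per village and gender
without_missing <- toy_survey %>%
  filter(!is.na(GENDERHH)) %>%
  group_by(VILLAGE, GENDERHH) %>%
  summarise(mean = mean(yield_per_acre), .groups = "drop")

nrow(with_missing)
nrow(without_missing)
```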


Exercise 5

Exercise

Exercise 5. Generate a scatterplot showing, for each household that has planted beans, the total quantity of beans planted against the land area of their farm. Colour the points by gender of the head of household.


Solution

We create a variable for the total quantity of beans planted using mutate and then pipe into ggplot. Remember that once you switch to ggplot you need to use + to add layers, not the %>% pipe. If you do use the pipe by mistake, R will generate a helpful error message.

BeanSurvey %>%
  mutate(total_beans_planted = BEANSPLANTED_LR + BEANSPLANTED_SR)%>%
      ggplot(aes(x=LANDAREA, y=total_beans_planted, colour=GENDERHH))+
        geom_point()

But of course we still have that missing value in GENDERHH. It has been added to our colour scheme as a grey dot and is included in our key.

We can decide whether this is an issue or not. Here I decide to remove it, so I add a filter to get rid of the missing value.

BeanSurvey %>%
  mutate(total_beans_planted = BEANSPLANTED_LR + BEANSPLANTED_SR)%>%
    filter(is.na(GENDERHH)==FALSE) %>%
      ggplot(aes(x=LANDAREA, y=total_beans_planted, colour=GENDERHH))+
        geom_point()
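One thing worth checking before plotting is how the total behaves when a household has a missing value in one of the two seasons. Adding a number to NA gives NA, and geom_point leaves out rows with a missing x or y value (with a warning). The sketch below shows the first part on made-up data; toy_survey and its values are hypothetical, and whether this situation arises in BeanSurvey is something you would need to check in the data itself.

```r
library(dplyr)

# Made-up data for illustration only
toy_survey <- data.frame(
  ID = 1:3,
  BEANSPLANTED_LR = c(10, NA, 4),
  BEANSPLANTED_SR = c(5, 8, NA)
)

# The total is missing whenever either season is missing
toy_totals <- toy_survey %>%
  mutate(total_beans_planted = BEANSPLANTED_LR + BEANSPLANTED_SR)

toy_totals

# Count how many households would have no total to plot
sum(is.na(toy_totals$total_beans_planted))
```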


Exercise 6

Exercise

Exercise 6. Generate a boxplot of the quantity of beans harvested during the long rain season by type of household composition, keeping only the two main household composition types. Apply an appropriate 'scale' transformation to the quantity of beans harvested.


Solution

This is a follow-up to a question from the ggplot exercises you may have previously worked on.

We first simply use filter to keep only the households whose composition type is either "Female headed, no husband" or "Male headed one wife". Remember we can use the vertical line | to mean OR in a filter.

We then pipe this into our ggplot using both geom_boxplot and geom_point to show all our data.

BeanSurvey %>%
  filter(HHTYPE=="Female headed, no husband" | HHTYPE=="Male headed one wife") %>%
    ggplot(aes(y=BEANSHARVESTED_LR, x=HHTYPE))+
      geom_boxplot()+
        geom_point()
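When a filter has several OR conditions on the same column, the %in% operator is a common shorthand. The sketch below shows that both forms keep the same rows on a made-up data frame; toy_survey, its IDs and the "Other type" label are hypothetical.

```r
library(dplyr)

# Made-up data for illustration only
toy_survey <- data.frame(
  ID = 1:5,
  HHTYPE = c("Female headed, no husband", "Male headed one wife", "Other type",
             "Male headed one wife", "Female headed, no husband")
)

# Using | (OR) between two conditions, as in the solution
with_or <- toy_survey %>%
  filter(HHTYPE == "Female headed, no husband" | HHTYPE == "Male headed one wife")

# Using %in% with a vector of the values we want to keep
main_types <- c("Female headed, no husband", "Male headed one wife")
with_in <- toy_survey %>%
  filter(HHTYPE %in% main_types)

# Both filters should keep the same households
all.equal(with_or, with_in)
```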

Finally, the question asks that we include some kind of transformation to our y axis scale. Therefore we need to use one of the scale_ functions. In this case, we want to edit the y axis, which is a continuous variable. So we need scale_y_continuous.

If you search the web or look at the help page for scale_y_continuous, you will see that one of the arguments is trans =. This can take a number of built-in transformations, including several log transformations, reverse, square root, etc. Feel free to experiment to see which ones work and which ones don't. (In recent versions of ggplot2 this argument is named transform; trans still works but is deprecated.)

In our example, "pseudo_log" seems like a good option, as it gives more space to the numbers at the lower end of the scale, so the outliers right at the top take up less room in the plotting space.

BeanSurvey %>%
  filter(HHTYPE=="Female headed, no husband" | HHTYPE=="Male headed one wife") %>%
    ggplot(aes(y=BEANSHARVESTED_LR, x=HHTYPE))+
      geom_boxplot()+
        geom_point()+
          scale_y_continuous(trans="pseudo_log")
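To get a feel for why "pseudo_log" copes with small values, we can compare it with an ordinary log on a few made-up numbers. The sketch below assumes that, with its default settings, the pseudo-log transformation is asinh(x / 2); this is our understanding of the scales package default rather than something stated in this tutorial, so treat the helper pseudo_log below as a hypothetical stand-in.

```r
# Assumed form of the default pseudo-log transformation (see note above)
pseudo_log <- function(x) asinh(x / 2)

# Made-up quantities, including a household that harvested nothing
quantities <- c(0, 1, 10, 100, 1000)

# An ordinary log sends 0 to -Inf, which cannot be placed on the axis
plain <- log(quantities)

# The pseudo-log keeps 0 at 0 and behaves like log() for large values
pseudo <- pseudo_log(quantities)

data.frame(quantities, plain, pseudo)
```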

Appendix: 'BeanSurvey' dataset

The data we are using in this session is an extract of a survey conducted in Uganda from farmers identified as growing beans.

The dataset contains an extract of 50 responses to 23 of the survey questions, and has been imported to R as a data frame called BeanSurvey.

A summary of the columns in the dataset is below.

Column	Description
ID	Farmer ID
VILLAGE	Village name
HHTYPE	Household composition
GENDERHH	Gender of Household Head
AGEHH	Age of Household Head
OCCUHH	Occupation of Household Head
ADULTS	Number of Adults within the household
CHILDREN	Number of Children (<18) within the household
MATOKE	Do they grow matoke?
MAIZE	Do they grow maize?
BEANS	Do they grow beans?
BANANA	Do they grow banana?
CASSAVA	Do they grow cassava?
COFFEE	Do they grow coffee?
LANDAREA	Land area of farm (acres)
LABOR	Labor usage
INTERCROP	Intercrops with beans
DECISIONS	Household decision responsibility
SELLBEANS	Do they grow beans for sale?
BEANSPLANTED_LR	Quantity of beans planted in long rain season
BEANSPLANTED_SR	Quantity of beans planted in short rain season
BEANSHARVESTED_LR	Quantity of beans harvested in long rain season
BEANSHARVESTED_SR	Quantity of beans harvested in short rain season

Spend some time exploring the full dataset to familiarise yourself with the columns and the type of data stored within each column. You may need to refer back to this data at times during this tutorial. Remember that R is case sensitive, so you will always have to refer to the variables in this dataset exactly as they are written in the data. There is a column in this data called "GENDERHH" but there is no column called "GenderHH".
