
Exercises

Remember when completing these exercises that nobody can recall every single piece of R code off the top of their head! If you get stuck, look back over the notes for similar examples, then work out how to copy and modify that code to meet the exercise objectives.

You can review the first tutorial here

All of the exercises are using the BeanSurvey data we have been working with so far. You can see a description of the data in the appendix to remind yourself.

Once you have completed these exercises, you can move onto our second data visualisation tutorial

Exercise 1. Replace each of the instances of "ZZZ" from the code below to produce a bar chart showing the frequencies of the household head gender variable GENDERHH

ggplot(data=ZZZ, aes(x=ZZZ)) + 
  geom_ZZZ()

Exercise 2. Identify and fix the error(s) in this code to produce a boxplot of age of household head by village

ggplot2(data = BEANSURVEY,  aes(x = AgeHH, y = Village) 
  geom_box()

Exercise 3. Produce a histogram of land area. Set the bins so that each bin covers a range of 1 acre

Exercise 4. Make a scatter plot showing the number of adults on the x axis against the number of children on the y axis. Can you see any limitations to this plot?


Solutions

Exercise 1. Replace each of the instances of "ZZZ" from the code below to produce a bar chart showing the frequencies of the household head gender variable GENDERHH

# EXERCISE:
ggplot(data=ZZZ, aes(x=ZZZ)) +
  geom_ZZZ()
# SOLUTION
ggplot(BeanSurvey, aes(x=GENDERHH)) + 
  geom_bar()

To solve the question we need to replace three ZZZs. The first becomes the name of the data, BeanSurvey, and the second becomes the name of the variable within the mapping, GENDERHH. Be very careful to spell these exactly as they are written, including upper and lower case.

The third ZZZ sets the correct geom_. To show a bar chart of frequencies we use geom_bar().
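To see what "frequencies" means here, the following is a minimal sketch using a small made-up data frame (toy_survey and its values are hypothetical stand-ins, not the real BeanSurvey data). It assumes the ggplot2 package is installed, and compares the counts from base R's table() with the counts that geom_bar() works out for the bars.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: one column, made-up values
toy_survey <- data.frame(
  GENDERHH = c("female", "male", "male", "female", "male")
)

# Base R: number of rows in each category
gender_counts <- table(toy_survey$GENDERHH)
print(gender_counts)

# geom_bar() counts the rows itself, so we only map x
bar_plot <- ggplot(toy_survey, aes(x = GENDERHH)) +
  geom_bar()

# layer_data() returns the values ggplot2 computed for the layer
bar_counts <- layer_data(bar_plot)$count
print(bar_counts)
```

The two sets of counts should agree, which is why no y variable is needed for a bar chart of frequencies.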

Exercise 2. Identify and fix the error(s) in this code to produce a boxplot of age of household head by village

# EXERCISE:
ggplot2(data = BEANSURVEY,  aes(x = AgeHH, y = Village) 
  geom_box()
# SOLUTION
ggplot(data = BeanSurvey,  aes(x = AGEHH, y = VILLAGE)) +
  geom_boxplot()

There are several errors:

  • There is no + at the end of the first line. Remember this is how we connect the different layers of our plotting code to build up the different aspects of our graph.
  • In the first line there are two brackets opened but only one is closed. This is a very common mistake, and it gets harder to spot as the number of brackets grows. If you are working in RStudio, there are measures to assist with this, as R is good at detecting when there is an open bracket, though not at telling you where you need to close it. So always be vigilant.
  • The function is ggplot, not ggplot2. (ggplot2 is the name of the package.)
  • The data has been entered with the incorrect case: BeanSurvey, not BEANSURVEY
  • The variables have been entered with incorrect cases: AGEHH, not AgeHH; VILLAGE, not Village
  • The correct function is geom_boxplot(), not geom_box()
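One way to catch the case errors before plotting is to check names against the data directly. This is a minimal sketch in base R; toy_survey and its values are hypothetical stand-ins for BeanSurvey, used only so the example runs on its own.

```r
# Hypothetical stand-in for BeanSurvey with two of its columns (made-up values)
toy_survey <- data.frame(
  AGEHH   = c(34, 51, 47),
  VILLAGE = c("A", "B", "A")
)

# List the column names exactly as R stores them
print(names(toy_survey))

# Test whether a given spelling exists as a column
has_upper <- "AGEHH" %in% names(toy_survey)
has_mixed <- "AgeHH" %in% names(toy_survey)
print(has_upper)
print(has_mixed)

# exists() does the same job for object names such as the data frame itself
print(exists("toy_survey"))
print(exists("TOY_SURVEY"))
```

With the real data you would run names(BeanSurvey) and copy the spellings from the output.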

Exercise 3. Produce a histogram of land area. Set the bins so that each bin covers a range of 1 acre

We first need to convert the question into the different ggplot elements:

  • data-> BeanSurvey
  • aesthetics->x->LANDAREA
  • geom->histogram
ggplot(BeanSurvey,aes(x=LANDAREA))+
  geom_histogram()

As you can see, we first get a message with our plot telling us that creating 30 bins is the default setting for a histogram.

We can then modify the plot so that each bin covers a range of exactly 1 acre.

We need to find the option binwidth from within the geom_histogram function. Remember you can always use the R help pages to look into the arguments that any function can use.

ggplot(BeanSurvey,aes(x=LANDAREA))+
  geom_histogram(binwidth = 1)
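If you want to check which ranges the bins actually cover, one option is to look at the values ggplot2 computed. The sketch below is not part of the exercise solution: toy_land is a made-up data frame, and boundary = 0 is an extra argument of geom_histogram() added here on the assumption that you want bin edges to sit on whole acres (0 to 1, 1 to 2, and so on).

```r
library(ggplot2)

# Hypothetical land areas in acres (made-up values, not the real BeanSurvey data)
toy_land <- data.frame(LANDAREA = c(0.5, 1, 1.5, 2, 2, 3, 4.5))

# binwidth = 1 makes each bin 1 acre wide;
# boundary = 0 asks for a bin edge at 0, so the edges fall on whole numbers
hist_plot <- ggplot(toy_land, aes(x = LANDAREA)) +
  geom_histogram(binwidth = 1, boundary = 0)

# Lower edge, upper edge and count for each bin
bins <- layer_data(hist_plot)[, c("xmin", "xmax", "count")]
print(bins)
```

Each row of bins is one bar, so you can confirm that every bar spans 1 acre and that the counts add up to the number of rows in the data.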

Exercise 4. Make a scatter plot showing the number of adults on the x axis against the number of children on the y axis. Can you see any limitations to this plot?

Again, let's try to think about the different aspects of building this plot based on the question:

  • data-> BeanSurvey
  • aesthetics->x->ADULTS
  • aesthetics->y->CHILDREN
  • geom->point
ggplot(BeanSurvey,aes(x=ADULTS,y=CHILDREN))+
  geom_point()

A limitation here is that many observations overlap. For example, lots of households in our data have 2 adults and 3 children, but we only see one point for all of them. This is an example of overplotting: because both variables take only a small range of whole-number values, points stack on top of one another and much of the information is hidden.
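A quick way to gauge how much is hidden is to compare the number of rows with the number of distinct x and y combinations. This is a minimal base R sketch; toy_households is a hypothetical data frame with made-up values, standing in for BeanSurvey.

```r
# Hypothetical households (made-up values, not the real BeanSurvey data)
toy_households <- data.frame(
  ADULTS   = c(2, 2, 2, 1, 3, 2),
  CHILDREN = c(3, 3, 3, 2, 1, 3)
)

# Number of observations we are trying to plot
n_rows <- nrow(toy_households)

# Number of distinct positions a scatter plot can actually show
n_distinct <- nrow(unique(toy_households))

print(n_rows)
print(n_distinct)

# Cross-tabulation: how many households share each combination
print(table(toy_households$ADULTS, toy_households$CHILDREN))
```

When the number of distinct positions is much smaller than the number of rows, a plain scatter plot is hiding most of the observations.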

This could be a case where we might want to do a jitter plot instead.

ggplot(BeanSurvey,aes(x=ADULTS,y=CHILDREN))+
  geom_jitter(width=0.1,height=0.1)

The width and height arguments control how far each point is allowed to move horizontally and vertically from its true value.
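To make this concrete, the sketch below compares the jittered positions with the original values. It assumes the ggplot2 package is installed and uses a made-up data frame (toy_households is hypothetical, not the real BeanSurvey data). The jitter is applied in both the positive and negative direction, so width = 0.1 means a point can move up to 0.1 to the left or to the right.

```r
library(ggplot2)

# Hypothetical households (made-up values, not the real BeanSurvey data)
toy_households <- data.frame(
  ADULTS   = c(2, 2, 2, 1, 3),
  CHILDREN = c(3, 3, 3, 2, 1)
)

jitter_plot <- ggplot(toy_households, aes(x = ADULTS, y = CHILDREN)) +
  geom_jitter(width = 0.1, height = 0.1)

# layer_data() returns the positions after the random jitter has been applied
shifted <- layer_data(jitter_plot)

# Distance each point has moved from its true value
x_shift <- shifted$x - toy_households$ADULTS
y_shift <- shifted$y - toy_households$CHILDREN
print(range(x_shift))
print(range(y_shift))
```

The shifts are random, so the exact numbers change each time the code is run, but they should always stay within 0.1 of zero in each direction.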

Think about the units you are working with when using a jitter plot. As you can see below, if I set the width and height too high then the points become too dispersed, and it is ultimately impossible to tell which true value each point belongs to.

ggplot(BeanSurvey,aes(x=ADULTS,y=CHILDREN))+
  geom_jitter(width=0.5,height=0.5)

Appendix: 'BeanSurvey' dataset

The data we are using in this session is an extract of a survey conducted in Uganda from farmers identified as growing beans.

The dataset contains an extract of 50 responses to 23 of the survey questions, and has been imported to R as a data frame called BeanSurvey.

A summary of the columns in the dataset is below.

Column             Description
ID                 Farmer ID
VILLAGE            Village name
HHTYPE             Household composition
GENDERHH           Gender of Household Head
AGEHH              Age of Household Head
OCCUHH             Occupation of Household Head
ADULTS             Number of Adults within the household
CHILDREN           Number of Children (<18) within the household
MATOKE             Do they grow matoke?
MAIZE              Do they grow maize?
BEANS              Do they grow beans?
BANANA             Do they grow banana?
CASSAVA            Do they grow cassava?
COFFEE             Do they grow coffee?
LANDAREA           Land area of farm (acres)
LABOR              Labor usage
INTERCROP          Intercrops with beans
DECISIONS          Household decision responsibility
SELLBEANS          Do they grow beans for sale?
BEANSPLANTED_LR    Quantity of beans planted in long rain season
BEANSPLANTED_SR    Quantity of beans planted in short rain season
BEANSHARVESTED_LR  Quantity of beans harvested in long rain season
BEANSHARVESTED_SR  Quantity of beans harvested in short rain season

Spend some time exploring the full dataset to familiarise yourself with the columns and the type of data stored within each column. You may need to refer back to this data at times during this tutorial. Remember that R is case sensitive, so you will always have to refer to the variables in this dataset exactly as they are written in the data. There is a column in this data called "GENDERHH" but there is no column in this data called "GenderHH".

Making Graphs in R Using ggplot2 - Exercises & Solutions