Making Graphs in R Using ggplot2 - Exercises & Solutions

Exercises

Remember when completing these exercises that nobody can remember every single piece of R code off the top of their head! If you get stuck, look back over the notes for a similar example and see whether you can copy and then modify that code to meet the exercise objectives.

You can review the first tutorial here

All of the exercises use the BeanSurvey data we have been working with so far. You can see a description of the data in the appendix to remind yourself.

Once you have completed these exercises, you can move onto our second data visualisation tutorial

Exercise 1. Replace each instance of "ZZZ" in the code below to produce a bar chart showing the frequencies of the household head gender variable GENDERHH

ggplot(data=ZZZ, aes(x=ZZZ)) + 
  geom_ZZZ()

Exercise 2. Identify and fix the error(s) in this code to produce a boxplot of age of household head by village

ggplot2(data = BEANSURVEY,  aes(x = AgeHH, y = Village) 
  geom_box()

Exercise 3. Produce a histogram of land area. Set the bins so that each bin covers a range of 1 acre

Exercise 4. Make a scatter plot showing the number of adults on the x axis against the number of children on the y axis. Can you see any limitations to this plot?


Solutions

Exercise 1. Replace each instance of "ZZZ" in the code below to produce a bar chart showing the frequencies of the household head gender variable GENDERHH

# EXERCISE
ggplot(data=ZZZ, aes(x=ZZZ)) +
  geom_ZZZ()

# SOLUTION
ggplot(data=BeanSurvey, aes(x=GENDERHH)) +
  geom_bar()

To solve the question we need to replace three ZZZs: the first with the name of the data, BeanSurvey; the second with the name of the variable within the mapping, GENDERHH; and the third with the type of geom. Be very careful to spell the names exactly as they appear in the data, including upper and lower case.

For the geom_, to show a bar chart of frequencies we use geom_bar.
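
As a quick sanity check on what the bar chart shows: the bar heights drawn by geom_bar are simply the number of rows in each category, which table() reports directly. The sketch below uses a small made-up data frame (toy_survey, with invented values) rather than the real BeanSurvey data, so the counts are purely illustrative.

```r
# Toy stand-in for BeanSurvey; these values are invented for illustration only
toy_survey <- data.frame(
  GENDERHH = c("male", "female", "male", "male", "female")
)

# geom_bar() counts the rows in each category of x;
# table() gives the same frequencies as plain numbers
gender_counts <- table(toy_survey$GENDERHH)
print(gender_counts)
```

Running the same table() call on the real data is a handy way to confirm that the bars in your chart have the heights you expect.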

Exercise 2. Identify and fix the error(s) in this code to produce a boxplot of age of household head by village

# EXERCISE
ggplot2(data = BEANSURVEY,  aes(x = AgeHH, y = Village) 
  geom_box()

# SOLUTION
ggplot(data = BeanSurvey,  aes(x = AGEHH, y = VILLAGE)) +
  geom_boxplot()

There are lots of errors:

There is no + at the end of the first line. Remember, this is how we connect the different layers of our plotting code to build up the different aspects of our graph.
In the first line there are two brackets opened but only one is closed. This is a very common mistake, as it can be difficult to spot once you have several brackets. If you are working in RStudio instead, there are features to assist with this; R is good at detecting when there is an open bracket, though not at telling you where you need to close it. So always be vigilant.
The function is ggplot, not ggplot2.
The data has been entered with the incorrect case: BeanSurvey, not BEANSURVEY.
The variables have been entered with incorrect cases: AGEHH, not AgeHH; VILLAGE, not Village.
The correct function is geom_boxplot(), not geom_box().
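
Unbalanced brackets are the one error in this list that R can detect without running anything, because the code cannot even be read in. As a minimal sketch, the helper below (parses_ok, a hypothetical name that is not part of R or ggplot2) uses the base function parse() to test whether a piece of text is syntactically complete. Note that it only checks the syntax: it would not catch ggplot2 instead of ggplot, or a misspelt variable name.

```r
# Hypothetical helper (not part of ggplot2): TRUE if the text parses as R code
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)   # only reads the code, does not run it
    TRUE
  }, error = function(e) FALSE)
}

broken <- "ggplot(data = BeanSurvey, aes(x = AGEHH, y = VILLAGE) + geom_boxplot()"
fixed  <- "ggplot(data = BeanSurvey, aes(x = AGEHH, y = VILLAGE)) + geom_boxplot()"

parses_ok(broken)  # FALSE: one bracket is never closed
parses_ok(fixed)   # TRUE
```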

Exercise 3. Produce a histogram of land area. Set the bins so that each bin covers a range of 1 acre

We first need to convert the question into the different ggplot elements:

data-> BeanSurvey
aesthetics->x->LANDAREA
geom->histogram

ggplot(BeanSurvey,aes(x=LANDAREA))+
  geom_histogram()

As you can see, we get a message with our plot telling us that 30 bins is the default setting for a histogram.

We can then modify the plot so that each bin covers a range of exactly 1 acre.

We need to find the option binwidth from within the geom_histogram function. Remember you can always use the R help pages to look into the arguments that any function can use.

ggplot(BeanSurvey,aes(x=LANDAREA))+
  geom_histogram(binwidth = 1)
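
One detail worth checking is where the 1-acre bins start and end. The sketch below is an optional extension, assuming a toy data frame of invented land areas (toy_land, not the BeanSurvey data). It adds the boundary argument of geom_histogram so that the bin edges sit on whole numbers of acres, and uses layer_data() from ggplot2 to look at the bins that were actually computed.

```r
library(ggplot2)

# Invented land areas in acres, for illustration only
toy_land <- data.frame(LANDAREA = c(0.5, 1.2, 1.8, 2.5, 2.5, 3.1, 4.7))

# binwidth = 1 makes every bin 1 acre wide;
# boundary = 0 asks for bin edges at 0, 1, 2, ... acres
p <- ggplot(toy_land, aes(x = LANDAREA)) +
  geom_histogram(binwidth = 1, boundary = 0)

# Inspect the bins ggplot2 computed for the first (and only) layer
bins <- layer_data(p, 1)
bins[, c("xmin", "xmax", "count")]
```

Whether you want the edges on whole numbers or the bins centred on them is a presentation choice; the point is that it is worth looking at the computed bins rather than assuming.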

Exercise 4. Make a scatter plot showing the number of adults on the x axis against the number of children on the y axis. Can you see any limitations to this plot?

Again, let's think about the different elements needed to build this plot, based on the question:

data-> BeanSurvey
aesthetics->x->ADULTS
aesthetics->y->CHILDREN
geom->point

ggplot(BeanSurvey,aes(x=ADULTS,y=CHILDREN))+
  geom_point()

A limitation here is that we have lots of overlapping observations. For example, lots of households in our data have 2 adults and 3 children, but we only see one point. This is an example of overplotting: because there is only a small range of possible values, points stack on top of one another and much of the information is hidden.
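
To see how much is hidden, we can count how many rows share the same pair of values. This is a minimal sketch using a made-up data frame (toy_households, with invented values) rather than the BeanSurvey data; the same functions should work on any data frame containing just the two plotted columns.

```r
# Invented household sizes, for illustration only (not the BeanSurvey data)
toy_households <- data.frame(
  ADULTS   = c(2, 2, 2, 3, 3, 1),
  CHILDREN = c(3, 3, 3, 1, 1, 0)
)

# Number of distinct (ADULTS, CHILDREN) pairs = number of visible points
n_visible <- nrow(unique(toy_households))

# Rows that repeat an earlier pair are hidden behind another point
n_hidden <- sum(duplicated(toy_households))

# How many households share each combination that occurs
pair_counts <- as.data.frame(table(toy_households))
pair_counts[pair_counts$Freq > 0, ]
```

In this toy example six households produce only three visible points, so half of the observations cannot be seen in a plain scatter plot.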

This could be a case where we might want to do a jitter plot instead.

ggplot(BeanSurvey,aes(x=ADULTS,y=CHILDREN))+
  geom_jitter(width=0.1,height=0.1)

The width and height arguments control how far the points are allowed to jitter away from their true value, horizontally and vertically.

It is best to think about the units you are working with when using a jitter plot. As you can see below, if I set the width and height too high then the points become too dispersed, and it is ultimately impossible to tell which value each point originally belonged to.

ggplot(BeanSurvey,aes(x=ADULTS,y=CHILDREN))+
  geom_jitter(width=0.5,height=0.5)
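
The effect of the two settings can be checked with numbers as well as by eye. The sketch below uses the base R function jitter() on invented values, on the assumption that geom_jitter behaves in a comparable way, moving each point by a random amount of up to width (or height) in either direction. With whole-number data, an amount of 0.1 leaves clear space between neighbouring values, while 0.5 lets the clusters run into each other.

```r
set.seed(1)  # make the random jitter reproducible

# Invented values: 100 points at 2 and 100 points at 3
x <- rep(c(2, 3), each = 100)

small <- jitter(x, amount = 0.1)  # each point moves by at most 0.1
large <- jitter(x, amount = 0.5)  # each point moves by at most 0.5

# Empty space left between the cluster around 2 and the cluster around 3
gap_small <- min(small[x == 3]) - max(small[x == 2])
gap_large <- min(large[x == 3]) - max(large[x == 2])

gap_small  # at least 0.8, so the two clusters stay clearly separated
gap_large  # much smaller, so the two clusters are hard to tell apart
```

A reasonable rule of thumb that follows from this is to keep the jitter well below half the distance between neighbouring values in your data.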

Appendix: 'BeanSurvey' dataset

The data we are using in this session is an extract of a survey conducted in Uganda from farmers identified as growing beans.

The dataset contains an extract of 50 responses to 23 of the survey questions, and has been imported to R as a data frame called BeanSurvey.

A summary of the columns in the dataset is below.

Column	Description
ID	Farmer ID
VILLAGE	Village name
HHTYPE	Household composition
GENDERHH	Gender of Household Head
AGEHH	Age of Household Head
OCCUHH	Occupation of Household Head
ADULTS	Number of Adults within the household
CHILDREN	Number of Children (<18) within the household
MATOKE	Do they grow matoke?
MAIZE	Do they grow maize?
BEANS	Do they grow beans?
BANANA	Do they grow banana?
CASSAVA	Do they grow cassava?
COFFEE	Do they grow coffee?
LANDAREA	Land area of farm (acres)
LABOR	Labor usage
INTERCROP	Intercrops with beans
DECISIONS	Household decision responsibility
SELLBEANS	Do they grow beans for sale?
BEANSPLANTED_LR	Quantity of beans planted in long rain season
BEANSPLANTED_SR	Quantity of beans planted in short rain season
BEANSHARVESTED_LR	Quantity of beans harvested in long rain season
BEANSHARVESTED_SR	Quantity of beans harvested in short rain season

Spend some time exploring the full dataset to familiarise yourself with the columns and the type of data stored within each column. You may need to refer back to this data at times during this tutorial. Remember that R is case sensitive, so you will always have to refer to the variables in this dataset exactly as they are written in the data. There is a column in this data called "GENDERHH" but there is no column in this data called "GenderHH".
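
If you are ever unsure how a column name is written, you can ask R directly. The sketch below uses a toy data frame (toy_survey) that borrows two of the BeanSurvey column names but contains invented values, and a hypothetical helper, find_column, which is not part of R and simply looks a name up while ignoring case.

```r
# Toy data frame with two of the BeanSurvey column names; values are invented
toy_survey <- data.frame(GENDERHH = c("male", "female"), AGEHH = c(45, 38))

# Column names must match exactly, including upper and lower case
"GENDERHH" %in% names(toy_survey)  # TRUE
"GenderHH" %in% names(toy_survey)  # FALSE: different case, so a different name

# Hypothetical helper: find the correctly-cased column name, ignoring case
find_column <- function(data, name) {
  names(data)[toupper(names(data)) == toupper(name)]
}

find_column(toy_survey, "GenderHH")  # "GENDERHH"
```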