Exercises
Remember, when completing these exercises, that nobody can remember every single piece of R code off the top of their head! If you get stuck, look back over the notes to find similar examples, then see if you can copy and modify that code to meet the exercise objectives.
You can review the first tutorial here
All of the exercises use the BeanSurvey data we have been working with so far. You can see a description of the data in the appendix to remind yourself.
Once you have completed these exercises, you can move on to our second data visualisation tutorial
Exercise 1. Replace each of the instances of "ZZZ" from the code below to produce a bar chart showing the frequencies of the household head gender variable GENDERHH
ggplot(data=ZZZ, aes(x=ZZZ)) +
geom_ZZZ()
Exercise 2. Identify and fix the error(s) in this code to produce a boxplot of age of household head by village
ggplot2(data = BEANSURVEY, aes(x = AgeHH, y = Village)
geom_box()
Exercise 3. Produce a histogram of land area. Set the bins so that each bin covers a range of 1 acre
Exercise 4. Make a scatter plot showing the number of adults on the x axis against the number of children on the y axis. Can you see any limitations to this plot?
Solutions
Exercise 1. Replace each of the instances of "ZZZ" from the code below to produce a bar chart showing the frequencies of the household head gender variable GENDERHH
# EXERCISE:
ggplot(data=ZZZ, aes(x=ZZZ)) +
geom_ZZZ()
# SOLUTION
ggplot(BeanSurvey, aes(x=GENDERHH)) +
geom_bar()
To solve the question we need to replace the three ZZZs: first with the name of the data, BeanSurvey, and then with the name of the variable within the mapping, GENDERHH. Be very careful to spell these exactly as written, including upper and lower case.
We also need to set the correct geom_. In this case, to show a bar chart of frequencies, we use geom_bar.
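One way to check that the bar chart is showing what you expect is to compare it against a plain frequency table. The sketch below is only illustrative: `demo_survey` is a hypothetical stand-in with made-up values (only the column name GENDERHH comes from the real data), so swap in BeanSurvey when running it in the tutorial.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: made-up values, real column name
demo_survey <- data.frame(
  GENDERHH = c("female", "male", "male", "female", "male", "male")
)

# Frequency table: these counts are what geom_bar() draws as bar heights
gender_counts <- table(demo_survey$GENDERHH)
print(gender_counts)

# Same plot as the solution, using the stand-in data
p <- ggplot(demo_survey, aes(x = GENDERHH)) +
  geom_bar()

# layer_data() returns the values ggplot2 computed for the first layer
bar_heights <- layer_data(p)$count
print(bar_heights)
```

If the table and the bar heights disagree, the most likely cause is that a different variable has been mapped to x.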
Exercise 2. Identify and fix the error(s) in this code to produce a boxplot of age of household head by village
# EXERCISE:
ggplot2(data = BEANSURVEY, aes(x = AgeHH, y = Village)
geom_box()
# SOLUTION
ggplot(data = BeanSurvey, aes(x = AGEHH, y = VILLAGE)) +
geom_boxplot()
There are lots of errors!
- There is no + at the end of the first line. Remember this is how we connect the different layers of our plotting code to build up the different aspects of our graph.
- In the first line there are two brackets opened but only one is closed. This is a very common mistake, as it can be difficult to spot once you have several brackets. If you are working in RStudio instead, there are measures to assist with this, as R is good at detecting when there is an open bracket, though not at telling you where you need to close it. So always be vigilant.
- The function is ggplot not ggplot2.
- The data has been entered with the incorrect case: BeanSurvey not BEANSURVEY.
- The variables have been entered with incorrect cases: AGEHH not AgeHH; VILLAGE not Village.
- The correct function is geom_boxplot() not geom_box().
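If brackets and the + are hard to keep track of, one option is to build the plot in two named steps. This is a minimal sketch rather than the tutorial's own method; `demo_survey`, its values, and the village labels are hypothetical stand-ins for BeanSurvey.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: made-up values, real column names
demo_survey <- data.frame(
  AGEHH   = c(34, 51, 42, 67, 29, 45),
  VILLAGE = c("Village A", "Village A", "Village A",
              "Village B", "Village B", "Village B")
)

# Step 1: the base plot, with every opened bracket closed on the same line
base_plot <- ggplot(data = demo_survey, aes(x = AGEHH, y = VILLAGE))

# Step 2: connect the boxplot layer to the base plot with +
box_plot <- base_plot + geom_boxplot()

# One row of boxplot statistics is computed per village
box_stats <- layer_data(box_plot)
nrow(box_stats)
```

Printing `box_plot` draws the finished graph. Splitting the code this way makes it easier to see which line an error message refers to.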
Exercise 3. Produce a histogram of land area. Set the bins so that each bin covers a range of 1 acre
We first need to convert the question into the different ggplot elements:
- data -> BeanSurvey
- aesthetics -> x -> LANDAREA
- geom -> histogram
ggplot(BeanSurvey,aes(x=LANDAREA))+
geom_histogram()
As you can see, we first get a message with our plot telling us that 30 bins is the default setting for a histogram.
We can then think about modifying the plot so that each bin covers only 1 acre.
We need to find the option binwidth from within the geom_histogram function. Remember you can always use the R help pages to look into the arguments that any function can use.
ggplot(BeanSurvey,aes(x=LANDAREA))+
geom_histogram(binwidth = 1)
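Setting binwidth fixes how wide each bin is, but not where the bins start. If you want the bin edges to fall on whole acres (0 to 1, 1 to 2, and so on), the `boundary` argument of geom_histogram can be used. The tutorial does not use it, so treat this as an optional sketch; `demo_survey` and its land areas are made up.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: made-up land areas in acres
demo_survey <- data.frame(LANDAREA = c(0.5, 1, 1.5, 2, 2, 3, 4.5, 6))

# binwidth = 1 makes each bin 1 acre wide;
# boundary = 0 asks for bin edges at whole numbers
p <- ggplot(demo_survey, aes(x = LANDAREA)) +
  geom_histogram(binwidth = 1, boundary = 0)

# Inspect the bins that ggplot2 actually computed
bins <- layer_data(p)[, c("xmin", "xmax", "count")]
print(bins)
```

Typing `?geom_histogram` in the console opens the help page listing these arguments.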
Exercise 4. Make a scatter plot showing the number of adults on the x axis against the number of children on the y axis. Can you see any limitations to this plot?
Again, let's try to think about the different aspects of building this plot based on the question:
- data -> BeanSurvey
- aesthetics -> x -> ADULTS
- aesthetics -> y -> CHILDREN
- geom -> point
ggplot(BeanSurvey,aes(x=ADULTS,y=CHILDREN))+
geom_point()
A limitation here is that we have lots of overlapping observations. For example, lots of households in our data have 2 adults and 3 children, but we only see one point. This is an example of overplotting: much of the information is hidden because the small range of possible values makes our points stack on top of one another.
This could be a case where we might want to do a jitter plot instead.
ggplot(BeanSurvey,aes(x=ADULTS,y=CHILDREN))+
geom_jitter(width=0.1,height=0.1)
The width and height arguments control how far the points are allowed to jitter away from their true value.
It is best to think about the units you are working with when using a jitter plot. As you can see below, if I set the width and height too high then the points become too dispersed and it is ultimately impossible to tell which value each point originally belonged to.
ggplot(BeanSurvey,aes(x=ADULTS,y=CHILDREN))+
geom_jitter(width=0.5,height=0.5)
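To see how much overplotting the original scatter plot in this exercise hides, you can count how many households share each combination of adults and children. The sketch below uses base R only; `demo_survey` is a hypothetical stand-in with made-up counts, so the numbers will differ for BeanSurvey.

```r
# Hypothetical stand-in for BeanSurvey: made-up household counts
demo_survey <- data.frame(
  ADULTS   = c(2, 2, 2, 3, 2, 1, 3, 2),
  CHILDREN = c(3, 3, 3, 1, 0, 2, 1, 3)
)

# Count the households at each (ADULTS, CHILDREN) combination
combo_counts <- aggregate(
  n ~ ADULTS + CHILDREN,
  data = transform(demo_survey, n = 1),
  FUN = sum
)

# Combinations with n > 1 appear as a single point in geom_point()
overlapping <- combo_counts[combo_counts$n > 1, ]
print(overlapping)
```

Any combination listed in `overlapping` is a place where the plain scatter plot shows fewer points than there are households.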
Appendix: 'BeanSurvey' dataset
The data we are using in this session is an extract of a survey conducted in Uganda of farmers identified as growing beans.
The dataset contains an extract of 50 responses to 23 of the survey questions, and has been imported to R as a data frame called BeanSurvey.
A summary of the columns in the dataset is below.
Column | Description |
---|---|
ID | Farmer ID |
VILLAGE | Village name |
HHTYPE | Household composition |
GENDERHH | Gender of Household Head |
AGEHH | Age of Household Head |
OCCUHH | Occupation of Household Head |
ADULTS | Number of Adults within the household |
CHILDREN | Number of Children (<18) within the household |
MATOKE | Do they grow matoke? |
MAIZE | Do they grow maize? |
BEANS | Do they grow beans? |
BANANA | Do they grow banana? |
CASSAVA | Do they grow cassava? |
COFFEE | Do they grow coffee? |
LANDAREA | Land area of farm (acres) |
LABOR | Labor usage |
INTERCROP | Intercrops with beans |
DECISIONS | Household decision responsibility |
SELLBEANS | Do they grow beans for sale? |
BEANSPLANTED_LR | Quantity of beans planted in long rain season |
BEANSPLANTED_SR | Quantity of beans planted in short rain season |
BEANSHARVESTED_LR | Quantity of beans harvested in long rain season |
BEANSHARVESTED_SR | Quantity of beans harvested in short rain season |
Spend some time exploring the full dataset embedded below to familiarise yourself with the columns and the type of data stored within each column. You may need to refer back to this data at times during this tutorial. Remember that R is case sensitive, so you will always have to refer to the variables in this dataset exactly as they are written in the data. There is a column in this data called "GENDERHH" but there is no column in this data called "GenderHH".
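You can also check column names and types from the R console. The sketch below shows the idea on `demo_survey`, a hypothetical three-row stand-in with made-up values; replace it with BeanSurvey to inspect the real data.

```r
# Hypothetical stand-in for BeanSurvey: made-up values, real column names
demo_survey <- data.frame(
  ID       = 1:3,
  GENDERHH = c("female", "male", "male"),
  AGEHH    = c(34, 51, 42)
)

# List the column names exactly as R stores them
names(demo_survey)

# Show the type and first few values of each column
str(demo_survey)

# Column lookups are case sensitive
has_upper <- "GENDERHH" %in% names(demo_survey)  # TRUE
has_mixed <- "GenderHH" %in% names(demo_survey)  # FALSE
```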