Exercise 1
Remember when completing these exercises that nobody can remember every single piece of R code off the top of their head! If you get stuck, look back over the notes for similar examples, then work out how to copy and modify that code to meet the exercise objectives.
You can review either the first tutorial or the second part should you need any help.
All of the exercises use the BeanSurvey data we have been working with so far. You can see a description of the data in the appendix to remind yourself.
Question
Question 1: I am trying to make a histogram of the farmer's ages with different panels for each village. Can you identify and fix the errors in my code?
ggplot(BeanSurvey,aes(x=AGEHH))+
geom_hist(binwidth =5)+
facet_wrap(VILLAGE)
Solution
Question 1: I am trying to make a histogram of the farmer's ages with different panels for each village. Can you identify and fix the errors in my code?
Exercise:
ggplot(BeanSurvey,aes(x=AGEHH))+
geom_hist(binwidth =5)+
facet_wrap(VILLAGE)
Remember to try to find the errors by running the code and checking the messages. It can be very difficult to spot an error just by staring at the code without some help from the error message!
There were two mistakes:
Firstly, the correct geometry is geom_histogram() rather than geom_hist(). While the error message will not give the correct answer, the autocomplete within the code window could help you with this.
Secondly, the tilde ~ was missing from within facet_wrap(). Remember that this is needed in facet_wrap() before the variable name. Unfortunately the error message is not very helpful in telling you what change needs to be made.
ggplot(BeanSurvey,aes(x=AGEHH))+
geom_histogram(binwidth =5)+
facet_wrap(~VILLAGE)
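If the tilde is easy to forget, facet_wrap() also accepts the faceting variable wrapped in vars(). The sketch below compares the two forms. It uses a small made-up data frame, toy_survey, which is a hypothetical stand-in with invented values and not the real BeanSurvey data.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: invented values, same column names.
toy_survey <- data.frame(
  AGEHH   = c(25, 32, 41, 47, 53, 60, 29, 38, 44, 66),
  VILLAGE = rep(c("Village A", "Village B"), each = 5)
)

# Formula interface, as used in the solution: the tilde is required.
p_formula <- ggplot(toy_survey, aes(x = AGEHH)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~VILLAGE)

# vars() interface: no tilde, the column is named inside vars().
p_vars <- ggplot(toy_survey, aes(x = AGEHH)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(vars(VILLAGE))
```

Both versions should give one panel per village, so which one you use is a matter of taste.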
Exercise 2
Questions
Question 2a: I am making a bar plot of decision making, by gender of the household head. Can you modify the code, so that it looks like the example below with female headed households coloured in 'purple' and male headed households coloured in 'orange'
ggplot(BeanSurvey,aes(y=DECISIONS))+
geom_bar()
Question 2b: Taking the plot you created in Question 2a, now make some changes to the labels: i) remove the y axis label; ii) Change the x axis label to read "Number of Farmers"; iii) Add an informative title
Solution
Question 2a: I am making a barplot of decision making, by gender of the household head. Can you modify the code, so that it looks like the example below with female headed households coloured in 'purple' and male headed households coloured in 'orange'
Exercise:
ggplot(BeanSurvey,aes(y=DECISIONS))+
geom_bar()
To get the required plot we need to make two modifications. Firstly we need to set the fill aesthetic to be based on the GENDERHH column from the data. The bars have two sorts of colour we can modify: fill for the inner region and colour for the outline. If we changed colour, the bars would still be shaded in grey but with red and blue outlines.
This is not what we want! We want the shading colour, the fill, to be modified. And because we are varying the colour based on a column from the data, we set it in the aesthetics, not in the geometry.
Secondly, to use the purple and orange colours we need to modify the fill aesthetic using the correct scale_ function. Because we are setting the fill colours manually, the function we need to add is scale_fill_manual(). There are two values of GENDERHH, so we need to supply two colours to the values argument, using the c() function to bring these colours together.
ggplot(BeanSurvey,aes(y=DECISIONS,fill=GENDERHH))+
geom_bar()+
scale_fill_manual(values=c("purple","orange"))
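With an unnamed vector, the colours are matched to the groups in the order of the groups. If you would rather not rely on that ordering, scale_fill_manual() also accepts a named vector that ties each colour to a specific value. A minimal sketch follows, assuming the values of GENDERHH are "female" and "male". That is a guess about how the real data is coded, and toy_survey is an invented data frame, so check the values in your own data before copying this.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: invented values, same column names.
toy_survey <- data.frame(
  DECISIONS = c("husband", "wife", "shared", "wife", "husband", "shared"),
  GENDERHH  = c("male", "female", "male", "female", "male", "female")
)

# Assumed level names; the names must match the values in GENDERHH exactly.
gender_colours <- c(female = "purple", male = "orange")

p <- ggplot(toy_survey, aes(y = DECISIONS, fill = GENDERHH)) +
  geom_bar() +
  scale_fill_manual(values = gender_colours)
```

The named version keeps female headed households purple even if the order of the groups changes later.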
Question 2b: Taking the plot you created in Question 2a, now make some changes to the labels: i) remove the y axis label; ii) Change the x axis label to read "Number of Farmers"; iii) Add an informative title
I can set the y axis label to be blank, "", rather than "remove" it; this is an easier way to achieve the same thing! So all three of my steps can be done in the same function, labs(), by setting the y, x and title labels.
There are actually a few ways to remove an axis label that you may have come across if you used help pages or the internet to find a solution. These would also have been valid:
labs(y = NULL)
theme(axis.title.y = element_blank())
Depending on my screen resolution I might encounter some problems with my title running off the page. This is where it could be useful to also add a call to the theme() function and make the text size smaller using element_text(). An alternative solution could be to split the title into two components: a title and a subtitle.
ggplot(BeanSurvey,aes(y=DECISIONS,fill=GENDERHH))+
geom_bar() +
scale_fill_manual(values = c("purple","orange"))+
labs(y="", x="Number of Farmers",title="Barchart of Decision Making by Gender of Head") +
theme(title = element_text(size=9))
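The alternative mentioned above, splitting a long title into a title and a subtitle, could look something like the sketch below. The wording of the labels is only a suggestion, and toy_survey is an invented data frame standing in for BeanSurvey.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: invented values, same column names.
toy_survey <- data.frame(
  DECISIONS = c("husband", "wife", "shared", "wife", "husband", "shared"),
  GENDERHH  = c("male", "female", "male", "female", "male", "female")
)

p <- ggplot(toy_survey, aes(y = DECISIONS, fill = GENDERHH)) +
  geom_bar() +
  scale_fill_manual(values = c("purple", "orange")) +
  # A short title plus a subtitle instead of one long title.
  labs(y = NULL, x = "Number of Farmers",
       title = "Household Decision Making",
       subtitle = "By gender of household head")
```

Note that theme(title = ...) changes every title in the plot, including the axis and legend titles. To shrink only the main title you could use theme(plot.title = element_text(size = 9)) instead.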
Exercise 3
Questions
Question 3: Make a scatter plot of the harvested quantities of beans in the long rains against the planted quantity of beans in the long rains. Place the harvested quantity on the y axis, and the planted quantity on the x axis. Now consider how to also show how the different villages are associated with this relationship. Try two different options:
a) Make different coloured points for each village
b) Put the two villages in different panels
Consider which of these two plots shows the relationship more clearly. Take your preferred plot, and tidy up the axis labels and titles.
Solution
Question 3: Make a scatter plot of the harvested quantities of beans in the long rains against the planted quantity of beans in the long rains. Place the harvested quantity on the y axis, and the planted quantity on the x axis.
The question has done some of the heavy lifting for working out our solution:
- Y axis = Harvested quantity of beans in the long rains (BEANSHARVESTED_LR)
- X axis = Planted quantity of beans in the long rains (BEANSPLANTED_LR)
- Scatter plot = geom_point
Note that you will receive a warning message about the removal of 3 observations due to missing data. This is a common warning message to receive, but you may wish to take note in your own analysis if this is unexpected.
ggplot(BeanSurvey , aes(y=BEANSHARVESTED_LR,x=BEANSPLANTED_LR)) +
geom_point()
Now consider how to also show how the different villages are associated with this relationship. Try two different options: a) Make different coloured points for each village
The key thing here is to understand that for the default plotting symbol of geom_point the way to modify the colour is through colour rather than fill. So, unlike the bar chart from Q2, this time I need to map the colour aesthetic to the VILLAGE column. If you were to use fill, nothing would actually change: there is no space to "fill" with colour, so the points would stay black.
ggplot(BeanSurvey , aes(y=BEANSHARVESTED_LR,x=BEANSPLANTED_LR,colour=VILLAGE)) +
geom_point()
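The fill aesthetic does have an effect on points if you switch to one of the plotting symbols that has an interior, such as shape 21 (a circle with a separate outline and fill). This is a side note rather than part of the solution, sketched here with an invented toy_survey data frame rather than the real BeanSurvey data.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: invented values, same column names.
toy_survey <- data.frame(
  BEANSPLANTED_LR   = c(2, 4, 5, 8, 10, 12),
  BEANSHARVESTED_LR = c(10, 25, 20, 40, 55, 50),
  VILLAGE           = rep(c("Village A", "Village B"), times = 3)
)

# Shape 21 has an outline (colour) and an interior (fill),
# so mapping fill to VILLAGE now changes how the points look.
p <- ggplot(toy_survey,
            aes(y = BEANSHARVESTED_LR, x = BEANSPLANTED_LR, fill = VILLAGE)) +
  geom_point(shape = 21, size = 3, colour = "black")
```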
b) Put the two villages in different panels
ggplot(BeanSurvey , aes(y=BEANSHARVESTED_LR,x=BEANSPLANTED_LR)) +
geom_point() +
facet_wrap(~VILLAGE)
Remember to use the tilde ~ before VILLAGE when making the facet.
Consider which of these two plots shows the relationship more clearly. Take your preferred plot, and tidy up the axis labels and titles.
There are pros and cons to both plots (and lots of further ways to modify the plots as explored in the tutorials).
The 'right' plot here depends on exactly what we want to highlight. The facets show the within-village patterns much more clearly, but make it harder to compare between villages. The colours allow you to compare the villages more easily, but make seeing the results within each village more difficult.
We could of course do both! That is what I've done here, and I have then used labs() to label my axes. But this is more or less the same presentation as just using the facets; the colour is mostly just a bonus visual element.
ggplot(BeanSurvey , aes(y=BEANSHARVESTED_LR,x=BEANSPLANTED_LR,colour=VILLAGE)) +
geom_point() +
facet_wrap(~VILLAGE)+
labs(x="Beans Planted in Long Rains (kg)",y="Beans Harvested in Long Rains (kg)",
title="Beans Planted vs Beans Harvested",subtitle="Long Rains",colour="Village")
Exercise 4
Questions
Question 4a: Make a plot showing how the quantity of beans harvested in the long rains is related to the household type (HHTYPE). Choose a sensible geometry to show this relationship.
Question 4b: Whatever plot you decided on for question 4a, you may have found that because the labels for the HHTYPE variable are quite long, the text along the axis became squashed. Look into the elements which can be customised within theme and try to make these labels fit better by decreasing the font size or modifying the angle at which the labels are aligned to the axis.
Solution
Question 4a: Make a plot showing how the quantity of beans harvested in the long rains is related to the household type (HHTYPE). Choose a sensible geometry to show this relationship.
When considering what sort of plot to make we should always consider what type of variables we have. In this case we have a continuous numeric variable, BEANSHARVESTED_LR, and a categorical variable, HHTYPE. A common plot to make here might be a boxplot.
However - this would actually be very misleading because of the way the data is distributed. There are very few observations of households in certain 'types' - just two single men and just one 'other'. Because the boxplots provide a summary of the values in each group then we need to have enough observations so that we can form a reasonable and meaningful summary. When you have only a few observations per group probably the better option would be to use a point geometry.
In this case, particularly with the zero values, there are many points overlapping, like in the final question from the previous module. So a jittered scatter plot may be better using geom_jitter. A better plot might be possible by combining or removing the small categories. This sort of data manipulation will be covered in a separate set of tutorials.
ggplot(BeanSurvey , aes(y=BEANSHARVESTED_LR,x=HHTYPE))+
geom_jitter(width=0.1)
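One quick way to check whether each group has enough observations for a boxplot is to count the rows per category with table(). The sketch below uses an invented toy_survey data frame; the household type labels and the cut-off of 5 are assumptions made purely for illustration.

```r
# Hypothetical stand-in for BeanSurvey: invented household types and counts.
toy_survey <- data.frame(
  HHTYPE = c(rep("Type A", 6), rep("Type B", 5),
             rep("Single man", 2), "Other")
)

# Number of households in each category.
hh_counts <- table(toy_survey$HHTYPE)
print(hh_counts)

# Categories with fewer than 5 households (an arbitrary threshold):
# a boxplot would summarise these groups from very little data.
small_groups <- names(hh_counts)[hh_counts < 5]
print(small_groups)
```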
Question 4b: Whatever plot you decided on for question 4a, you may have found that because the labels for the HHTYPE variable are quite long, the text along the axis became squashed. Look into the elements which can be customised within theme and try to make these labels fit better by decreasing the font size or modifying the angle at which the labels are aligned to the axis.
ggplot(BeanSurvey , aes(y=BEANSHARVESTED_LR,x=HHTYPE))+
geom_jitter(width=0.1)+
theme(axis.text.x = element_text(size=7,angle=15))
In the theme() function we can find the component we want to modify, the axis text on the x axis (axis.text.x), and then use element_text() to make this more legible. Using angle we can set the labels to be at 90 degrees (vertical) or at a slight angle, 10 to 15 degrees, and either of these could make the labels easier to read. Just reducing the size of the labels will probably make them too small to read!
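When labels are rotated by a larger angle they can end up centred underneath the tick marks and overlap the plot. Adding hjust = 1 inside element_text() right-aligns each label so that it ends at its tick. A sketch with a 45 degree angle follows, using an invented toy_survey data frame with made-up household type labels.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: invented values and labels.
toy_survey <- data.frame(
  HHTYPE = rep(c("A fairly long household type label",
                 "Another fairly long label"), each = 4),
  BEANSHARVESTED_LR = c(0, 10, 20, 35, 0, 5, 15, 40)
)

p <- ggplot(toy_survey, aes(y = BEANSHARVESTED_LR, x = HHTYPE)) +
  geom_jitter(width = 0.1) +
  # hjust = 1 right-aligns the rotated labels against their tick marks.
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```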
Exercise 5
Questions
Question 5: Make a plot showing the relationship between the gender of the household head (GENDERHH), village (VILLAGE) and whether the household sells any of their beans (SELLBEANS). Include nice colours, sensible axis labels, a title, and use one of the custom built-in themes to make the plot look nice. Also consider increasing the font size for some of the labels so they can be read more clearly.
Solution
Question 5: Make a plot showing the relationship between the gender of the household head (GENDERHH), village (VILLAGE) and whether the household sells any of their beans (SELLBEANS). Include nice colours, sensible axis labels, a title, and use one of the custom built-in themes to make the plot look nice. Also consider increasing the font size for some of the labels so they can be read more clearly.
There are lots of ways you could have chosen to make a plot here! The way I went about it is just one option.
With these three variables the question I am probably most interested in is looking at whether gender of household head and village have any influence on the likelihood that the household sells beans.
I have three categorical variables so I will be making some bar charts. I can have one variable as the bars, one on the x axis and the third as a facet. The bars should probably represent my outcome, "selling beans".
Structurally it would make most sense to have facets for village and gender on the x axis. The male and female headed households live within a village, so grouping in this way is more logical and easier to understand than grouping panels by gender and having village on the x axis.
I would quite like my bars to represent percentages rather than frequencies, because I have an unequal number of male and female headed households and I would like to compare them. Using position="fill" within geom_bar() converts the bars from counts into proportions.
The axis labels are for a proportion. Using scale_y_continuous() I can change these into percentages by setting the breaks and labels. This also lets me change from the default breaks every 25%, which make it harder to read off the values, to breaks every 20%.
I can do a further trick when assigning colours. We are most interested in the % selling beans, and we can always work out the % not selling beans because the two add up to 100%. So if I set a missing colour (NA) for the "not selling beans" group, the plot becomes a bit easier to read as I just see the bars making up the % selling beans.
I've finished off by using the light theme, theme_light(), which I think looks quite nice, and adding in some labels.
ggplot(BeanSurvey,aes(fill=SELLBEANS,x=GENDERHH))+
geom_bar(position="fill",show.legend=FALSE) +
scale_y_continuous(breaks=c(0,0.2,0.4,0.6,0.8,1),labels=c("0%","20%","40%","60%","80%","100%"))+
scale_fill_manual(values=c(NA,"darkred"))+
facet_wrap(~VILLAGE)+
labs(y="% Selling Beans",title="Barplot of Bean Sales by Gender and Village",x="Gender")+
theme_light()
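Typing out every break and label by hand works, but it is easy to make a slip. One alternative is to generate the breaks with seq() and let the scales package (installed alongside ggplot2) produce the percentage labels. The sketch below uses an invented toy_survey data frame; the "Yes"/"No" coding of SELLBEANS is an assumption, not taken from the real data.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey: invented values, same column names.
toy_survey <- data.frame(
  GENDERHH  = rep(c("female", "male"), times = 4),
  VILLAGE   = rep(c("Village A", "Village B"), each = 4),
  SELLBEANS = c("Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No")
)

# Breaks every 20%, generated rather than typed out.
pct_breaks <- seq(0, 1, by = 0.2)

# A labelling function that turns proportions into whole percentages.
pct_labels <- scales::label_percent(accuracy = 1)

p <- ggplot(toy_survey, aes(fill = SELLBEANS, x = GENDERHH)) +
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = pct_breaks, labels = pct_labels) +
  facet_wrap(~VILLAGE)
```

Here pct_labels(pct_breaks) returns the same six labels, "0%" to "100%", that were written out by hand in the solution.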
If you wish to move on through to our tutorials on data manipulation, here is part 1
Appendix: 'BeanSurvey' dataset
The data we are using in this session is an extract of a survey conducted in Uganda from farmers identified as growing beans.
The dataset contains an extract of 50 responses to 23 of the survey questions, and has been imported to R as a data frame called BeanSurvey.
A summary of the columns in the dataset is below.
Column | Description |
---|---|
ID | Farmer ID |
VILLAGE | Village name |
HHTYPE | Household composition |
GENDERHH | Gender of Household Head |
AGEHH | Age of Household Head |
OCCUHH | Occupation of Household Head |
ADULTS | Number of Adults within the household |
CHILDREN | Number of Children (<18) within the household |
MATOKE | Do they grow matoke? |
MAIZE | Do they grow maize? |
BEANS | Do they grow beans? |
BANANA | Do they grow banana? |
CASSAVA | Do they grow cassava? |
COFFEE | Do they grow coffee? |
LANDAREA | Land area of farm (acres) |
LABOR | Labor usage |
INTERCROP | Intercrops with beans |
DECISIONS | Household decision responsibility |
SELLBEANS | Do they grow beans for sale? |
BEANSPLANTED_LR | Quantity of beans planted in long rain season |
BEANSPLANTED_SR | Quantity of beans planted in short rain season |
BEANSHARVESTED_LR | Quantity of beans harvested in long rain season |
BEANSHARVESTED_SR | Quantity of beans harvested in short rain season |
Spend some time looking through and exploring the full dataset embedded below, to familiarise yourself with the columns and the type of data stored within each column. You may need to refer back to this data at times during this tutorial. Remember that R is case sensitive, so you will always have to refer to the variables in this dataset exactly as they are written in the data. There is a column in this data called "GENDERHH" but there is no column in this data called "GenderHH".
(You can use the arrow keys on your keyboard to scroll right in case the data table does not fit entirely on your screen)
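The point about case sensitivity can be checked directly in R. The sketch below uses a tiny invented data frame, toy_survey, rather than the real BeanSurvey data, but the same checks would work on any data frame.

```r
# Hypothetical stand-in for BeanSurvey with a single invented column.
toy_survey <- data.frame(GENDERHH = c("male", "female"))

# names() lists the column names exactly as R stores them.
print(names(toy_survey))

# The exact name is found; the differently-cased name is not.
has_upper <- "GENDERHH" %in% names(toy_survey)
has_mixed <- "GenderHH" %in% names(toy_survey)

# Asking for a column that does not exist returns NULL rather than an error.
missing_col <- toy_survey$GenderHH
```

In this sketch has_upper is TRUE, has_mixed is FALSE and missing_col is NULL. Inside ggplot(), using the wrong case usually gives an "object not found" error instead.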