
Overview

In this tutorial we will look at one of the most widely used, and almost universally useful, applications of the R programming language: data visualisation. We will do this by learning about the plotting library ggplot2. There are three parts to this workbook, as well as a series of exercises and questions. You can complete the workbooks at your own pace, across multiple sessions, although note that progress through each part of the workbook is only saved if you use the same device. You can skip back and forth at any point within each part using the navigation menu.

Overview of ggplot2

ggplot2 is not the only plotting library available in R. There are many different ways of making plots using R, although these days ggplot2 is the most commonly used. Particularly if you are looking at older resources online, you may see plots being produced using the base R plotting functions, or sometimes a different library called lattice. However, ggplot2 is an obvious choice for new users of R to start with: the code is designed to be consistent so that, in theory, you can make any sort of plot you could imagine; the default style is visually appealing; and plots can easily be customised in lots of ways. There is a trade-off for this, since ggplot2 uses a lot of slightly strange terminology which may take a while to learn and get used to.

'gg' stands for 'grammar of graphics', a term coined in the 1999 book by Leland Wilkinson. This is a system for defining graphs that encourages you to think about each of the different components making up the graph, and to provide specifications for those components. In this course we will focus on six of these components: 'aesthetics', 'geometries', 'themes', 'facets', 'scales' and 'labels'. This specification makes you think clearly about exactly what you want to display, allows an extremely high degree of customisation both in 'graphical content' and in style, and makes it easy to provide visual consistency over lots of different graphs.

There are other components you can explore beyond this, but with these six components you will be able to make thousands of possible plots from your data. Both the ggplot2 syntax and the terminology can look confusing at first, but with practice, once you understand the basic building blocks, you will be able to produce complex graphs very quickly and easily.

As a reminder, in this workbook there are a mixture of code chunks that you can run without needing to modify any code, and code chunks which are prefaced with a QUESTION, where you will need to make changes to the code in the chunk. Solutions to these questions are embedded within the chunk.

'BeanSurvey' dataset

The data we are using in this session is an extract of a survey conducted in Uganda from farmers identified as growing beans.

The dataset contains an extract of 50 responses to 23 of the survey questions, and has been imported to R as a data frame called BeanSurvey.

A summary of the columns in the dataset is below.

Column             Description
ID                 Farmer ID
VILLAGE            Village name
HHTYPE             Household composition
GENDERHH           Gender of Household Head
AGEHH              Age of Household Head
OCCUHH             Occupation of Household Head
ADULTS             Number of Adults within the household
CHILDREN           Number of Children (<18) within the household
MATOKE             Do they grow matoke?
MAIZE              Do they grow maize?
BEANS              Do they grow beans?
BANANA             Do they grow banana?
CASSAVA            Do they grow cassava?
COFFEE             Do they grow coffee?
LANDAREA           Land area of farm (acres)
LABOR              Labor usage
INTERCROP          Intercrops with beans
DECISIONS          Household decision responsibility
SELLBEANS          Do they grow beans for sale?
BEANSPLANTED_LR    Quantity of beans planted in long rain season
BEANSPLANTED_SR    Quantity of beans planted in short rain season
BEANSHARVESTED_LR  Quantity of beans harvested in long rain season
BEANSHARVESTED_SR  Quantity of beans harvested in short rain season

Spend some time exploring the full dataset embedded below, to familiarise yourself with the columns and the type of data stored within each column. You may need to refer back to this data at times during this tutorial. Remember that R is case sensitive, so you will always have to refer to the variables in this dataset exactly as they are written in the data. There is a column in this data called "GENDERHH" but there is no column in this data called "GenderHH".
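
If you prefer to check column names and types with code, the sketch below shows one way to do it. It uses a small made-up data frame, toy_survey, as a stand-in for BeanSurvey (the name and the values are assumptions for illustration only, not the real survey data).

```r
# Hypothetical stand-in for BeanSurvey; the values are invented.
toy_survey <- data.frame(
  VILLAGE  = c("Kimbugu", "Kimbugu", "Lwala", "Lwala", "Lwala"),
  GENDERHH = c("male", "female", "male", "female", "male"),
  AGEHH    = c(32, 45, 51, 28, 60)
)

names(toy_survey)  # column names, exactly as R sees them
str(toy_survey)    # the type of data stored in each column

# Column names are case sensitive, so only the exact spelling matches.
has_upper <- "GENDERHH" %in% names(toy_survey)  # TRUE
has_mixed <- "GenderHH" %in% names(toy_survey)  # FALSE
```

In the tutorial itself the same calls could be run on BeanSurvey in place of toy_survey.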

(You can use the arrow keys on your keyboard to scroll right in case the data table does not fit entirely on your screen)

First Plot

To start off we are going to look at producing some plots with just one variable at a time. Let's start with a simple bar chart showing the frequency of different villages within the dataset. The code has been written for you in the chunk below. Press "Run Code" to see the resulting plot.

ggplot(data = BeanSurvey,  aes(x = VILLAGE)) + 
  geom_bar()

Easy! Well, maybe not at first, since some of the commands we're using in that chunk look a bit strange. Let's start taking a look at each of the components within that code to understand what they all do.

Data

Let's break down each of the different parts of this command.

With ggplot2, you always begin a graph with the function ggplot(). This function provides the information which will be passed to all subsequent functions making up the plot - about the data and the aesthetic mapping being used.

You may be slightly confused as to why the function is just ggplot when the package is ggplot2. The original version of ggplot came largely from a PhD thesis and was not adopted for wide use. Once a more advanced version was developed, it was decided to release this as ggplot2 rather than updating the original version.

To make a plot you need some data. All ggplots must be made from an object of class data frame (or tibble); we cannot plot anything which does not live inside one of these data objects. So the first thing that we need to specify in the function is which data frame we are using. In this case the data frame we are using has been assigned the name BeanSurvey.
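
As a small illustration of this requirement, the sketch below starts from values held in a plain vector and wraps them in a data frame before passing them to ggplot(). The names ages and age_df and the values are made up for this example.

```r
library(ggplot2)

# Invented ages held in a plain numeric vector, not a data frame.
ages <- c(32, 45, 51, 28, 60)

# Wrap the vector in a data frame so that ggplot() has a column to work with.
age_df <- data.frame(AGEHH = ages)

# The data frame, rather than the bare vector, is what we give to ggplot().
canvas <- ggplot(data = age_df)
```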

And we could just finish our code at this point and run this chunk of code, it is a valid and complete piece of code.

ggplot(data = BeanSurvey)

But it is totally useless by itself! We have told R to make a plot from the BeanSurvey data, but we have not told it what variables from the data we want to plot, or how we want to plot them. So we get a completely blank canvas returned to us.

This will lead us to specifying the two most important components of the ggplot - what variables we want to plot (the "aesthetic mapping") and how we want to plot them (the "geometries"). Every ggplot you will ever make will require you to specify as a minimum, the data, the aesthetics and the geometries. Everything else beyond that point is, to a certain degree, optional as there are default settings which are applied.

Aesthetic Mapping

If we go back to the original code we saw the part of the code which said aes(x = VILLAGE). 'aes' is the abbreviation for aesthetics. This is the word used to describe the different components of the plot onto which we can map different variables from our data. When thinking of most straightforward graphs the obvious aesthetics are the x axis and y axis. But we will also learn more about other aesthetics later, as we can also use our data to define colours, size of points, shapes, labels and so on. For some plot types, like bar charts, we only need to specify one aesthetic as a minimum. So we can map the column VILLAGE on to the x aesthetic of our graph in preparation for making a simple bar chart.

Again, we can run just this line on its own:

ggplot(data = BeanSurvey, aes(x = VILLAGE))

You will see this is also a blank canvas, as before, except this time the x axis has been labelled and plotted based on the unique values of the column VILLAGE from our dataset BeanSurvey.

This whole line has now built up the page on which we will draw a plot. But we need to tell R exactly how it should draw the plot, which we do by setting a geometry.

Geometries

All geometry types are functions named geom_xxx, where the xxx is replaced by the shorthand for the plot type that you want to make. The geom function we want to use now for a bar chart is geom_bar.

When we write the code for a ggplot, we think in terms of layering all of these different functions on top of each other. Each function forms a layer, and the layers are added to the plot in sequence. The canvas we have just set is one layer; the bars will be another layer. The components are separated using a +, and it is conventional to start a new line for each new layer, as this makes your code easier to read.

ggplot(data = BeanSurvey,  aes(x = VILLAGE)) + 
  geom_bar()

Note that the + must be placed at the end of the line, not at the beginning of the new line. If you forget to put the plus at the end of the line, R will see a complete and valid piece of code and run it on its own, producing only the blank canvas. The lines which come afterwards are then not added to the plot (and usually give a warning or error when R tries to run them separately).
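
One way to see the layering at work is to store the canvas in an object and then add the geometry to it. The sketch below assumes a small made-up data frame, toy_survey, in place of BeanSurvey; the object names canvas and bar_plot are also just for illustration. Looking at the layers element of a plot object is used here only as a quick check, not as something you would normally need to do.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey; the values are invented.
toy_survey <- data.frame(
  VILLAGE = c("Kimbugu", "Kimbugu", "Lwala", "Lwala", "Lwala")
)

# The canvas: data and aesthetic mapping only, no geometry yet.
canvas <- ggplot(data = toy_survey, aes(x = VILLAGE))

# The + sits at the end of the line, so R knows the command continues.
bar_plot <- canvas +
  geom_bar()

n_canvas_layers <- length(canvas$layers)    # no geometry layers yet
n_bar_layers    <- length(bar_plot$layers)  # one geometry layer added
```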

The y axis for geom_bar, by default, becomes the frequency of each unique category within the variable mapped to our x aesthetic. So from the plot we can see that in our survey we interviewed farmers from two villages, Kimbugu and Lwala, and that there were slightly more farmers interviewed in Lwala compared to Kimbugu.
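
To convince yourself that the bar heights are simply counts, you can compare them with a frequency table from base R. This is a sketch on a made-up data frame, toy_survey (an assumption, not the real BeanSurvey data); layer_data() is a ggplot2 helper that returns the values computed for a layer.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey; the values are invented.
toy_survey <- data.frame(
  VILLAGE = c("Kimbugu", "Kimbugu", "Lwala", "Lwala", "Lwala")
)

bar_plot <- ggplot(data = toy_survey, aes(x = VILLAGE)) +
  geom_bar()

# Frequency of each village, calculated directly with base R.
village_counts <- table(toy_survey$VILLAGE)

# Heights of the bars as calculated by geom_bar() for the plot.
bar_heights <- layer_data(bar_plot)$count
```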

Question: Modify the code to make a bar chart of AGEHH instead of VILLAGE.

ggplot(data = BeanSurvey,  aes(x = VILLAGE)) +
  geom_bar()

Solution:

ggplot(data = BeanSurvey,  aes(x = AGEHH)) +
  geom_bar()

Plots with one continuous variable

Age, unlike Village, is a continuous variable in our dataset. So a bar chart is probably not the best choice of plot when exploring it. With this sort of data we might instead consider histograms, density plots or boxplots. These can be created easily by changing the geom from geom_bar to geom_histogram, geom_density or geom_boxplot, without needing to modify our code in any other way. A fourth option, geom_violin, needs one small extra change, which we will come to shortly.

ggplot(data = BeanSurvey,  aes(x = AGEHH)) + 
  geom_histogram()

If you are unsure of how bar charts and histograms differ, it might be worth recapping the difference between these two types of plot.

In this case R gives us a message: "stat_bin() using bins = 30. Pick better value with binwidth". When making a histogram, the numeric data is grouped together into "bins" of similar values. The default is to create 30 such groups, and R shows this message whenever we leave the default in place, as a reminder that 30 bins is rarely the best choice for a particular dataset. We might set the binwidth argument within geom_histogram, as the message suggests, so that each bar covers an age group spanning 5 years.

ggplot(data = BeanSurvey,  aes(x = AGEHH)) + 
  geom_histogram(binwidth=5)
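
The sketch below shows the two ways of controlling the bins: binwidth sets how wide each bin is, while bins sets how many bins there are. It uses a made-up data frame, toy_survey, rather than the real BeanSurvey data, and uses layer_data() to check the width of the bins that were actually drawn.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey; the ages are invented.
toy_survey <- data.frame(
  AGEHH = c(22, 28, 31, 35, 38, 40, 44, 47, 52, 63)
)

# Option 1: choose the width of each bin (here, 5 years).
by_width <- ggplot(data = toy_survey, aes(x = AGEHH)) +
  geom_histogram(binwidth = 5)

# Option 2: choose the number of bins instead.
by_count <- ggplot(data = toy_survey, aes(x = AGEHH)) +
  geom_histogram(bins = 8)

# Width of every bin in the first plot: upper edge minus lower edge.
bin_edges  <- layer_data(by_width)
bin_widths <- bin_edges$xmax - bin_edges$xmin
```

Setting either argument explicitly also stops the message about bins = 30 from appearing.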

geom_density also works in a similar way.

ggplot(data = BeanSurvey,  aes(x = AGEHH)) + 
  geom_density()

The density plot is a lot like a histogram, except it applies a smoothing function so that you can try to understand the underlying distribution of the variable you are examining.

So let's try the same again with the boxplot:

ggplot(data = BeanSurvey,  aes(x = AGEHH)) + 
  geom_boxplot()

If it is not something you immediately remember, it may be worth recapping the information presented within a boxplot.

It shows us:
* The median value, the line through the middle of the box. This is the point where half the values are lower, and half are higher.
* The interquartile range, the edges of the box. The lower edge is the point where 25% of the values are lower and 75% are higher; the upper edge is the point where 75% of the values are lower and 25% are higher.
* The whiskers, the lines extending from the box. In ggplot2 these reach to the most extreme values that lie within 1.5 times the interquartile range of the box edges, so they show the minimum and maximum only when there are no outliers. Any values beyond the whiskers are drawn as individual points.
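
To make these quantities concrete, the sketch below calculates them by hand in base R for a small set of made-up ages (the values are invented, not taken from BeanSurvey). It assumes the usual boxplot convention, which geom_boxplot follows by default, that the whiskers stop at the most extreme value within 1.5 times the interquartile range of the box.

```r
# Invented ages, including one unusually old household head.
ages <- c(22, 28, 31, 35, 38, 40, 44, 47, 52, 90)

# Edges of the box and the line through the middle.
quartiles <- quantile(ages, probs = c(0.25, 0.5, 0.75))
box_lower <- unname(quartiles[1])
box_mid   <- unname(quartiles[2])  # the median
box_upper <- unname(quartiles[3])

# Interquartile range: the length of the box.
iqr <- box_upper - box_lower

# Whiskers reach the most extreme values within 1.5 * IQR of the box.
whisker_low  <- min(ages[ages >= box_lower - 1.5 * iqr])
whisker_high <- max(ages[ages <= box_upper + 1.5 * iqr])

# Anything beyond the whiskers would be drawn as a separate point.
outliers <- ages[ages < whisker_low | ages > whisker_high]
```

With these values the upper whisker stops at 52 and the age of 90 is left over as an outlier, so the end of the whisker is not the maximum of the data.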

A sort of cross between a density plot and a boxplot also exists, known as a violin plot. As you may expect, you can create this using geom_violin. Note that this requires both an x and a y aesthetic, but we can replace one of these with a 1 to create a single violin for the whole distribution.

ggplot(data = BeanSurvey,  aes(y = AGEHH, x = 1)) + 
  geom_violin()

With all of the plots we have made so far we have a choice - if we want the histogram, density plot or boxplot to be plotted horizontally we would specify just an x aesthetic. But with all of these geometries we could also have the same plot produced vertically by just using a y aesthetic.

ggplot(data = BeanSurvey,  aes(y = AGEHH)) + 
  geom_boxplot()

In most cases it is personal preference as to whether you prefer to see the plots horizontally or vertically, and different people may have different preferences.

We have shown multiple ways to plot a single continuous variable but there are many pros and cons to each approach. Ultimately the decision is up to you as a user but here are some pointers that may help.

Boxplot
+ Shows key summary statistics
+ Useful for comparing multiple distributions
+ Can be easily combined with a violin plot without losing information
- Can be difficult to interpret for a less statistically literate audience

Violin plot
+ Useful for visualising the whole of a distribution
+ Useful for comparing multiple distributions
+ Can be easily combined with a boxplot without losing information
- Density is hard to interpret so this gives a mostly visual clue

Histogram
+ Useful for visualising the whole of a distribution
+ Tends to use frequencies so can more accurately identify the location of peaks or troughs
+ Easy to interpret
- Requires separate plots to compare distributions

Density plot
+ Useful for visualising the whole of a distribution
+ Provides a good idea of the shape of the distribution so can easily be compared to common theoretical distributions
+ Can plot multiple density lines onto one plot to compare
- Can be difficult to interpret as density is not intuitive

Statisticians love boxplots. Although a single boxplot by itself is kind of boring - lots of boxplots side by side can be quite interesting and allow for comparisons. I wonder whether there is any relationship between gender of the household head and their age?

Plots with more than one variable

Using the same code we used previously, we can add in a mapping of a categorical variable to our aesthetics, to produce multiple boxplots side by side. With geom_boxplot, if we are specifying both a y and an x variable then one of these must be a continuous numeric variable (like age or bean yield) and the other must be a categorical (ordinal or nominal) variable (like village or gender).

ggplot(data = BeanSurvey,  aes(y = AGEHH,x=GENDERHH)) + 
  geom_boxplot()

QUESTION: Try to interpret the results of this boxplot. Do we see differences in the distribution of ages between male and female household heads? If so, what?

Trying to use another numeric variable, like CHILDREN, instead of GENDERHH will not give an error. Instead, the plot ignores that x variable and draws a single box, and a warning message appears.

ggplot(data = BeanSurvey,  aes(y = AGEHH,x=CHILDREN)) + 
  geom_boxplot()

Not every mistake you make in R will result in an error! In this case we have produced a plot which is almost certainly not what we wanted. Although again, R has given us a message to try to help us here: "Continuous x aesthetic -- did you forget aes(group=...)?"

So to fix this problem, within the aes() we should include group=CHILDREN as well as x=CHILDREN.

ggplot(data = BeanSurvey,  aes(y = AGEHH,x=CHILDREN,group=CHILDREN)) + 
  geom_boxplot()

This plot is quite interesting. Household heads with no children living within the household tend to be much older on average. This makes a lot of sense, since many of these would be farmers who do have children, but whose children are no longer aged under 18 or no longer living in the household. Then we can see an increasing trend in the average age of the household head as the number of children living in the household increases. You can also see that the 'box' for 6 children is actually a line! This will happen if there is only one single observation within a group, or if all the observations within the group are the same.

Note that, if we had only included the group=CHILDREN without the x=CHILDREN we still would have had an issue. This time the separate plots would have been created, but the axis would not have been labelled correctly.

ggplot(data = BeanSurvey,  aes(y = AGEHH,group=CHILDREN)) + 
  geom_boxplot()
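
The sketch below compares the group approach with one common alternative, which is to convert the numeric variable to a categorical one with factor() inside aes(). This alternative is not part of the tutorial text above, and the data frame toy_survey is made up for illustration. Both versions draw one box per number of children; the factor version treats the x axis as categories, while the group version keeps it as a numeric scale.

```r
library(ggplot2)

# Hypothetical stand-in for BeanSurvey; the values are invented.
toy_survey <- data.frame(
  AGEHH    = c(30, 34, 41, 45, 52, 58),
  CHILDREN = c(0, 0, 2, 2, 4, 4)
)

# Approach from the tutorial: numeric x axis, one box per group.
by_group <- ggplot(data = toy_survey,
                   aes(y = AGEHH, x = CHILDREN, group = CHILDREN)) +
  geom_boxplot()

# Alternative: treat the number of children as a category.
by_factor <- ggplot(data = toy_survey,
                    aes(y = AGEHH, x = factor(CHILDREN))) +
  geom_boxplot()

# Number of boxes drawn by each approach.
n_boxes_group  <- nrow(layer_data(by_group))
n_boxes_factor <- nrow(layer_data(by_factor))
```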

Using this same set of variables we can now also see what a proper violin plot is supposed to look like. Essentially we are plotting the density plot seen earlier but turned upwards and reflected back to back.

ggplot(data = BeanSurvey,  aes(y = AGEHH,x=GENDERHH)) +
  geom_violin()

ggplot(data = BeanSurvey,  aes(y = AGEHH,x=CHILDREN,group=CHILDREN)) +
  geom_violin()

However, here we get into one of the disadvantages of violin plots relative to boxplots. Violin plots are useful because they give an idea of the whole distribution, in a similar way to histograms and density plots, and they can be used to compare multiple groups. But they hit a limit on how many groups are readable pretty quickly, whereas boxplots can be compared across many more groups. The violins become very thin as they grow in number, making them impossible to interpret.

However, when dealing with smaller numbers of groups violin plots can be very useful, as boxplots tend to hide information about the distribution because they focus on a few key summary statistics. A boxplot may hide key information about an unusual distribution, such as a bimodal distribution with two distinct peaks: the median would lie between the peaks, despite few observations being close to it.
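
A small made-up example may help to show the problem. The data frame bimodal below is invented for illustration: it has one cluster of ages around 22 and another around 62, with nothing in between.

```r
library(ggplot2)

# Invented bimodal ages: two clusters with a wide gap between them.
bimodal <- data.frame(AGE = c(18:26, 58:66))

# The median falls in the gap, where there are no observations at all.
middle      <- median(bimodal$AGE)
near_middle <- sum(abs(bimodal$AGE - middle) < 10)

# The boxplot draws its central line in the empty gap ...
box_plot <- ggplot(data = bimodal, aes(y = AGE, x = 1)) +
  geom_boxplot()

# ... while the violin plot shows the two separate peaks.
violin_plot <- ggplot(data = bimodal, aes(y = AGE, x = 1)) +
  geom_violin()
```

Printing box_plot and violin_plot one after the other should make the contrast clear.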

Question: Now modify the code below to produce some boxplots to investigate if there is a relationship between LANDAREA and VILLAGE. Make sure you assign the variables to the appropriate aesthetics.

ggplot(data = BeanSurvey,  aes(y = AGEHH,x=GENDERHH)) +
  geom_boxplot()

Solution:

ggplot(data = BeanSurvey,  aes(y = LANDAREA,x=VILLAGE)) +
  geom_boxplot()

geom_point

We can now try to create a scatter plot, using geom_point(). However, if we take the code from our earlier bar chart of AGEHH and change geom_bar to geom_point, we get an error.

ggplot(data = BeanSurvey,  aes(x = AGEHH)) + 
  geom_point()

This error is telling us that we have a "missing aesthetic: y". This is because we need to have both an x and a y variable to produce a scatter plot. Ideally for a scatter plot we need two numeric variables, so let's take a look at age on the x axis and land area on the y axis.

ggplot(data = BeanSurvey,  aes(x = AGEHH,y=LANDAREA)) + 
  geom_point()

Nothing immediately obvious is jumping out as a relationship between these variables!

As we have seen with geom_point() and geom_boxplot(), different geometries have different requirements for which aesthetics they need as a minimum, and for whether the mapped variables should be numeric, categorical or something else. The ggplot2 cheat sheet is a great resource to help work through this, as is the R Graphics Cookbook.
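
As well as those resources, you can ask ggplot2 itself what a geometry needs. The sketch below looks at the required aesthetics recorded on the GeomPoint object that sits behind geom_point(), and then captures the error from an incomplete scatter plot. The data frame toy_survey and the object names are made up for illustration, and the exact wording of the error message may differ between ggplot2 versions.

```r
library(ggplot2)

# Aesthetics that geom_point() cannot work without.
point_needs <- GeomPoint$required_aes

# Hypothetical stand-in for BeanSurvey; the values are invented.
toy_survey <- data.frame(
  AGEHH    = c(32, 45, 51),
  LANDAREA = c(1.5, 3, 2)
)

# A scatter plot with no y aesthetic: creating the object works,
# but building (or printing) the plot fails.
incomplete <- ggplot(data = toy_survey, aes(x = AGEHH)) +
  geom_point()

build_result <- tryCatch(
  {
    ggplot_build(incomplete)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Supplying the missing aesthetic gives a plot that builds without error.
complete <- ggplot(data = toy_survey, aes(x = AGEHH, y = LANDAREA)) +
  geom_point()
```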

Our plots are all a little basic so we should learn some more R so we can start to make them a bit more customised and impressive. In part two we will start to learn more about applying colours and scales to our plots. But first, let's practice with some examples of using just the data, aesthetics and geometry components to see how we can explore the data more and produce lots of different plots!

You can now go through the first set of exercises.

Or, if you feel comfortable, you can move on to Part 2.

Appendix: Useful Graphical Presentation (General Use) Resources

Our team at Stats4SD has, over the years, developed plenty of resources on data visualisation that you may find useful.

Here is an extensive guide on the general presentation of tables, graphs and maps: https://stats4sd.org/resources/555/

And here is an extensive guide on presenting research results in many forms including tables, graphs, posters and presentations: https://shiny.stats4sd.org/PresentingResults_Book//

This guide came with numerous YouTube videos, which can all be found on the Stats4SD YouTube channel: https://www.youtube.com/channel/UCs7EU95YMjhvNozJKCD92xQ//

Making Graphs in R Using ggplot2 - Part 1