Chapter 7 Merging datasets

7.1 Set Up for Session

In this session, we are going to use the following packages; dplyr, ggplot2, openxlsx, tidyr, plotly. If you get an error saying there is no package called _____, install the package as covered in the first tutorial.

library(dplyr)
library(ggplot2)
library(openxlsx)
library(tidyr)
library(plotly)

We have used the following datasets in this tutorial. Import them as covered in the previous tutorial UKIreland.csv, UKIreland.xlsx, UN_LivestockData.csv, WorldBankData.csv

7.2 Binding

After sucessfully loading the datasets, there should be seperate data frames of the UK data and Ireland data that we used in the previous tutorial. Ideally we would like to combine these together into a single data frame.

In this case the two data frames have the same structure; the columns are the same, the column names are the same and there are the same number of rows (one for each year).

We can use the rbind() function to merge the data frames together. rbind() attatches the rows of the second data frame below the first data frame.

UKIreland<-rbind(UK,Ireland)

rbind() only works because the UK data has the same columns as the Ireland data. For example, we will get an error if we had calculated a new column in one dataset but not in the other:

UK2 <- mutate(UK,pop_mill=population/1000000)

rbind(UK2,Ireland)
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

There is also a function called cbind. This binds the columns together.

UKIreland2<-cbind(UK,Ireland)

cbind() will only work if we have the same number of rows in each data. cbind also produces a slightly confusing dataset - we now have two sets columns called “population”, “country” and so on; this makes it very difficult to work with!

rbind() is used often but cbind() is not since to be useful the two datasets need to have the same number of rows with each row describing the same thing. There are more functions to allow more flexible types of merging of datasets.

7.3 Introducing merging

dplyr has a number of clever functions for merging data.

Join function Description
full_join() merges two datasets and keeping all observations from both datasets
inner_join() merges two datasets and only keeping the matching observations
left_join() merges two datasets and keeping all observations from the first (“left”) data set but only the observations that match from the second (“right”) data set
right_join() merges two datasets and keeping all observations from the second (“right”) data set but only the observations that match from the first (“left”) data set
anti_join() identifying rows present in the first (“left”) dataset which do not have a match in the second (“right”) dataset.

We will work through these sequentially. The function for merging all the data is full_join. The key identifier column between UK and Ireland would be the “date” column.

full_join(UK,Ireland,by="date")
##         country.x date    gdp.x population.x unemployment.x capital.x
## 1  United Kingdom 2000 27982.36     58892514       5.450000    London
## 2  United Kingdom 2001 27427.59     59119673       5.083333    London
## 3  United Kingdom 2002 29785.99     59370479       5.175000    London
## 4  United Kingdom 2003 34173.98     59647577       5.008333    London
## 5  United Kingdom 2004 39983.98     59987905       4.750000    London
## 6  United Kingdom 2005 41732.64     60401206       4.841667    London
## 7  United Kingdom 2006 44252.32     60846820       5.416667    London
## 8  United Kingdom 2007 50134.32     61322463       5.333333    London
## 9  United Kingdom 2008 46767.59     61806995       5.708333    London
## 10 United Kingdom 2009 38262.18     62276270       7.608333    London
## 11 United Kingdom 2010 38893.02     62766365       7.891667    London
## 12 United Kingdom 2011 41412.35     63258918       8.091667    London
## 13 United Kingdom 2012 41790.78     63700300       7.991667    London
## 14 United Kingdom 2013 42724.07     64128226       7.591667    London
## 15 United Kingdom 2014 46783.47     64613160       6.200000    London
## 16 United Kingdom 2015 44305.55     65128861       5.375000    London
## 17 United Kingdom 2016 40341.41     65637239       4.900000    London
##    country.y    gdp.y population.y unemployment.y capital.y
## 1    Ireland 26241.51      3805174       4.266667    Dublin
## 2    Ireland 28227.28      3866243       3.925000    Dublin
## 3    Ireland 32539.95      3931947       4.491667    Dublin
## 4    Ireland 41107.03      3996521       4.616667    Dublin
## 5    Ireland 47630.93      4070262       4.491667    Dublin
## 6    Ireland 50878.64      4159914       4.400000    Dublin
## 7    Ireland 54306.91      4273591       4.525000    Dublin
## 8    Ireland 61359.64      4398942       4.691667    Dublin
## 9    Ireland 61257.90      4489544       6.433333    Dublin
## 10   Ireland 52104.04      4535375      12.050000    Dublin
## 11   Ireland 48671.89      4560155      13.908333    Dublin
## 12   Ireland 52224.01      4576794      14.675000    Dublin
## 13   Ireland 49177.44      4586897      14.725000    Dublin
## 14   Ireland 52060.47      4598294      13.091667    Dublin
## 15   Ireland 55899.16      4617225      11.316667    Dublin
## 16   Ireland 62139.67      4676835       9.458333    Dublin
## 17   Ireland 63861.92      4773095       7.908333    Dublin

Notice that instead of duplicated names for the columns, we now have a suffix .x and .y. It might be useful to have a more informative suffix so that we don’t have to constantly remember which country is x and which country is y.

full_join(UK,Ireland,by="date",suffix=c("_UK","_IRE"))

QUESTION: In the previous chunk what we would expect to see if we were to replace full_join() with inner_join()?

7.4 Applying merging

7.4.1 Piping into ggplot (+ some new geoms)

We could now make a plot of GDP in Ireland against GDP in UK. We can pipe data into a graph in a similar way to piping data commands that we saw yesterday.

full_join(UK,Ireland,by="date",suffix=c("_UK","_IRE")) %>%
  ggplot(aes(y=gdp_UK,x=gdp_IRE)) +
    geom_point()

Remember that once we have started the ggplot() we use the + to link together the components of the graph. The %>% is used to make modifications to the data.

We can also assign our plots to be named objects, similar to how we have named data frames so far. This can be a good way of sequentially building up a plot. But to actually see the plot we need to repeat back the name.

plot1<-full_join(UK,Ireland,by="date",suffix=c("_UK","_IRE")) %>%
  ggplot(aes(y=gdp_UK,x=gdp_IRE)) +
    geom_point()

plot1

Maybe we want to show the time dimension on this graph in some way, as each point represents a year. Yesterday we used geom_line() to make line graphs but in this case it may not be so useful. geom_line joins up the points from left to right. But that may not be the same order as it is from year to year. So instead we can use the function geom_path, which joins the points in data order. If we want to be sure that the data order is sensible, we can make sure we arrange it by date first.

plot1+
  geom_path()

To show this even clearer we could label the points. geom_text() adds text to the graph. and the text column needs to be assigned to an extra aesthetic called “label”.

plot1+
  geom_path()+
    geom_text(aes(label=date))

QUESTION:
Modify the previous code to produce the same plot but; 1. colour the text labels red 2. place the labels above the points instead of on top of the points; 3. make the points bigger; 4. use a dotted line instead of a full line. 5. You might need to look into the help menus for some of these geoms to find the appropriate options, or remember the (probably even more useful) R Graphics Cookbook: http://www.cookbook-r.com/Graphs/

plot1+
  ???

Another way of helping identify which point is which would be to use an interactive graph using the plotly library. This is a really easy way of making a really nice interactive graph.

To use plotly you need to assign your plot to an object, and then run ggplotly() around that object. Any information you want to be available in your interactive plot needs to be mapped to an aesthetic in the ggplot statement.

ggplotly(plot1)

But these don’t have to be real geoms, you can use any name and then this will be carried through to the interactive plot. E.g. - this code (suprisingly) works:

plot2<-full_join(UK,Ireland,by="date",suffix=c("_UK","_IRE")) %>%
  arrange(date) %>%
   ggplot(aes(y=gdp_UK,x=gdp_IRE,sgndskjgfbsljk=date)) +
    geom_point()+
      geom_path()

ggplotly(plot2)

7.4.2 Challenging Merging

Let’s bring in another dataset, from a different source, to do some more interesting, and difficult, merging tasks.

The Livestock data contains data from a similar period, for all countries in the world, on the number of different livestock units within that country.

Click on the dataset in the environment window to take a look at the data.

Let’s try and merge this with the combined data from UK & Ireland (UKIreland).

QUESTION: Take a close look at these two datasets. Can you see any possible issues with trying to merge these datasets together?

These issues can be resolved by having i) multiple key join fields, brought together using c() ii) changing the specification slightly - insteady of by=“key_column_name” we can have by=c("key_column_name_from_1st_data"="key_column_name_from_2nd_data").

Before running the code try to answer the following question:

QUESTION: How many rows and columns would we expect to see in the result of each of these join statement. HINT: Check the Environment tab for lthe size of the data frames.

full_join(UKIreland,Livestock,by=c("country"="Area","date"="Year"))

inner_join(UKIreland,Livestock,by=c("country"="Area","date"="Year"))

anti_join(UKIreland,Livestock,by=c("country"="Area","date"="Year"))

anti_join(Livestock,UKIreland,by=c("Area"="country","Year"="date"))

QUESTION: We are interested in comparing the livestock populations of Ireland and the UK. Which of these options would be the most useful of these joins to use for further analysis? Assign the most useful join to an object called UKIrelandLivestock

UKIrelandLivestock<-