
Overview

In this workbook we will explore the concept of relational data and how we can manage it in R.

Relational data is a concept common to data science. It describes a collection of data sets that can all be linked together through various relationships. Usually these tables sit at different hierarchical levels and deal with different units of analysis, such as one at the community level, one at the household level, one at the household member level, and so on.

Within the data tables, these relationships are defined by unique unit identifiers known as keys. There are two types of key:

  • Primary keys uniquely identify each row of a given data set.
  • Foreign keys are variables in a dataset which identify a particular unit but which will not be unique to each row of that table. Instead they can be used to link to other tables, and they are usually the primary key of another dataset.

For instance, you could have one dataset of farms...

Farm_ID (Primary Key)   Variables ...
1                       ...
2                       ...
3                       ...

This can be linked to a table of plot data:

Farm_ID (Foreign Key)   Plot_ID (Primary Key)   Variables ...
1                       1                       ...
1                       2                       ...
2                       3                       ...
2                       4                       ...
3                       5                       ...
3                       6                       ...

Whenever you are dealing with projects that will involve data collected/available at multiple hierarchies, it is highly recommended you use relational datasets. For instance if you are collecting data in a longitudinal/repeated measures project, you could have a dataset for each point of data collection that could be linked through unique identifiers.

Because of these relationships we are able to easily merge data from different tables together. These relationships will take one of three forms.

  • 1-1 relationships. Each row in table A links to at most one row in table B, and both tables use the same primary key. This is common in longitudinal research, where each round of data collection has one row per participant.

  • 1-Many relationships. Each row in table A may link to one or more rows in table B, but each row in table B can only link to one row in table A. This is where foreign keys come into play, for example linking individual farmers in one table to all of their plots in another: a farmer can have more than one plot in table B but only one entry in table A.

  • Many-Many relationships. A bit more technical: each row in table A can link to multiple rows in table B, and likewise each row in table B can link to multiple rows in table A. For example, each plot can contain many plant species, and each plant species can be grown on many plots.

From these relationships, we also have several different types of merging.

  • Column binding sticks together two data tables that have the same number of rows, essentially appending additional variables to the same set of observations.
  • Row binding appends additional rows/observations onto an existing data set. The tables must share at least some column names, or ideally have exactly the same columns.

Joins are for merging information about different units, such as appending information about farmers onto the data about their plots. Matching between the keys is what drives this method. There are several types of join:

  • Full joins keep all rows from both data sets, even rows that could not be matched.
  • Left/right joins keep all rows from one of the two datasets, depending on which direction is specified in the join. Any rows from the secondary dataset that are not matched will be dropped from the output.
  • Inner joins keep only those rows where a match was found.

Additionally you can use these joins following data aggregation to bring information up a level, for instance you can calculate the average size across all plots for each individual farm within your plot table and then merge this information up to your farm table.
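
For instance, here is a minimal sketch of that plot-to-farm aggregation. The names plot_data, farm_data, Farm_ID and plot_size are hypothetical placeholders matching the intro example above, not the workbook's actual data.

library(dplyr)

# Summarise the plot table down to one row per farm, then attach the result
# to the farm table; farms with no plots simply get NA for avg_plot_size
plot_summary <- plot_data %>%
  group_by(Farm_ID) %>%
  summarise(avg_plot_size = mean(plot_size, na.rm = TRUE))

farm_data <- farm_data %>%
  left_join(plot_summary, by = "Farm_ID")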

We can also do the reverse and bring a variable down from a higher level during analysis. For example, say we wanted to compare average plot sizes between male and female farmers: we can bring the sex variable down from the farm level to the plot level, group plots by the sex of the farmer, and take the average.
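
Continuing the hypothetical tables from the sketch above, and noting that the sex variable here is purely illustrative and not part of the workbook's data, that direction might look like this:

# Bring the farm-level sex variable down to plot level, then summarise by it
plot_data %>%
  left_join(select(farm_data, Farm_ID, sex), by = "Farm_ID") %>%
  group_by(sex) %>%
  summarise(avg_plot_size = mean(plot_size, na.rm = TRUE))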

In this workbook, we will look through first how to group and summarise your data before taking an extensive look at how we can start merging some of our data together.

Data

For this workbook we will be using four relational data tables, each storing data about a different unit of analysis. However, each and every one can be linked across this hierarchy.

At the top level we have some data about the villages in our study.

Below this we have information on a number of farmers who took part in our study.

Further down still, we have our plot level data.

Lastly, we have some further information on fertilisers.
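
In the interactive version of this workbook these tables are displayed directly. If you are following along in your own session (assuming the four tables are already loaded), a quick way to preview their structure is:

library(dplyr)

# glimpse() prints each column's name, type and first few values
glimpse(Village_data)
glimpse(Farmer_data)
glimpse(Plot_data)
glimpse(Fertiliser_data)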

Aggregating data

With any data analysis, at some point you are going to have to summarise your data in some way. Sometimes you may even need to do this prior to analysis as part of your data cleaning process, such as for the generation of new variables. This is certain to be true if you are handling relational data. If you have taken our module on data manipulation you will already be partly familiar with some of these steps.

More often than not, you will need to do this summary by some sort of dis/aggregation variable.

In order to do that in R, you first need to know how to group your data.

Group_by

Grouping data in R with dplyr is very simple.

You use the group_by function to create implicit groupings within your data.

Let's first look at grouping our plot data by the farmers who tend them.

We simply provide the name of the variable we want to use to define our groups.

Plot_data <- Plot_data %>%
  group_by(farmer_id)

Plot_data

Looking at the data, it will look like nothing has changed. This is because the groups are implicit; the grouping is, in essence, now part of the metadata of the data. But R will recognise that this grouping exists within the dataset and perform operations accordingly.

Indeed, if we take a look at the structure of our data using str, which displays the structure of an R object, we will see that rather than our data being a data.frame (the standard data format in R), it is now a grouped_df (grouped data frame). From this we can see that we have created 19 groupings in our data, and for each group we can see which rows belong to it.

So our first group (farmer 1) has 3 rows in the data: rows 6, 10 and 38.

str(Plot_data)

If for any reason you then want to get rid of this grouping, perhaps because at a later point you want to summarise across your whole data rather than by group, or to group by a different variable entirely, you can simply add ungroup and your data will be ungrouped.

Plot_data <- Plot_data %>%
  ungroup()

Summarise / grouped mutates

Now that we know how to group our variables, let's look at how we can start aggregating our data.

The simplest way to start creating data summaries is to use the summarise function from the dplyr package.

This will generate summary statistics that you define for each of the groups in your data, or for your whole dataset if your data is not grouped.

In this case we have decided to calculate the total area of each farmer's plots. As we want a total, we use the function sum to add up all the plot sizes. Note that we have added the argument na.rm = TRUE; this makes sure that any missing values are removed from the calculation. If this is not included and there is missing data, then our result would simply read NA rather than an actual number.

Sum1 <- Plot_data %>%
  group_by(farmer_id) %>%
  summarise(total_area = sum(size, na.rm = TRUE))

Sum1

When using summarise, the number of rows returned will be equal to the number of groups in your data. We had 19 groups, so we get 19 rows. If our data was not grouped, then only 1 row would have been returned. Also, when summarising data, the resulting data will only contain the variables used to group your data and the summary variables you have created. All other variables will be dropped.

Our resulting data has moved data from the plot level and summarised it up to the farmer level.

Just like mutate we can start creating many different summary variables by separating the calculations with a comma.

In this example, we have additionally created summaries for the average size of a farmer's plots and the standard deviation of plot size, and finally we have used n() to generate a variable counting how many rows are in each group, or in other words how many plots each farmer has. Note that we have NA values for sd where the farmer has only 1 plot, as a standard deviation cannot be calculated from a single value.

Sum2 <- Plot_data %>%
  group_by(farmer_id) %>%
  summarise(total_area = sum(size, na.rm = TRUE),
            avg_area = mean(size, na.rm = TRUE),
            sd = sd(size, na.rm = TRUE),
            nplots = n())

Sum2

There are a number of other functions that can be used with summarise to apply the same calculation to multiple columns rather than going through them one by one, such as across() from dplyr; see the dplyr documentation for more on this.
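
As a small sketch, here is across() applied to every numeric column at once. Note that this will also summarise plot_id, since it is numeric, so in practice you would select only the columns you care about.

Sum_across <- Plot_data %>%
  group_by(farmer_id) %>%
  # apply both a total and a mean to each numeric column in one go
  summarise(across(where(is.numeric),
                   list(total = ~ sum(.x, na.rm = TRUE),
                        mean  = ~ mean(.x, na.rm = TRUE))))

Sum_across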

You can also use mutate to generate these same variables, while keeping your data set at the plot level.

We keep everything within the function exactly the same.

All we do is swap summarise for mutate.

We have also added an arrange to sort the data by farmer_id rather than plot_id, just to make this demonstration a little clearer.

Note that we have kept all of our plot-specific data and we still have 50 rows. But now we also have variables for total_area, avg_area, sd and nplots. The results are the same as those calculated by summarise, but here they are replicated for each row within a group.

Sum3 <- Plot_data %>%
  group_by(farmer_id) %>%
  mutate(total_area = sum(size, na.rm = TRUE),
         avg_area = mean(size, na.rm = TRUE),
         sd = sd(size, na.rm = TRUE),
         nplots = n()) %>%
  arrange(farmer_id)

Sum3 

This could be useful for creating a new variable that sits at one level but depends on a variable from a level higher up. For example, by keeping our data at plot level we could generate a new variable giving the proportion of each farmer's total area that each plot represents.

Sum3 <- Sum3 %>%
  mutate(plot_area_prop = size/total_area)

Sum3 

Binding data

As mentioned in the introduction, we have two types of ways to merge our data. Binding and joining.

First, let's look at binding. This involves combining data tables which are very similar; they will not contain related information from different levels, but generally either additional data for the same set of observations or additional observations for the same set of variables.

The two data tables would therefore have a very similar structure. We are not trying to match data from two separate tables but rather add to an existing table.

There are two ways that we can do this. By rows or by columns.

Row binding

Row binding would be used when you have two or more datasets containing the same variables, the difference is that they contain separate sets of observations. For example, you could have data collected in one location and data collected in another but this data has been stored apart. They contain the same variables as the data collection tool was identical but different observations. You can bind these datasets together by their rows.

Let's pretend that our farmer data was originally stored in two different data sets and now we want to combine them together.

First we have farmers 1 through 12

Then farmers 13 through 20.

It is thankfully very simple to bind these two datasets as we have the same number of columns in our data, and they have the same names.

We can use the base R function rbind to achieve this binding. All we need to do is provide the names of the data sets we want to bind.

rbind(Farmer_dataA, Farmer_dataB)

Now with rbind there is a little issue that occurs if there are variables in one dataset that aren't in the other. Let's add an age variable to our first dataset but not the second, and see what happens when we try to merge them.
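
The Farmer_dataA_Age data set is not created in the text; one hypothetical way to make it would be to add a column of made-up ages to Farmer_dataA, for example:

# Purely illustrative: random ages, added to the first data set only
Farmer_dataA_Age <- Farmer_dataA %>%
  mutate(age = sample(25:70, n(), replace = TRUE))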

rbind(Farmer_dataA_Age, Farmer_dataB)

We get an error instead. This is because rbind requires that both data sets have exactly the same columns, in both number and name.

To get around this issue, there is the bind_rows function from dplyr. This matches columns by their names (so make sure those are consistent across datasets), and if a column is not present in both datasets, it will simply be filled with NA in the data set where it is missing.

bind_rows(Farmer_dataA_Age, Farmer_dataB)

Column binding

Column binding on the other hand would be utilised when we have the same set of observations, but the variables are different. Perhaps we have taken additional measurements at a later point and want to bring this together with the original data.

For example, let's look to our plot data and bring in data on the yield of the crops grown on each plot. Data we have collected at a later point.

For column binding, as you may expect, the base R function is cbind and the dplyr alternative is bind_cols. Unlike with row binding, there is not really any difference between how the two functions operate.

Both require the same number of rows in order to operate. The rows should also be in the same order; if they are not, you could use arrange first to make sure that they are.
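
If you are unsure of the ordering, a quick sketch of sorting both tables by plot_id before binding (assuming both contain plot_id) would be:

Plot_data  <- Plot_data %>% arrange(plot_id)
Plot_data2 <- Plot_data2 %>% arrange(plot_id)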

cbind(Plot_data, Plot_data2)

and with bind_cols

bind_cols(Plot_data, Plot_data2)

Now you will notice that because both datasets contained plot_id, our resulting data table has two id columns, unhelpfully named "plot_id...1" and "plot_id...6". This is because column binding will not merge information from columns that have the same name; instead it will rename them or, in the case of cbind, simply repeat the name.

Therefore, if we were to use column binding it would be a good idea to drop the plot_id from one dataset and then perform the bind.

We can use select and put a - before the name of the variable to remove it from the data.

Though of course this stresses the importance that these IDs MUST be identical in both data sets: the same numbers, with those numbers meaning the same thing.
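
A simple way to check that assumption before binding is to compare the two id columns directly, for example:

# TRUE only if both tables hold exactly the same plot ids in the same order
identical(Plot_data$plot_id, Plot_data2$plot_id)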

Plot_data3 <- Plot_data2 %>%
  select(-plot_id)

Now when we bind the data sets, we keep plot_id as normal.

Plot_data <- bind_cols(Plot_data, Plot_data3)

Plot_data

This binding may have been more smoothly achieved if we had actually done a join instead as we shall see later in the workbook.

Full join

Joining is the more traditional form of data merging: we are bringing together data from multiple related data tables, and these data tables do not contain the same level of information.

In a full join the resulting data table will contain all the rows from both data tables, whether there was a match between them or not. Everything is retained; nothing is lost.

This could be useful when merging data from two separate survey rounds, as the second round could include people who were not interviewed in the first, or be missing people from the first round due to attrition.

If we don't want to lose those who were missing at one of the time points, a full join would keep them all.

Let's again look at merging our two pieces of plot data but rather than having yield data for all of our plots, we have it for 40 out of 50 of them.

Plot_data
Plot_data4

From the dplyr package we would want to use the full_join function.

We first specify the two datasets we want to merge.

Then we use the by = argument to tell R which variables we are using to merge the data. In this case we want to use the primary key of these two data sets, plot_id. Remember that a primary key is the key that uniquely identifies each row in your data. With our plot level data sets, each row is uniquely identified by that plot identification number.

full_join(Plot_data, Plot_data4, by = "plot_id")

We can see how this was quicker and simpler than trying to achieve the same result with column binding.

Note that as both pieces of data are at the plot level, we had a 1 to 1 relationship between the data sets. For every plot in table 1 there will be no more than 1 plot in table 2.

Now notice that there are some NA values for yield. This is because there was not a match between the two data sets. We did not have yield for plots 6, 8, etc. But because we used a full join, these rows are kept and not excluded. All data remains intact and we still have 50 rows.
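
If you want to see exactly which plots had no match, a quick sketch (assuming the workbook's tables are loaded) is to filter the joined data for missing yield:

full_join(Plot_data, Plot_data4, by = "plot_id") %>%
  filter(is.na(yield))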

Let's try this again with an additional plot in Plot_data4 that does not exist in Plot_data.
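
That extra row is not created in the text; a hypothetical way it could have been added, assuming Plot_data4 holds only plot_id and yield (the yield value of 320 comes from the next paragraph), is:

Plot_data4 <- Plot_data4 %>%
  bind_rows(tibble(plot_id = 51, yield = 320))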

As we know it is our 51st plot, let's take a look at the details of this plot only. We can see that it has a yield of 320 kg/ha. Note that it appears as row 41 rather than 51, because our Plot_data4 was missing 10 rows.

Plot_data4 %>% filter(plot_id == 51)

When joining this data now, plot 51 is kept in the output, but all of its other variables, including farmer_id, will be missing. Not always particularly helpful.

Plot_data5 <- full_join(Plot_data, Plot_data4, by = "plot_id")

Plot_data5 %>% arrange(desc(plot_id))

I tend to find full_join most useful when using it in a similar fashion to rbind: when I have similar datasets with many of the same variables but additional observations, and perhaps even additional variables for existing observations.

Let’s have another look at merging two instances of our farmer data. This time it’s slightly different: only some of the rows are shared between the two datasets and one dataset has an additional variable, household size.

Farmer_dataC
Farmer_dataD

So in our second data set we have 2 additional farmers Sam and Dave, and then also additional household size information for farmers 9 to 12.

So let's join these together and see what happens

full_join(Farmer_dataC, Farmer_dataD, by = "farmer_id")

Now we have 22 rows, as we would expect: one for each farmer. But something has gone wrong with our village and name variables. This is because we only told R to match on farmer_id. While this logically makes sense, it means that R will treat other columns with the same names as different variables, when in our case they are the same. As a result we get columns like name.x and name.y.

In order to avoid this we also need to include any common columns between our two data sets by listing them.

full_join(Farmer_dataC, Farmer_dataD, by = c("farmer_id", "village_id", "name"))

This seems to solve the problem and we have kept information from both data tables even where we were missing household size information.

If our key variables were called something different in the two data sets, we can handle this as well. For example, say that in the second dataset plot_id was instead called plot_name. Then we could have written the join as below, with the column name from the first data set on the left and the column name from the second data set on the right.

full_join(Plot_data, Plot_data4, by = c("plot_id" = "plot_name"))

Left/right join

A left or right join will keep all rows from whichever is deemed to be the primary dataset. For a left join this means it will keep everything from the data set written on the left of the function call; a right join will keep all rows from the dataset on the right.

This is regardless of whether or not there is a match.

So if we were to join the two sets of plot data from the previous page, we could use a left join to stop ourselves from adding in that 51st plot for which we have very little data.

left_join(Plot_data, Plot_data4, by = "plot_id")

We have kept all 50 rows from the first set of plot data, including those without any yield data in the second set. All while removing that 51st plot from our data.

If we were to keep the arguments the same but change this to a right join, we would see 41 rows: the 40 plots for which we have yield information as well as the other variables, plus the one plot for which we only have yield data. This removes the 10 plots where we have the original information but no yield.

right_join(Plot_data, Plot_data4, by = "plot_id")

Inner join

An inner join trims this down further still and will only keep the rows where there is a corresponding match in both data sets.

So in our plot example we would drop both the plots which have no yield data, and the 51st plot for which we have no other information.

inner_join(Plot_data, Plot_data4, by = "plot_id")

As a result we have 40 rows, because we had 40 plots where information was available from both datasets.

Bringing it all together

Finally, let's end by bringing these concepts together: first by merging data down a level, and then with a more ambitious example of bringing data all the way up from the bottom level (fertilisers) to the top (villages).

Example 1

First let's bring some data down a level. For instance let's say we want to analyse some data at the plot level but we want to group this data by village to generate a village level average plot size.

But we do not know which farmer lives in which village just by looking at this plot level data.

We do, however, have farmer_id, which can link us to the farmer data, a data set which includes the village id as a variable.

So we need to merge information from the farmer data and the plot data using this link. Here our foreign key is farmer_id as it does not uniquely identify our rows in the plot data. But it is identifying a particular unit of observation, the farmer. This foreign key (farmer_id) links up to the primary key of the farmer data (farmer_id).

This is an example of a 1 to Many relationship as for each individual farmer there can be many plots.

Now you may have noticed that we have 20 farmers in our data, but when we grouped our plots by farmer we had only 19 groups. That is because we have one farmer without any plots.
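
A quick way to confirm which farmer this is would be an anti_join, which keeps the rows of Farmer_data that have no matching farmer_id in Plot_data:

# Returns the farmer(s) with no plots in Plot_data
Farmer_data %>%
  anti_join(Plot_data, by = "farmer_id")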

This fact helps inform our decision on what type of join we need to use.

We are primarily concerned with our plot data, so those are the rows we want to keep. We don't want additional rows with little information of use, so we do not want to use a full join, because that would give us a row for this additional farmer despite them having no plots in our data. This row would be useless.

In this example we don't have any plots belonging to farmers not in our data, but if this were the case, again we would have rows with unhelpful information as we would not be able to identify which village they lived in because they are not in the records.

So in this case we want an inner join, because we only want to keep the rows that have a match; then we will have both plot size and village information for every row.

Plot_data <- Plot_data %>%
  inner_join(Farmer_data, by = "farmer_id")

Plot_data

So we have successfully brought down the farmer data, including both the village_id and the farmers' names.

We could extend this further and bring down the village level information as well.

Plot_data <- Plot_data %>%
  inner_join(Village_data, by = "village_id")

Plot_data

Now we can use group_by and summarise to calculate that village level average plot size.

Plot_data %>%
  group_by(village_name) %>%
  summarise(avg_plot_size = mean(size, na.rm = TRUE))

Example 2

Now let's start moving data the other way. In the village level data, I want to calculate the average total expenditure on fertilisers. So I have to start with the fertiliser data and bring this all the way up to the village level.

Firstly I of course need to match fertilisers and plots. This is actually an example of a many to many relationship because a plot can have up to two fertilisers and a fertiliser can be used on many different plots.

Now as our plot data is at the plot level, we actually have the names of the two fertilisers in separate columns. Therefore we cannot match them straight away. We could do it in staggered steps, first bringing in the costs for fertiliser 1 and then the costs for fertiliser 2, though we would then have to reshape the data to long format so we could combine those two cost columns; a sketch of that staggered approach follows below.
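
The staggered approach might look something like this (not run here): each join matches one fertiliser column against the fertiliser data's primary key, and the suffixes keep the two sets of fertiliser variables apart.

Plot_data %>%
  # match the first fertiliser column to the fertiliser table's primary key
  left_join(Fertiliser_data, by = c("fertiliser_1" = "fertiliser_id")) %>%
  # then the second, adding suffixes to tell the duplicated columns apart
  left_join(Fertiliser_data, by = c("fertiliser_2" = "fertiliser_id"),
            suffix = c("_fert1", "_fert2"))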

It would be simpler if we actually reshaped our plot data first and then merged it with the fertiliser data.

So, using what we learned in the previous tutorial, let's create a long version of our plot data.

Long_plots <- Plot_data %>%
  pivot_longer(
    cols = fertiliser_1:fertiliser_2,
    names_to = "Fertiliser_no",
    values_to = "fertiliser_id"
  )

Long_plots

Now we can bring in the fertiliser data. Again using left join, as we want to prioritise the plot information.

Long_plots <- Long_plots %>%
  left_join(Fertiliser_data, by = "fertiliser_id")

Long_plots

Okay, so we now have our price information in the data. However, there is one further step we need to take before we can go further up the levels of this data. Price is recorded per hectare, so to get total expenditure we need to multiply it by the size of the plot. We can do this using mutate.

Long_plots <- Long_plots %>%
  mutate(price = price_per_ha * size)

Long_plots

Now we can start summarising and bring the data up in sequential steps.

Starting with going from that long plot data, which is technically at the fertiliser level, back to the normal plot level.

We can do this all in one pipe.

Starting with long plots we group by plot_id.

Then we summarise to calculate a plot level total expenditure on fertilisers.

Finally, we right join onto our plot data. We use a right join because we start with Long_plots and pipe it through, so it will always be the first (left) argument in that join; but we only want the rows from the plot data, so we join to the right.

Plot_data <- Long_plots %>%
  group_by(plot_id) %>%
  summarise(price_total = sum(price, na.rm = TRUE)) %>%
  right_join(Plot_data, by = "plot_id")

Plot_data

We now simply repeat the process to bring this up to the farmer level, switching out our data arguments and changing the keys to those that link these data sets.

Farmer_data <- Plot_data %>%
  group_by(farmer_id) %>%
  summarise(price_total = sum(price_total, na.rm = TRUE)) %>%
  right_join(Farmer_data, by = "farmer_id")

Farmer_data

Finally, we repeat this one more time to bring it all up to the village level.

Village_data <- Farmer_data %>%
  group_by(village_id) %>%
  summarise(price_average = mean(price_total, na.rm = TRUE)) %>%
  right_join(Village_data, by = "village_id")

Village_data

If we wanted to get fancy this could have all been done in the one pipe.

Note that as we start with Plot_data we do not need to join back onto it like we did before, and because the plot data already has the farmer_id we can go straight to summarising at the farmer level before moving the data up to the village level.

Plot_data %>%
  pivot_longer(
    cols = fertiliser_1:fertiliser_2,
    names_to = "Fertiliser_no",
    values_to = "fertiliser_id"
  ) %>%
  left_join(Fertiliser_data, by = "fertiliser_id") %>%
  mutate(price = price_per_ha * size) %>%
  group_by(farmer_id) %>%
  summarise(price_total = sum(price, na.rm = TRUE)) %>%
  right_join(Farmer_data, by = "farmer_id") %>%
  group_by(village_id) %>%
  summarise(price_average = mean(price_total, na.rm = TRUE)) %>%
  right_join(Village_data, by = "village_id")
