Overview
Welcome to this first workbook in the R for Data Management Module.
During these workbooks you will not need to be writing any of your own R code but will see examples of code that should help you in your own projects.
“Data Scientists spend up to 80% of the time on data cleaning and 20% of their time on actual data analysis”
Data is rarely perfect from the start, even if you have sourced it from elsewhere you may have certain needs that the original data organiser did not factor for.
Therefore, you will likely need to spend sometime cleaning and manipulating your data. If you have already taken our module on data manipulation using dplyr then you will already be familiar with some of the methods shows in this tutorial.
Data cleaning can be a tedious but ultimately crucial aspect of your workflow. The end product of a data cleaning process should always be what we refer to as 'tidy data'
This is a concept crucial to the whole tidyverse approach to R. It provides a proscribed set of guidelines for the arrangement of your data. It essentially can be boiled down to;
- Each variable is a column
- Each observation is a row
- Every cell is a single value
Some of these rules may seem like common sense, but you can often be surprised how complicated and messy some pieces of data can be if not managed properly. Especially if this data has been stored in Excel.
This workbook along with the other two in this module, will teach a few tips and tricks on how to deal with your data and keeping it in order. In this first workbook we will cover how to clean and check your dataset by checking values and dealing with missing data as well as how to create new variables and recode existing ones.
Subsequent workbooks will go further by teaching you first how to reshape a dataset into different formats (Long vs Wide) before then looking at how to deal with multiple related datasets and how these can be brought together.
Tidyverse
A big advantage of R is that it is open source.Therefore the packages which people have designed to accomplish certain tasks are freely available for anyone to download and learn how to use.
In recent years, there is a curated collection of packages which has become immensely popular.The tidyverse. This is a range of packages quite closely linked to this established concept of "Tidy data" as developed by Hadley Wickham. In fact he literally wrote the book on the subject
The tidyverse contains a wide array of packages with different purposes, but the general philosophy is keeping your data and your code clean, readable and easy to understand.
For our purposes during this module of the course, our main 2 packages will be dyplr and tidyr.
Dplyr is primarily concerned with data manipulation, from this package we will be looking into how to create new variables but also edit existing ones.In a later session we will also look at merging data. For more information on this package visit this page
Tidyr on the other hand is about some of the primary rules about Tidy data as the name implies. This includes functions to reshape your data and deal with some missing values. These will be our primary focus from this package but it does have other capabilities as well. Visit this page for more information
Dataset
For the purpose of this session workbook I have generated a small fake dataset that we can use to explore some of the basic tips for cleaning your data in R.
You can probably already see looking at this small rectangle of data that there are few issues here to deal with. This data is tidy in terms of having one row per observation, one column per variable and each cell is a single value. However, it is not clean. We have missing data, possible incorrect values, and inconsistent text formatting in the plant column.
Creating new variables
Creating a new variable is something you are almost always going to do whether you are cleaning data or doing an analysis.
In R there are 2 basic ways of generating new variables. The base R way and the dplyr way.
For the case of this demonstration we are going to calculate a variable indicating the extent of the infestation of pests on the plant. This being number of pests / height of the plant. So number of pests per inch of the plant's height.
Assign Operator
Base R is the basic version of R that comes when you just install R and does not require any packages. It is how R usually operates.
When we want to crate a new variable using base R we have to use something which is colloquially known as the assign operator. <-
it is a less than symbol and a hyphen, to make a left pointing looking arrow. This will be familiar to any R user as it is how we save things
So to generate this new variable using base R we could write something like this
data1$pests_per_inch <- data1$n_pests/data1$height
Remember you will not see any output as you are saving that to a new column in your dataset, type in that dataset to take a look.
data1
Mutate function
The mutate function comes from the dplyr package and is used to generate new variables. It behaves like many of the tidyverse functions in that the first argument will be the dataset. You will already be familiar with this if you have reviewed our workbooks on data manipulation.
For a brief tutorial please watch this video.
In the mutate function you supply the column name as a string of text in quotation marks. Then type in how this variable is to be calculated. Remember however that you still need to use the assign operator as you are making edits to the dataset that you will want to save.
data1 <- mutate(.data = data1,"pests_per_inch" = n_pests/height)
data1
For a basic example such as this the base R way perhaps makes the most sense. The advantages of mutates comes from when you want to chain some edits together, using the pipe operator %>%
, a slightly more advanced data manipulation technique that you may be familiar with from our previous R course.
Using the pipe we can also create a column like this
data1 <- data1 %>%
mutate("pests_per_inch" = n_pests/height)
data1
Both mutate and the assign can be used to edit existing columns as well, not just make new ones.
Mutate also allows us to create multiple columns at once. We can just keep listing new columns together for our code to create, rather than doing them separately. For instance lets create a column converting height in inches to height in cm by multiplying it by 2.54.
data1 <-data1 %>%
mutate("pests_per_inch" = n_pests/height,
"height_cm" = height*2.54)
data1
Dplyr comes with other functions such as mutate_at()
, mutate_if()
and mutate_all()
to allow you to perform the same operation to multiple or even all your columns. So there is a lot of variability in what mutate can achieve.
Missing Data
R deals with missing data in quite a particular way, you need to tell it what is missing in most cases. If the values are blank where there is missing data then this is usually fine, R will understand that this is missing data.
R won’t automatically understand text strings such as "na" or something numeric like -9 as meaning data is missing, it will interpret these as actual values. If missing data has been coded in this way, you can let R know that this means that the data is missing while importing the data from a csv or xlsx file. The functions for reading in these files come with an argument to specify the codes used for missing data.
In the example below we have told R that cells in our csv that are one of these 3 values are missing data.
data1 <- read.csv("data.csv", na.strings = c("na", -9, -8))
If you were to do something like this with your data you would notice that those -8s and -9s have become NA
instead. This is because R will not keep these missing codes but use its own internal character to mean missing. R will only recognise missing data if it takes the value of NA
a special value in R to denote missing data. Note that NA
and "NA"
are not the same, the former is a special logical value in R and the other is a character string so again R would read "NA"
as a literal character string.
If your data has already been properly coded and is clean then this should theoretically be all you need to do with missing data but there are a few things you can do if this is not the case.
Using summary and table
Depending on the state of your missing data, there are 2 key ways we can check individual variables for missing data.
If your missing data is already coded as NA
within R, i.e. R understands it is missing, use summary to look at some key summary statistics of that variable. This includes how many values are NA
.
summary(data1$plant)
Now we don't see a summary of missing data for the column plant because there aren’t any. Let's look at the column plot_id instead.
summary(data1$plot_id)
Ah this time we get another value telling us the number of NA
values in our column. This being 1. If we take another look at our data we can see by looking at plot number plot 2 only has 1 plant while plot 1, 3, 4 and 5 all have 2 plants.. In this basic example we can make the assumption that plant 8 should belong to plot 2 so we can simply overwrite this value with what it should be.
data1
Other examples may not be so straightforward, if you find some missing data that is unexpected then you may need to either look at the questionnaire or data dictionary to understand the skip patterns to find if it was simply irrelevant and therefore should be missing for this observation. Or you may need to look back at an earlier version of the data to see if there was a missing data code which is not shown because of how R deals with missing data. Of course a lot of the time, missing data is truly missing and nothing needs to be done.
Alternatively, if your NA
values have not yet been established but you have known missing data codes you could use table
to look at how many observations in a variable are using one of these codes. Now in our data we have used -99
to mean missing in our fertilised variable. So lets have a look at that variable using table
table(data1$fertilised)
Again we have one missing observation according to these codes. Now we could follow up on why this is missing as it feels as if it shouldn't. But for our purposes here lets assume this is unfortunately truly missing data and we don't know the true answer.
Setting values to NA and replacing NA values
Now there are many many ways you can change a value in R. Some we will get onto later in this workbook. But for now lets focus on some ways that specifically deal with missing values.
Firstly, lets fix plot number. Now there are a few great ways to replace missing data with a new value. If you are editing an exiting column you do need to put quotation marks around its name.
The first of these works specifically in our case because we only need to fix one value. So what we can do is use the assign operator to save a change to this cell. Now to do this we need to specify exactly what cell we are looking at, we thankfully have a small piece of data so we can see easily it is row 8. Similar to when we made a new variable we write out the column but after it we put the row number in square brackets. This means row 8 of column plot_id
. And then we can assign it to be the number 2
data1$plot_id[8] <- 2
This is mainly useful when the number of changes you are making are quite small and targeted.
If you are making lots of the same change, an alternative would be to replace the row number with is.na(data$column_name)
this will tell R that in the rows of this column where this is missing data, replace it with this value instead. So the could would look like this.
data1$plot_id[is.na(data1$plot_id)] <- 2
is.na
is a function that tests whether or not a particular observation is equal to NA
.
The tidyverse offers another alternative that can be used with the mutate function. This is the replace_na
function. This replaces a missing value with a specified value. Here we need to first specify the column we are working on and then the new value it should take. Again this is best used when all that missing data should now take the same value. If they need to be different then you will need a more targeted method that may be a little more manual. Such as the first option on this page.
data1 <- data1 %>%
mutate(plot_id = replace_na(plot_id,2))
data1
Now there are a few options for doing the reverse.
We can modify the first 2 bits of code and use the same method for instance but sort of reverse the function as it were.
data1$fertilised[6] <- NA
or
data1$fertilised[data1$fertilised==-99] <- NA
Here the ==
operator is used to tell R to look for the rows where fertilised is equal to -9. We use ==
rather than =
because one equals sign tells R a value is x, while two equals signs asks R if a value is x. The first is an assignment of a value, the second is a logical test.
Again the dplyr version for doing this is available and it is na_if
. The syntax is exactly the same except we are giving the value that we want to become NA
rather than the new value NA
should become.
data1 <- data1 %>%
mutate(fertilised = na_if(fertilised, -99))
data1$fertilised
If your missing codes are consistent you can of course use mutate_all
instead to edit all your columns at once rather than one by one. Note that for mutate_all
you specify your arguments slightly differently. Therefore I recommend looking into the help page for this function as it is not always straight forward.
Checking for implausible values
Now unless your data collection and management has been 100% robust with absolutely no room for errors at all, you will more than likely come across the odd value that doesn't quite seem right. A value that looks implausible, a bit like an extreme outlier.
It can be quite easy for these types of errors to occur. They are typically data entry errors, where a value has been incorrectly inputted for whatever reason.
We can use the same methods we used to detect missing data to seek out these potential errors. Sometimes it is quite common for these errors to go unnoticed until they cause an issue in your analysis. So seeking them out early in your data cleaning process could save time later down the line.
For instance, lets use summary to look at our number of insects variable.
summary(data1$n_pests)
A minimum of 10 and maximum of 37, seems like a little bit of maybe large range.
Maybe we should look at this variable graphically to see if this seems a little out of place.
Ordinarily we would recommend using the tidyverse package ggplot2 for plotting your data, and we certainly recommend you do look at this package on your own time if you want to learn more, but for the sake of ease we will just quickly use the base plotting function hist
to take a quick look. GGplot2 offers a lot more flexibility in plotting your data and creates a more visually appealing graph.
hist(data1$n_pests)
Based on this plot, nothing is really standing out as implausible.
Let's instead look to height of the plant
summary(data1$height)
Now that is a significant gap in our range. 7 to 88 Inches. Moreover our maximum is 88 while our 75th percentile is only 12.75. That certainty doesn't seem correct.
Let's look at the plot just to be sure
hist(data1$height)
Without a doubt this value of 88 is far too high to be correct. There are two possible reasons as to why this value has been entered incorrectly.. One possibility is that this value is meant to be 8 and the enumerator accidentally put 88 instead. Or perhaps a number like 88 was meant to be a missing code and this was incorrectly entered. In a case like this, it would probably be best to contact either the data manager for any insight, or the enumerator who collected data for this particular observation.
For our case let's assume the first option is true. Sometimes if there is not a concrete explanation you may have to use a bit of your best judgement to decide what to do about this sort of error. Whether it is changing it to a specific value or changing this data to be missing.
We will see on the next page how we could go about changing this value.
The above tips work best when dealing with a numeric variable. If you are dealing with something more categorical i would recommend either using table
again or perhaps unique
. This function will display a list of all the unique values in a column, whether it is numeric, categorical, a date. Any variable type will work
unique(data1$plant)
Here we can see that there is an issue with our plant variable. Seems the labels were not standardised. This also shows us that we are dealing with the issue of that hard to spot whitespace at the beginning and the end of some of our text. Note that "Sorghum" and " Sorghum" will register as different text strings. The same goes for "Maize" and "Maize ". We should fix that.
Correcting labels / values
On the previous page we saw two more issues in our data
First, we have a data entry error in our height variable.
Secondly, our plant name variable must have been open text as our labels are not standardised.
To fix either of these issues we could use some methods we have already seen in regards to missing data, but I want to introduce a couple key functions that can be very useful in recoding your data.ifelse
and case_when
ifelse
works very much the same way IF() works in Excel.You provide a condition, a value for if that condition is true and a value for if that condition is false.
Let's use this function to correct our height variable. Again we can edit existing variables using mutate.
data1 <- data1 %>%
mutate(height = ifelse(height==88,8,height))
data1$height
In this case we want to change the value if it is equal to 88, and we want it changed to just 8 instead. If height is not 88 then we want it to stay the same. Ifelse
can be a very useful function for fixing mistakes and generating new variables for analysis.
Before we go onto standardising the plant variable we should remove that whitespace we have in our some of out data entry. This is actually something I would recommend you do with all your character variables, just in case some has made its way in, as it can be difficult to spot.
Thankfully r comes with a neat package trimws
. You provide the column you want to remove the whitespace from, and then tell R if you want it to do this at the left (start) of the string, the right (end) or both. This is specified using the "which" argument.
data1 <- data1 %>%
mutate(plant = trimws(plant, which = "both"))
unique(data1$plant)
We can see that the whitespace has been removed.
case_when
works in a similar way to ifelse
as it is about recoding variables.Particularly it is for generating new or editing existing categorical variables.
Again you provide a condition and a value for if that condition is true, however after each comma you provide a new condition and value. Some of you may be familiar with similar functions in STATA or SPSS. Note that the condition and value is separated by a tilde ~
. This is R notation to denote a formula. You may be familiar with it if you have done any statistical modelling in R before.
In the code below we want to change "Sorghum" and "sorg" to both be "Sorghum". We use the %in%
operator to tell R to look to see if the value in the cell matches one of the values in a list that we then provide in the c()
or combine function. So R will look for where plant is either "Sorghum" or "sorg". Then after the tilde we provide that new value we want it to be. We do the same for Maize. If there are any values not covered by your conditions in using case_when, it will be assumed by R that you want these to be NA. So always check that you are thorough in your code.
data1 <- data1 %>%
mutate(plant = case_when(
plant %in% c("Sorghum", "sorg") ~ "Sorghum",
plant %in% c("Maize", "M") ~ "Maize"
))
unique(data1$plant)
Now we need to make an important point about the order of your code in R. If you remember earlier we created a height in cm variable from our height in inches variable. However we now have changed one of those inch values. So if we look at height in cm, there is still an implausible value.
summary(data1$height_cm)
We could fix this directly but the easiest thing to do now would be to recalculate that height in cm variable.
data1 <-data1 %>%
mutate("pests_per_inch" = n_pests/height,
"height_cm" = height*2.54)
Let's check again
summary(data1$height_cm)
However, the better thing to do would be to make sure you do any and all of your analysis and cleaning in a logical and sequential order to avoid having to go backwards as it is very easy for errors to go unfound until later down the line. This is something we want to avoid. Therefore, we would advise focusing on missing and implausible data as your first steps in cleaning data in R before going onto generating any new variables.
Sorting data
Finally we will have a quick look at a couple quick data management tips you may find useful.
You should already be familiar with this function if you have undertaken our modules on data manipulation.
While not always necessary, it may be useful to sort data based on a variable other than a unique identifier. To do this, you can use the arrange
function. This sorts your data based on one or more variables which you supply as arguments. By default it will do this ascendingly but if you wish to do the opposite you can type the variable into the desc
function first
For instance lets sort on plot number
data1 <-data1 %>%
arrange(plot_id)
data1
Or in descending order
data1 <-data1 %>%
arrange(desc(plot_id))
data1
Or lets even order it on both plant first and then plot number
data1 <-data1 %>%
arrange(plant,plot_id)
data1
Subsetting data
Finally, something you may often wish to do is to subset your data. There can be many reasons for this, including an analysis that is only applicable to certain groups. Sometimes you may wish to do this subsetting early in your data cleaning process so that you have these subsets readily available.
With dplyr there is a simple function provided to do this. filter
does exactly what you expect it to, filter the data based on some condition you provide. You will be familiar with this function if you have taken our module on data manipulation.
For example let's create a subset of our data that is just Maize plants.
maize_data <- data1 %>%
filter(plant == "Maize")
maize_data
We can provide multiple arguments to filter our data. We can separate them using &
to mean AND, or we can use |
to mean OR.
External Links and Resources
Quick-R tutorial on some base R data managment mehods