Overview
This set of exercises helps to test your sills from each of our three tutorials from our data management module.
You can review the tutorials for assistance at any time here
Part 1 - Data Cleaning
Dataset
To start with you will work on the following data set which contains a number of variables for a list of 100 farmers.
farmer_data
Exercise 1a - Trim Whitespace
Let's start by checking for any whitespace, i suspect there may be some in our name column.
Let's perform a check by using the unique
function to look at all the unique values for name
unique(farmer_data$name)
## [1] "Treven" "Brandon" "Kevin" "Erik" "Trusten"
## [6] "Saqr" "Bradford" "Jack" " Codie" "Frank"
## [11] "Tyler " "Joseph" "Royse" "Reece" "Michael"
## [16] "Paul" "Nathan" "Isaiah" "Taylor" "Matthew"
## [21] "Glen" "Marzooq" "Richard" "Justin" "Jonah"
## [26] "Zaahir" "Kabeer " "Lawrence" "Hamad" "Waseem"
## [31] "Sonim" "Hamza" "Zainuddeen" "Martin" " Ryan"
## [36] "Orran" "Geoffrey" "Zachery" "Lucas" "Cass"
## [41] "Luis" "Donald" " Hunter" "Devon" "Anthony"
## [46] "Carmen" "Kristin" "Briana" "Terra" " Priscilla "
## [51] "Danielle" "Vanessa " "Patricia" "Kiana" "Tammy"
## [56] "Astrid" "Savanna" "Waleeda" "Marilu" "Samantha"
## [61] "Amber" "Manaara" "Sabeeka" "Shamika" "Heaven"
## [66] "Tamanna" "Magan" "Lasha" "Sri" "Cheyenne"
## [71] "Yesenia" "Autum " "Aleksandrina" "Sonia" "Nicolette"
## [76] "Faseeha" "Alexis" "Ceara" "Danica" "Kendrer"
## [81] "Antoinette" "Kayleigh" "Micaela" "Amari" " Samantha"
## [86] "Olivia" "Breanna" "Geovana" "Najma" "Selena"
## [91] "Aanisa" "Charmaine"
Seems our suspicions were correct, we have around 10 values for name that have either a leading or trailing space in them.
Exercise 1a: Using mutate
and trimws
, remove the whitespace present in the name column from both the beginning and the ending of the variable entries
REMEMBER You can inspect the results of your changes at any time by writing the name of the data frame into the code chunk at any time
farmer_data <- farmer_data %>%
mutate()
unique(farmer_data$name)
farmer_data <- farmer_data %>%
mutate(name = trimws(name, which = "both"))
unique(farmer_data$name)
Exercise 1b - Missing Data
Recall that in R we need to be very explicit when it comes to defining data. It MUST be coded as NA
In our dataset, we have unfortunately not told R explicitly what our missing data codes were when we imported the data. Therefore, there are a few variables which have some values equalling -9. However this was a missing data code, but R will see this value as literally -9.
Let's check first which variables this issue applies to.
Exercise 1bi: Use summary
to check which continuous variables in the dataset are still using the missing data code -9
rather than being explicitly NA
. Note that -9 will be the minimum value in each of these cases.
summary(farmer_data)
farmer_data
# Should have found that at least 1 value in each of hh_size, income and costs were equal to -9
You should have identified a few variables unfortunately using this code.
Exercise 1bii: Using mutate
and na_if
, correct the variables that are still using -9
so that these are instead coded as NA
farmer_data <- farmer_data %>%
mutate()
summary(farmer_data)
farmer_data <- farmer_data %>%
mutate(hh_size = na_if(hh_size, -9),
income = na_if(income, -9),
costs = na_if(costs, -9))
summary(farmer_data)
#This would only be useful if -9 will ALWAYS be a missing code and never a true value of a variable
#Recommend looking into mutate_at or mutate_if to see how you can make the same edit across multiple but not all columns
Exercise 1c - Correcting labels
Now something else looked off about our sex variable. Let's use unique
to have a look at all of the unique values of sex
unique(farmer_data$sex)
## [1] "1" "Male" "m" "M" "f" "Female" "2" "F"
Hmm, seems we have multiple values to mean the same thing unfortunately. Male and Female have both been coded four different ways each.
Exercise 1c: Using mutate
and case_when
, standardise the sex variable such that
Male, m, M & "1" are all recoded to equal "Male"
Female, f, F & "2" are all recoded to equal "Female"
Hint Think back to how we used the %in%
operator
farmer_data <- farmer_data %>%
mutate()
farmer_data <- farmer_data %>%
mutate(sex = case_when(
sex %in% c("Male", "m", "M", "1") ~ "Male",
sex %in% c("Female", "f", "F", "2") ~ "Female"
))
table(farmer_data$sex)
Exercise 1d - Correcting outlier values
Let's use a histogram to inspect one of our continuous variables for possible data entry errors
hist(farmer_data$age)
Looks like we have one implausibly high outlying value for age.
Upon contacting the enumerator who collected data for this farmer, you find the true value was meant to be 18 years It was just inputted wrongly.
Exercise 1d: Using any method demonstrated in the workbook, correct this implausibly high value of 180 years old to the correct value of 18 years old.
???
hist(farmer_data$age)
farmer_data$age[farmer_data$age == 180] <- 19
#OR
farmer_data <- farmer_data%>%
mutate(age = ifelse(age == 180, 18, age))
hist(farmer_data$age)
Exercise 1e - Creating a new variable
As part of the data analysis plan you notice that part of the analysis will include an evaluation of the gross income of farmers, a calculation of income minus costs. However, you find this was not calculated directly as part of the data collection.
Exercise 1e: Using mutate
create a new variable called gross_income
which is the income minus the costs.
farmer_data <- farmer_data%>%
mutate(gross_income = income-costs)
farmer_data
Exercise 1f - Sorting data
Finally, you may have noticed that there is currently no particular order to our data. While not mandatory, sorting data can be useful for presentation, readability and help users easily find certain pieces of information.
We have a natural sorting variable in our unique identifier "id"
Exercise 1f: Using arrange
, sort the data using the id variable so that the rows are in order from 1 to 100 by id.
farmer_data <- farmer_data%>%
arrange(id)
farmer_data
Part 2 - Data Reshaping
Datasets
Before we test your skills in pivoting data, let’s look at some additional data we have for this project. It is at the level of the plot while our previous dataset was at the level of the farmer.
Each farmer can have up to 8 plots, they are labelled id_1, id_2 and so on. You will notice that many of the values in these columns are NA as the farmers do not all have this many plots.`
plot_data
Secondly, you have been provided by an outside source with some data about the villages that have been used in this project. However, it has unfortunately been passed along in long format with individual rows for the villages' population, sampling weights and whether or not they received support from an NGO.
village_data
Exercise 2a - Pivot longer
Firstly, let's get our plot data into a more useful shape. When conducting analysis it will be much simpler to keep comparable information into the same column. i.e keep all values for the area of plot in just the single column.
Exercise 2a: Using pivot_longer
reshape the wide plot data into a new long data set, save this output as plot_data_long
and call the new variable, that will list the plot number within a farmer, "plot_num".
Please include the values_drop_na = TRUE
argument into your answer. This will stop unnecessary rows being created for where all variables will be NA
. In other words, all farmers will only have as many rows as they will plots.
Hint: If you are stuck at all, look back at the session 2 workbook and how we utilised the special string of ".value"
.
plot_data_long <-plot_data%>%
pivot_longer(
values_drop_na = TRUE
)
plot_data_long <-plot_data%>%
pivot_longer(
cols = id_1:fertilisers_8,
names_to = c(".value", "plot_num"),
names_sep = "_",
values_drop_na = TRUE
)
plot_data_long
Exercise 2b - Pivot wider
Next let's do the opposite to that village data.
Exercise 2b: Using pivot_wider
, transform the data into wide format such that you only have one row per village. Save this to a new object titled village_data_wide
village_data_wide <- village_data%>%
pivot_wider()
village_data_wide <- village_data%>%
pivot_wider(
names_from = information,
values_from = value
)
village_data_wide
Part 3 - Data Merging
Datasets
For the first exercise of this final section, we are going to bind our previous data sets with some additional observations for our farmers and plots. 20 new farmers with 45 plots between them.
farmer_data_additional
plot_data_additional
Exercise 3a - Binding
As the datasets for our additional data sets which use the same formats that we have developed from our tidied up data. We can move straight to appending them onto our existing data frames.
Exercise 3a: Using bind_rows
, append on the new observations to farmer_data and plot_data_long.
NOTE: The cleaned version of the farmer data has been saved to a new object called farmer_data_clean
. Please use this dataset instead of farmer_data
Farmer Data
farmer_data_full <- ???
farmer_data_full
farmer_data_full <- bind_rows(farmer_data_clean, farmer_data_additional)
farmer_data_full
Plot Data
plot_data_full <- ???
plot_data_full
plot_data_full <- bind_rows(plot_data_long, plot_data_additional)
plot_data_full
Exercise 3b - Summarising
Now we will move onto first summarising some data from a lower level to then then join it up with data at a higher level.
So as part of our data analysis plan we need a total farm area which is the sum of the individual plot areas for each individual farmer. We then want to merge this with our farmer data set.
Exercise 3b: Using group_by
and summarise
, create a new data set which is a summary of the plot level data. Call this plot_data_summary
. Create summary variables for the total area of a farmer's plots using sum
and then also create a variable which counts the number of plots a farmer owns.
plot_data_summary <- plot_data_full%>%
plot_data_summary <- plot_data_full%>%
group_by(farmer_id)%>%
summarise(total_area = sum(area, na.rm = TRUE),
nplots = n())
plot_data_summary
Exercise 3c - Joining data
Join the plot_data_summary data set onto the farmer_data_full dataset. Now we actually have a few farmers who we have no plot data available. We have made the decision for analysis that we only want farmers who own at least one plot.
Exercise 3c: Using a type of join that will only keep the farmers who have plot data that they can be matched to, join plot_data_summary onto farmer_data_full.
HINT Remember that you can use c()
to match variables where they have different names in the two datasets
farmer_data_full <-
farmer_data_full <- inner_join(
farmer_data_full, plot_data_summary, by = c("id" = "farmer_id")
)