Data Management - Exercises

Overview

This set of exercises helps to test your sills from each of our three tutorials from our data management module.

You can review the tutorials for assistance at any time here

Part 1 - Data Cleaning

Dataset

To start with you will work on the following data set which contains a number of variables for a list of 100 farmers.

farmer_data

Exercise 1a - Trim Whitespace

Let's start by checking for any whitespace, i suspect there may be some in our name column.

Let's perform a check by using the unique function to look at all the unique values for name

unique(farmer_data$name)

##  [1] "Treven"       "Brandon"      "Kevin"        "Erik"         "Trusten"     
##  [6] "Saqr"         "Bradford"     "Jack"         " Codie"       "Frank"       
## [11] "Tyler "       "Joseph"       "Royse"        "Reece"        "Michael"     
## [16] "Paul"         "Nathan"       "Isaiah"       "Taylor"       "Matthew"     
## [21] "Glen"         "Marzooq"      "Richard"      "Justin"       "Jonah"       
## [26] "Zaahir"       "Kabeer "      "Lawrence"     "Hamad"        "Waseem"      
## [31] "Sonim"        "Hamza"        "Zainuddeen"   "Martin"       " Ryan"       
## [36] "Orran"        "Geoffrey"     "Zachery"      "Lucas"        "Cass"        
## [41] "Luis"         "Donald"       " Hunter"      "Devon"        "Anthony"     
## [46] "Carmen"       "Kristin"      "Briana"       "Terra"        " Priscilla " 
## [51] "Danielle"     "Vanessa "     "Patricia"     "Kiana"        "Tammy"       
## [56] "Astrid"       "Savanna"      "Waleeda"      "Marilu"       "Samantha"    
## [61] "Amber"        "Manaara"      "Sabeeka"      "Shamika"      "Heaven"      
## [66] "Tamanna"      "Magan"        "Lasha"        "Sri"          "Cheyenne"    
## [71] "Yesenia"      "Autum "       "Aleksandrina" "Sonia"        "Nicolette"   
## [76] "Faseeha"      "Alexis"       "Ceara"        "Danica"       "Kendrer"     
## [81] "Antoinette"   "Kayleigh"     "Micaela"      "Amari"        " Samantha"   
## [86] "Olivia"       "Breanna"      "Geovana"      "Najma"        "Selena"      
## [91] "Aanisa"       "Charmaine"

Seems our suspicions were correct, we have around 10 values for name that have either a leading or trailing space in them.

Exercise 1a: Using mutate and trimws, remove the whitespace present in the name column from both the beginning and the ending of the variable entries

REMEMBER You can inspect the results of your changes at any time by writing the name of the data frame into the code chunk at any time

farmer_data <- farmer_data %>%
  mutate()

unique(farmer_data$name)

farmer_data <- farmer_data %>%
  mutate(name = trimws(name, which = "both"))

unique(farmer_data$name)

Exercise 1b - Missing Data

Recall that in R we need to be very explicit when it comes to defining data. It MUST be coded as NA

In our dataset, we have unfortunately not told R explicitly what our missing data codes were when we imported the data. Therefore, there are a few variables which have some values equalling -9. However this was a missing data code, but R will see this value as literally -9.

Let's check first which variables this issue applies to.

Exercise 1bi: Use summary to check which continuous variables in the dataset are still using the missing data code -9 rather than being explicitly NA. Note that -9 will be the minimum value in each of these cases.

summary(farmer_data)

farmer_data

# Should have found that at least 1 value in each of hh_size, income and costs were equal to -9

You should have identified a few variables unfortunately using this code.

Exercise 1bii: Using mutate and na_if, correct the variables that are still using -9 so that these are instead coded as NA

farmer_data <- farmer_data %>%
  mutate()

summary(farmer_data)

farmer_data <- farmer_data %>%
  mutate(hh_size = na_if(hh_size, -9),
         income = na_if(income, -9),
         costs = na_if(costs, -9))

summary(farmer_data)

#This would only be useful if -9 will ALWAYS be a missing code and never a true value of a variable

#Recommend looking into mutate_at or mutate_if to see how you can make the same edit across multiple but not all columns

Exercise 1c - Correcting labels

Now something else looked off about our sex variable. Let's use unique to have a look at all of the unique values of sex

unique(farmer_data$sex)

## [1] "1"      "Male"   "m"      "M"      "f"      "Female" "2"      "F"

Hmm, seems we have multiple values to mean the same thing unfortunately. Male and Female have both been coded four different ways each.

Exercise 1c: Using mutate and case_when, standardise the sex variable such that

Male, m, M & "1" are all recoded to equal "Male"

Female, f, F & "2" are all recoded to equal "Female"

Hint Think back to how we used the %in% operator

farmer_data <- farmer_data %>%
  mutate()

farmer_data <- farmer_data %>%
  mutate(sex = case_when(
    sex %in% c("Male", "m", "M", "1") ~ "Male",
    sex %in% c("Female", "f", "F", "2") ~ "Female"
  ))

table(farmer_data$sex)

Exercise 1d - Correcting outlier values

Let's use a histogram to inspect one of our continuous variables for possible data entry errors

hist(farmer_data$age)

Looks like we have one implausibly high outlying value for age.

Upon contacting the enumerator who collected data for this farmer, you find the true value was meant to be 18 years It was just inputted wrongly.

Exercise 1d: Using any method demonstrated in the workbook, correct this implausibly high value of 180 years old to the correct value of 18 years old.

???

hist(farmer_data$age)

farmer_data$age[farmer_data$age == 180] <- 19

#OR

farmer_data <- farmer_data%>%
  mutate(age = ifelse(age == 180, 18, age))

hist(farmer_data$age)

Exercise 1e - Creating a new variable

As part of the data analysis plan you notice that part of the analysis will include an evaluation of the gross income of farmers, a calculation of income minus costs. However, you find this was not calculated directly as part of the data collection.

Exercise 1e: Using mutate create a new variable called gross_income which is the income minus the costs.

farmer_data <- farmer_data%>%
  mutate(gross_income = income-costs)

farmer_data

Exercise 1f - Sorting data

Finally, you may have noticed that there is currently no particular order to our data. While not mandatory, sorting data can be useful for presentation, readability and help users easily find certain pieces of information.

We have a natural sorting variable in our unique identifier "id"

Exercise 1f: Using arrange, sort the data using the id variable so that the rows are in order from 1 to 100 by id.

farmer_data <- farmer_data%>%
  arrange(id)

farmer_data

Part 2 - Data Reshaping

Datasets

Before we test your skills in pivoting data, let’s look at some additional data we have for this project. It is at the level of the plot while our previous dataset was at the level of the farmer.

Each farmer can have up to 8 plots, they are labelled id_1, id_2 and so on. You will notice that many of the values in these columns are NA as the farmers do not all have this many plots.`

plot_data

Secondly, you have been provided by an outside source with some data about the villages that have been used in this project. However, it has unfortunately been passed along in long format with individual rows for the villages' population, sampling weights and whether or not they received support from an NGO.

village_data

Exercise 2a - Pivot longer

Firstly, let's get our plot data into a more useful shape. When conducting analysis it will be much simpler to keep comparable information into the same column. i.e keep all values for the area of plot in just the single column.

Exercise 2a: Using pivot_longer reshape the wide plot data into a new long data set, save this output as plot_data_long and call the new variable, that will list the plot number within a farmer, "plot_num".

Please include the values_drop_na = TRUE argument into your answer. This will stop unnecessary rows being created for where all variables will be NA. In other words, all farmers will only have as many rows as they will plots.

Hint: If you are stuck at all, look back at the session 2 workbook and how we utilised the special string of ".value".

plot_data_long <-plot_data%>%
  pivot_longer(
    values_drop_na = TRUE
  )

plot_data_long <-plot_data%>%
  pivot_longer(
    cols = id_1:fertilisers_8,
    names_to = c(".value", "plot_num"),
    names_sep = "_",
    values_drop_na = TRUE
  )

plot_data_long

Exercise 2b - Pivot wider

Next let's do the opposite to that village data.

Exercise 2b: Using pivot_wider, transform the data into wide format such that you only have one row per village. Save this to a new object titled village_data_wide

village_data_wide <- village_data%>%
  pivot_wider()

village_data_wide <- village_data%>%
  pivot_wider(
    names_from = information,
    values_from = value
  )

village_data_wide

Part 3 - Data Merging

Datasets

For the first exercise of this final section, we are going to bind our previous data sets with some additional observations for our farmers and plots. 20 new farmers with 45 plots between them.

farmer_data_additional

plot_data_additional

Exercise 3a - Binding

As the datasets for our additional data sets which use the same formats that we have developed from our tidied up data. We can move straight to appending them onto our existing data frames.

Exercise 3a: Using bind_rows, append on the new observations to farmer_data and plot_data_long.

NOTE: The cleaned version of the farmer data has been saved to a new object called farmer_data_clean. Please use this dataset instead of farmer_data

Farmer Data

farmer_data_full <- ???
  
farmer_data_full

farmer_data_full <- bind_rows(farmer_data_clean, farmer_data_additional)

farmer_data_full

Plot Data

plot_data_full <- ???
  
plot_data_full

plot_data_full <- bind_rows(plot_data_long, plot_data_additional)

plot_data_full

Exercise 3b - Summarising

Now we will move onto first summarising some data from a lower level to then then join it up with data at a higher level.

So as part of our data analysis plan we need a total farm area which is the sum of the individual plot areas for each individual farmer. We then want to merge this with our farmer data set.

Exercise 3b: Using group_by and summarise, create a new data set which is a summary of the plot level data. Call this plot_data_summary. Create summary variables for the total area of a farmer's plots using sum and then also create a variable which counts the number of plots a farmer owns.

plot_data_summary <- plot_data_full%>%

plot_data_summary <- plot_data_full%>%
  group_by(farmer_id)%>%
  summarise(total_area = sum(area, na.rm = TRUE),
            nplots = n())

plot_data_summary

Exercise 3c - Joining data

Join the plot_data_summary data set onto the farmer_data_full dataset. Now we actually have a few farmers who we have no plot data available. We have made the decision for analysis that we only want farmers who own at least one plot.

Exercise 3c: Using a type of join that will only keep the farmers who have plot data that they can be matched to, join plot_data_summary onto farmer_data_full.

HINT Remember that you can use c() to match variables where they have different names in the two datasets

farmer_data_full <-

farmer_data_full <- inner_join(
  farmer_data_full, plot_data_summary, by = c("id" = "farmer_id")
)