Exercises
Remember, when completing these exercises, that nobody can recall every single piece of R code off the top of their head! If you get stuck, look back over the notes for similar examples and see whether you can copy and then modify that code to meet the exercise objectives.
All of the exercises use the BeanSurvey data we have been working with so far. You can see a description of the data in the appendix to remind yourself.
You can review the tutorials at any time: Part 1, Part 2.
Exercise 1
Exercise
Exercise 1. Retrieve the households that grow banana
Solution
Exercise 1. Retrieve the households that grow banana
The exercise asks us to keep only the households that grow bananas. Therefore, we need to use the filter function; as you'll remember, this function keeps only the rows that meet a certain condition.
Remember that we have to use the double equals ==, as this asks R "is x equal to y?", whereas a single equals sign tells R "X IS EQUAL TO Y!". In a filter we always use the double equals ==.
filter(BeanSurvey, BANANA=="Yes")
We could also have used a pipe:
BeanSurvey %>%
filter(BANANA=="Yes")
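To see what the double equals does, here is a minimal sketch on a small made-up data frame. The name toy_survey and its four rows are invented for illustration and are not part of BeanSurvey. The comparison produces one TRUE or FALSE per row, and filter keeps the rows where the answer is TRUE.

```r
library(dplyr)

# Hypothetical stand-in for BeanSurvey; these four rows are invented
toy_survey <- data.frame(
  ID     = 1:4,
  BANANA = c("Yes", "No", "Yes", "No")
)

# The double equals asks a question of every row: "is BANANA equal to Yes?"
grows_banana <- toy_survey$BANANA == "Yes"
grows_banana

# filter() keeps only the rows where the answer is TRUE
banana_households <- filter(toy_survey, BANANA == "Yes")
banana_households
```

Only households 1 and 3 remain, because those are the rows where the comparison is TRUE.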
Exercise 2
Exercise
Exercise 2. Identify and correct the four mistakes that I made in the command below, to obtain the median land area of the farms of all the households in the BeanSurvey dataset whose head is a farmer
BeanSurvey %>%
filter(BeanSurvey, OCCUHH="Farmer") %>%
sumarise(median_landArea = median(LANDAREA)
Solution
Exercise 2. Identify and correct the four mistakes that I made in the command below, to obtain the median land area of the farms of all the households in the BeanSurvey dataset whose head is a farmer
EXERCISE:
BeanSurvey %>%
filter(BeanSurvey, OCCUHH="Farmer") %>%
sumarise(median_landArea = median(LANDAREA)
We have a few mistakes to correct here so let's go one by one.
In our filter we have supplied BeanSurvey as the data argument, but we have already started our code chunk by piping the data into filter, so we have told R twice what our dataset is. Remember that when piping, the first (data) argument is filled implicitly by the result of the previous step. So either BeanSurvey should be removed from filter, or the first line should be deleted.
As already mentioned in the previous exercise, the condition of the filter function needs a double equals, not a single equals.
The function summarise was misspelled. While R will recognise both the American summarize and the British summarise spellings, the rest of the function name still needs to be spelt correctly.
Finally, the parenthesis of the function summarise was not closed. This is often the easiest mistake to make.
BeanSurvey %>%
filter(OCCUHH=="Farmer") %>%
summarize(median_landArea = median(LANDAREA))
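The point about the pipe filling in the first argument can be checked directly. The sketch below uses a made-up data frame (toy_survey and its values are invented, not taken from BeanSurvey) and compares the piped version with the same commands written without a pipe.

```r
library(dplyr)

# Hypothetical stand-in for BeanSurvey; values are invented
toy_survey <- data.frame(
  OCCUHH   = c("Farmer", "Teacher", "Farmer", "Farmer"),
  LANDAREA = c(1, 2, 3, 5)
)

# Piped version: the data is passed along implicitly at each step
piped <- toy_survey %>%
  filter(OCCUHH == "Farmer") %>%
  summarise(median_landArea = median(LANDAREA))

# Nested version: the data is written out as the first argument each time
nested <- summarise(
  filter(toy_survey, OCCUHH == "Farmer"),
  median_landArea = median(LANDAREA)
)

# Compare the two results
all.equal(piped, nested)
```

Both versions give a median of 3, the middle value of the three farmers' land areas (1, 3 and 5).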
Exercise 3
Exercise
Exercise 3. Which are the 4 households that planted the largest quantity of beans during the short rain season?
Solution
Exercise 3. Which are the 4 households that planted the largest quantity of beans during the short rain season?
Approaching this question, we first need to think about how to work out which households planted the most beans. It makes sense to sort the data on this variable, so that the rows are ordered by how much each household planted. Therefore we need to use the arrange function.
BeanSurvey %>%
arrange(BEANSPLANTED_SR)
However, you may remember that arrange sorts in ascending order by default, so the households with the lowest numbers would be at the top. We can correct this by wrapping the variable in the desc function.
BeanSurvey %>%
arrange(desc(BEANSPLANTED_SR))
But we still have all the rows. While we can see which are the top 4, let's go one step further and keep only these first 4. We can do this using slice, which keeps whichever row numbers we supply, in this case 1:4.
BeanSurvey %>%
arrange(desc(BEANSPLANTED_SR)) %>%
slice(1:4)
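The three steps can be tried on a small made-up data frame (toy_survey and its values are invented for illustration). The sketch keeps the top 2 rather than the top 4 because the toy data only has five rows. It also includes a missing value to show that arrange places missing values at the end, even when desc is used. The last part uses slice_max, an alternative available in dplyr 1.0.0 and later, which the tutorials may not have covered.

```r
library(dplyr)

# Hypothetical stand-in for BeanSurvey; values are invented
toy_survey <- data.frame(
  ID              = 1:5,
  BEANSPLANTED_SR = c(5, NA, 20, 1, 12)
)

# Sort from largest to smallest, then keep the first two rows
top_two <- toy_survey %>%
  arrange(desc(BEANSPLANTED_SR)) %>%
  slice(1:2)
top_two

# Alternative: slice_max() picks the rows with the largest values directly.
# with_ties = FALSE asks for exactly n rows even if values are tied.
top_two_alt <- toy_survey %>%
  slice_max(BEANSPLANTED_SR, n = 2, with_ties = FALSE)
top_two_alt
```

Both approaches pick households 3 and 5, which planted 20 and 12.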
Exercise 4
Exercise
Exercise 4. What is the mean, median and standard deviation of the yield of beans per acre that households harvested during the long rain season in each village?
Exercise 4b. How would you produce the same summary statistics, but by village AND by gender of the head of household rather than by village only?
Solutions
Exercise 4. What is the mean, median and standard deviation of the yield of beans per acre that households harvested during the long rain season in each village?
The yield of beans per acre is not a column that we have in our dataset, so we need to create it. Remember that mutate is the function we use for that. So we start by creating a new column, which we'll call yield_per_acre, and set it equal to our harvested quantity (BEANSHARVESTED_LR) divided by our farm area (LANDAREA).
BeanSurvey %>%
mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA)
We then want summaries for each village, so we first use group_by to make sure our results are calculated within each village. We pipe this into summarise and specify each of the summary statistics mentioned in the question. They have straightforward function names: mean, median and sd (standard deviation).
The argument na.rm = TRUE is required because there are missing values in the variable BEANSHARVESTED_LR, and therefore also in yield_per_acre.
BeanSurvey %>%
mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA) %>%
group_by(VILLAGE) %>%
summarise(mean= mean(yield_per_acre, na.rm=TRUE),
median=median(yield_per_acre, na.rm=TRUE),
standard_deviation=sd(yield_per_acre, na.rm=TRUE))
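To see why na.rm matters, here is a minimal sketch using three made-up yield values (invented for illustration, not taken from BeanSurvey), one of which is missing.

```r
# Hypothetical yields for three households; the second one is missing
yields <- c(10, NA, 30)

# Without na.rm, a single missing value makes the whole result missing
mean_with_na <- mean(yields)
mean_with_na

# With na.rm = TRUE, the missing value is dropped before calculating
mean_without_na   <- mean(yields, na.rm = TRUE)
median_without_na <- median(yields, na.rm = TRUE)
sd_without_na     <- sd(yields, na.rm = TRUE)
mean_without_na
```

The first mean is NA, while the second is 20, the mean of the two values that are present.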
Another way to address the missing values would be to remove the households with missing values in the first place, using filter and is.na, for example at the very beginning:
BeanSurvey %>%
filter(is.na(BEANSHARVESTED_LR)==FALSE) %>%
mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA) %>%
group_by(VILLAGE) %>%
summarise(mean= mean(yield_per_acre),
median=median(yield_per_acre),
standard_deviation=sd(yield_per_acre))
Exercise 4b. How would you produce the same summary statistics, but by village AND by gender of the head of household rather than by village only?
We can start by copying the code from Exercise 4 and adding the column GENDERHH to the group_by function.
BeanSurvey %>%
filter(is.na(BEANSHARVESTED_LR)==FALSE) %>%
mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA) %>%
group_by(VILLAGE, GENDERHH) %>%
summarise(mean= mean(yield_per_acre),
median=median(yield_per_acre),
standard_deviation=sd(yield_per_acre))
However, we have 5 rows when we expected just 4.
It seems that there is one row with a missing value for GENDERHH. It is probably better to remove this from the analysis too, so that we have a neater output.
BeanSurvey %>%
filter(is.na(BEANSHARVESTED_LR)==FALSE & is.na(GENDERHH)==FALSE) %>%
mutate(yield_per_acre = BEANSHARVESTED_LR/LANDAREA) %>%
group_by(VILLAGE, GENDERHH) %>%
summarise(mean= mean(yield_per_acre),
median=median(yield_per_acre),
standard_deviation=sd(yield_per_acre))
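The extra row appears because group_by treats a missing value as a group of its own. A minimal sketch on made-up data shows the same effect; the village names, gender labels and yields below are all invented. The .groups = "drop" argument (available in dplyr 1.0.0 and later) is added here only to return an ungrouped result without a message.

```r
library(dplyr)

# Hypothetical stand-in for BeanSurvey; all values are invented
toy_survey <- data.frame(
  VILLAGE        = c("A", "A", "B", "B", "B"),
  GENDERHH       = c("male", "female", "male", "female", NA),
  yield_per_acre = c(10, 20, 30, 40, 50)
)

# The missing GENDERHH forms its own group, giving 5 rows instead of 4
with_missing <- toy_survey %>%
  group_by(VILLAGE, GENDERHH) %>%
  summarise(mean = mean(yield_per_acre), .groups = "drop")
nrow(with_missing)

# Filtering out the missing value first leaves the 4 expected groups
without_missing <- toy_survey %>%
  filter(is.na(GENDERHH) == FALSE) %>%
  group_by(VILLAGE, GENDERHH) %>%
  summarise(mean = mean(yield_per_acre), .groups = "drop")
nrow(without_missing)
```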
Exercise 5
Exercise
Exercise 5. Generate a scatterplot showing, for each household that has planted beans, the total quantity of beans planted against the land area of their farm. Colour the points by gender of the head of household
Solution
Exercise 5. Generate a scatterplot showing, for each household that has planted beans, the total quantity of beans planted against the land area of their farm. Colour the points by gender of the head of household
We create a variable for the total quantity of beans planted using mutate and then pipe into ggplot. Remember that once you switch to ggplot you need to use + to add layers, not the %>% pipe. If you do use the pipe by mistake, R will generate a helpful error message.
BeanSurvey %>%
mutate(total_beans_planted = BEANSPLANTED_LR + BEANSPLANTED_SR)%>%
ggplot(aes(x=LANDAREA, y=total_beans_planted, colour=GENDERHH))+
geom_point()
But of course we have that missing value in GENDERHH. It has been added to our colour scheme as a grey dot and is included in the key.
We can decide whether this is an issue or not. Here I decide to remove it, so I add a filter to get rid of the missing value.
BeanSurvey %>%
mutate(total_beans_planted = BEANSPLANTED_LR + BEANSPLANTED_SR)%>%
filter(is.na(GENDERHH)==FALSE) %>%
ggplot(aes(x=LANDAREA, y=total_beans_planted, colour=GENDERHH))+
geom_point()
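One thing worth checking when adding two columns is how missing values behave: if either season's value is missing, the total is missing too. The sketch below shows this on a made-up data frame (toy_survey and its values are invented; whether BeanSurvey has missing values in these two columns is not assumed here). It also repeats the pattern of piping into ggplot and then switching to +.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for BeanSurvey; values are invented
toy_survey <- data.frame(
  LANDAREA        = c(1, 2, 3),
  BEANSPLANTED_LR = c(4, NA, 6),
  BEANSPLANTED_SR = c(1, 2, 3),
  GENDERHH        = c("male", "female", "female")
)

# Adding a number to a missing value gives a missing value
plot_data <- toy_survey %>%
  mutate(total_beans_planted = BEANSPLANTED_LR + BEANSPLANTED_SR)
plot_data$total_beans_planted

# %>% carries the data into ggplot(); from there on, layers are added with +
bean_plot <- plot_data %>%
  ggplot(aes(x = LANDAREA, y = total_beans_planted, colour = GENDERHH)) +
  geom_point()
```

The second household ends up with a missing total, and when the plot is drawn geom_point leaves that household out with a warning.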
Exercise 6
Exercise
Exercise 6. Generate a boxplot of the quantity of beans harvested during the long rain season by type of household composition, keeping only the two main household composition types. Apply an appropriate 'scale' transformation to the quantity of beans harvested
Solution
Exercise 6. Generate a boxplot of the quantity of beans harvested during the long rain season by type of household composition, keeping only the two main household composition types. Apply an appropriate 'scale' transformation to the quantity of beans harvested
This is a follow-up to a question from the ggplot exercises you may have previously worked on.
We first simply use filter to keep only the households whose composition type is either "Female headed, no husband" or "Male headed one wife". Remember we can use the vertical line | to mean OR in a filter.
We then pipe this into our ggplot, using both geom_boxplot and geom_point to show all our data.
BeanSurvey %>%
filter(HHTYPE=="Female headed, no husband" | HHTYPE=="Male headed one wife") %>%
ggplot(aes(y=BEANSHARVESTED_LR, x=HHTYPE))+
geom_boxplot()+
geom_point()
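When the list of values to keep gets longer, repeating the column name with | becomes tedious. A possible alternative, which the tutorials may not have covered, is the %in% operator. The sketch below compares the two approaches on a made-up data frame; toy_survey, its rows and the "Other" category are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for BeanSurvey; the rows and "Other" are invented
toy_survey <- data.frame(
  ID     = 1:4,
  HHTYPE = c("Female headed, no husband", "Male headed one wife",
             "Other", "Male headed one wife")
)

# Version with OR, as in the solution above
with_or <- toy_survey %>%
  filter(HHTYPE == "Female headed, no husband" | HHTYPE == "Male headed one wife")

# Version with %in%: keep rows whose HHTYPE appears in the vector
main_types <- c("Female headed, no husband", "Male headed one wife")
with_in <- toy_survey %>%
  filter(HHTYPE %in% main_types)

# Compare the households kept by each version
with_or$ID
with_in$ID
```

Both versions keep households 1, 2 and 4.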
Finally, the question asks us to apply a transformation to our y-axis scale. Therefore we need to use one of the scale_ functions. In this case we want to edit the y axis, which shows a continuous variable, so we need scale_y_continuous.
If you search the web or open the help page for scale_y_continuous, you will see that one of the arguments is trans =. This accepts a number of built-in transformations, including several log transformations, reverse, square root, etc. Feel free to experiment to see which ones work well and which don't. (From ggplot2 3.5.0 onwards this argument is named transform =, and trans = is deprecated.)
In our example, "pseudo_log" seems like a good option: it gives more space to the values at the lower end of the scale, so the outliers right at the top take up less room in the plotting area.
BeanSurvey %>%
filter(HHTYPE=="Female headed, no husband" | HHTYPE=="Male headed one wife") %>%
ggplot(aes(y=BEANSHARVESTED_LR, x=HHTYPE))+
geom_boxplot()+
geom_point()+
scale_y_continuous(trans="pseudo_log")
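One reason a pseudo-log scale can be more convenient than a plain log scale is how it treats zero. As far as I understand it, with its default settings the "pseudo_log" transformation from the scales package computes asinh(x / 2); treat that formula as an assumption and check the help page for pseudo_log_trans for the exact definition. The sketch below uses made-up harvest values and only base R.

```r
# Hypothetical harvest quantities, including a household that harvested nothing
harvest <- c(0, 1, 10, 100, 1000)

# A plain log transformation cannot place zero on the axis
plain_log <- log10(harvest)
plain_log

# Assumed form of the default pseudo-log: finite at zero,
# and close to a log scale for large values
pseudo_log <- asinh(harvest / 2)
pseudo_log
```

The plain log of zero is -Inf, so a household with no harvest could not be drawn on a log axis, whereas the pseudo-log value at zero is 0.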
Appendix: 'BeanSurvey' dataset
The data we are using in this session is an extract of a survey conducted in Uganda from farmers identified as growing beans.
The dataset contains an extract of 50 responses to 23 of the survey questions, and has been imported to R as a data frame called BeanSurvey.
A summary of the columns in the dataset is below.
Column | Description |
---|---|
ID | Farmer ID |
VILLAGE | Village name |
HHTYPE | Household composition |
GENDERHH | Gender of Household Head |
AGEHH | Age of Household Head |
OCCUHH | Occupation of Household Head |
ADULTS | Number of Adults within the household |
CHILDREN | Number of Children (<18) within the household |
MATOKE | Do they grow matoke? |
MAIZE | Do they grow maize? |
BEANS | Do they grow beans? |
BANANA | Do they grow banana? |
CASSAVA | Do they grow cassava? |
COFFEE | Do they grow coffee? |
LANDAREA | Land area of farm (acres) |
LABOR | Labor usage |
INTERCROP | Intercrops with beans |
DECISIONS | Household decision responsibility |
SELLBEANS | Do they grow beans for sale? |
BEANSPLANTED_LR | Quantity of beans planted in long rain season |
BEANSPLANTED_SR | Quantity of beans planted in short rain season |
BEANSHARVESTED_LR | Quantity of beans harvested in long rain season |
BEANSHARVESTED_SR | Quantity of beans harvested in short rain season |
Spend some time exploring the full dataset to familiarise yourself with the columns and the type of data stored within each column. You may need to refer back to this data at times during this tutorial. Remember that R is case sensitive, so you will always have to refer to the variables in this dataset exactly as they are written in the data. There is a column in this data called "GENDERHH", but there is no column called "GenderHH".
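Case sensitivity is easy to check for yourself. The sketch below uses a tiny made-up data frame (toy_survey and its two values are invented) with a single GENDERHH column.

```r
# Hypothetical stand-in for BeanSurvey with one column; values are invented
toy_survey <- data.frame(GENDERHH = c("male", "female"))

# The exact name is found among the column names
has_upper <- "GENDERHH" %in% names(toy_survey)
has_upper

# A differently capitalised name is not
has_mixed <- "GenderHH" %in% names(toy_survey)
has_mixed

# Asking for a column that does not exist with $ returns NULL
missing_column <- toy_survey$GenderHH
is.null(missing_column)
```

The same check works on the real data: names(BeanSurvey) lists the column names exactly as R expects them to be typed.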