Hypothesis Testing: Exercises
In these exercises we are going to conduct some statistical hypothesis tests in R, using a new dataset. We have looked enough at the BeanSurvey data for now!
You can review the first tutorial at any time.
For this set of exercises you will also need the skills from our data manipulation and visualisation courses.
We have explored a number of questions graphically, and with summary statistics, and now we can conduct hypothesis tests to tell us if we are able to conclude that there are statistically significant differences.
Once done here you can move onto the next module!
Data
This data comes from a much larger survey of Kenyan farmers, which looked at whether they left their fields fallow in any of the three previous years. It is stored in a dataset called `fallowsurvey`. The variables included are:
| Column | Description |
|---|---|
| ID | An ID for the farmer |
| Fallow_yr1 | Whether the farmer left their plot fallow in the past season |
| Fallow_yr2 | Whether the farmer left their plot fallow two seasons previously |
| Fallow_yr3 | Whether the farmer left their plot fallow three seasons previously |
| wealth | An index of wealth based on what assets the farmer owns; larger scores mean more wealth |
| hheducation | Education level of the head of household |
| adults | Number of adults in the household |
| ethnicity | Ethnicity of the head of household |
| hhgender | Gender of the head of household |
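Before starting, it can help to check how each column is stored. The following is a minimal sketch on a small made-up data frame that mimics the column layout described above. The object name `toy_survey`, all of the values, and the category labels used are invented for illustration and are not taken from the survey.

```r
# Hypothetical stand-in for fallowsurvey: same column names, invented values
toy_survey <- data.frame(
  ID          = 1:6,
  Fallow_yr1  = c("Yes", "No", "No", "No", "Yes", "No"),
  Fallow_yr2  = c("No", "No", "Yes", "No", "No", "No"),
  Fallow_yr3  = c("No", "No", "No", "No", "Yes", "No"),
  wealth      = c(2.1, 0.4, 1.8, -0.3, 3.6, 1.2),
  hheducation = c("primary", "none", "secondary", "primary", "other", "secondary"),
  adults      = c(2, 3, 2, 4, 2, 5),
  ethnicity   = c("Luo", "other", "Luo", "other", "other", "Luo"),
  hhgender    = c("male", "female", "male", "male", "female", "male")
)

str(toy_survey)              # how each column is stored
summary(toy_survey$wealth)   # range of the wealth index
table(toy_survey$ethnicity)  # number of households per ethnicity
```

Running the same three calls on `fallowsurvey` itself should give a quick picture of the real data.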
Exercise 1
Question 1.1
Firstly we are going to consider whether there is a relationship between the farmer's wealth status and the ethnicity of the head of household.
Question 1.1: Produce summary statistics showing the mean and standard deviation of wealth for each of the two ethnicities, using `group_by` and `summarise`
Solution
If you have gone through our modules on data manipulation, you should be able to use the skills you developed there.
First `group_by` the ethnicity of the household head (`ethnicity`), then use `summarise` to get the mean and the standard deviation.
fallowsurvey %>%
  group_by(ethnicity) %>%
  summarise(mean = mean(wealth),
            standard_deviation = sd(wealth))
It appears that households whose head is of Luo ethnicity have a lower mean wealth than households of other ethnic groups. The standard deviations are fairly similar, but slightly lower for Luo households.
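One thing to watch for with these summaries is that `mean` and `sd` return `NA` if the column contains any missing values. Whether `wealth` has missing values in `fallowsurvey` is not stated here, so the sketch below uses a made-up vector (`toy_wealth`, a hypothetical name) purely to show the behaviour and the `na.rm = TRUE` argument that deals with it.

```r
# Invented wealth scores with one missing value (not survey data)
toy_wealth <- c(1.5, 2.0, NA, 2.5)

mean(toy_wealth)                # NA, because of the missing value
mean(toy_wealth, na.rm = TRUE)  # 2
sd(toy_wealth, na.rm = TRUE)    # 0.5
```

Inside `summarise` the same argument would be written as `mean(wealth, na.rm = TRUE)`.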
Question 1.2
Question 1.2: Produce boxplots showing the distribution of wealth by ethnicity
Solution
If you have followed our data visualisation tutorials you should be familiar with this code structure.
First supply `ggplot` with our data (`fallowsurvey`). As we are making a boxplot, we set the x axis to our categorical variable (`ethnicity`) and put our continuous variable (`wealth`) on the y axis. We then use the `geom_boxplot` function to create the plot.
ggplot(fallowsurvey, aes(y = wealth, x = ethnicity)) +
  geom_boxplot()
Here we can see the gap between the two groups a little more clearly. Note that the line inside each box is the median rather than the mean.
Question 1.3
Question 1.3: Carry out a t-test to investigate if there is a significant difference between wealth scores of the two ethnicities. Interpret the results
Solution
For a basic t-test, remember that we do not start with the data argument as we do with tidyverse functions like `ggplot`. Instead it comes at the end. So we start with our dependent variable, `wealth`. We then use a tilde (`~`) to denote a formula; this is a very common notation in R that you will come across again when moving on to linear modelling. We then supply our independent variable, `ethnicity`.
t.test(wealth ~ ethnicity, data = fallowsurvey)
We have a p-value of 0.000034, which is very small and far below the usual significance threshold of 0.05. Therefore, we can conclude that there is strong evidence of a difference in mean wealth between Luo and other ethnicities.
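It can be useful to store the test result and pull out individual pieces rather than reading them off the printout. The sketch below does this on simulated data; the data frame `toy` and its values are invented for illustration and will not reproduce the survey's p-value.

```r
# Invented data: two groups of wealth scores (not the survey values)
set.seed(1)
toy <- data.frame(
  wealth    = c(rnorm(30, mean = 1.8), rnorm(30, mean = 2.4)),
  ethnicity = rep(c("Luo", "other"), each = 30)
)

result <- t.test(wealth ~ ethnicity, data = toy)

result$method    # "Welch Two Sample t-test": unequal variances is the default
result$p.value   # the p-value as a plain number
result$estimate  # the two group means being compared
```

By default `t.test` does not assume equal variances in the two groups; setting `var.equal = TRUE` would give the classic pooled-variance version instead.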
Exercise 2
Question 2.1
Now let us consider the relationship between ethnicity and whether the plot had been left fallow.
The recommendation for fallow plots is to leave the plot fallow in at least one out of 3 seasons. So we are not so interested in whether the plot was left fallow in the last season, or two seasons ago; we are interested in whether it has been left fallow at all. So we will need to create a new variable first!
Question 2.1: Use mutate to calculate a new variable called "Fallow_Last3" which takes the value "Yes" if the plot has been left fallow in any of the last three years and takes the value "No" if the plot was not left fallow in any of the last three years
Solution
For this question it is recommended to use the `ifelse` function.
This works by providing a condition which R will check, then a value to return if the condition is true, and one for if the condition is false.
We have 3 fallow variables, one for each year. In this question, however, we want to create a new fallow variable which will be equal to "Yes" if the plot has been left fallow in ANY of the last 3 years. Therefore, we need to supply 3 conditions, one for each year, and separate them using `|` to denote OR.
Then we write in "Yes", the value for if any of the conditions are true, and "No", the value for if all 3 are false.
fallowsurvey %>%
  mutate(Fallow_Last3 = ifelse(Fallow_yr1 == "Yes" | Fallow_yr2 == "Yes" | Fallow_yr3 == "Yes", "Yes", "No"))
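To see what the OR condition does on its own, here is a minimal base R sketch using three short made-up vectors in place of the survey columns. The names `yr1`, `yr2`, `yr3` and `any_fallow` are hypothetical.

```r
# Invented fallow records for four plots (not survey data)
yr1 <- c("Yes", "No", "No",  "No")
yr2 <- c("No",  "No", "Yes", "No")
yr3 <- c("No",  "No", "Yes", "No")

# TRUE wherever at least one of the three comparisons is TRUE
any_fallow <- yr1 == "Yes" | yr2 == "Yes" | yr3 == "Yes"
any_fallow
# TRUE FALSE TRUE FALSE

# ifelse then turns each TRUE into "Yes" and each FALSE into "No"
ifelse(any_fallow, "Yes", "No")
# "Yes" "No" "Yes" "No"
```

The comparison is evaluated element by element, which is why a single `ifelse` call can fill in a whole column inside `mutate`.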
Question 2.2
Question 2.2: Piping from the last step, and using the variable you created there, calculate the percentage of farmers within each ethnicity who left their plot fallow in at least one of the last three seasons, using the `tabyl` and `adorn_percentages` functions
Remember you can use the `denominator =` argument to switch between row and column ("col") percentages.
As a reminder, you can copy and paste the solutions from previous exercises to continue adding on from the last answer.
Solution
From here we can now pipe into our functions from `janitor`.
Start with `tabyl`, which will give us the frequencies.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
tabyl(ethnicity,Fallow_Last3)
Then we can add `adorn_percentages` to look at the proportions. By default this gives row percentages, which is the preferred option in our current exercise, as we are most interested in comparing the fallow rates between our ethnicities (our rows).
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
tabyl(ethnicity,Fallow_Last3) %>%
adorn_percentages()
It seems that Luo-headed households are leaving plots fallow at higher rates than other groups, with 45.8% leaving the plot fallow in the last 3 years compared to 35.5%.
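The difference between row and column percentages can be illustrated in base R with `prop.table`. This is only an analogy for the `denominator` choice, not a description of how `janitor` works internally, and the counts in `counts` are invented rather than taken from the survey.

```r
# Invented counts (not the survey results):
# rows are ethnicity, columns are Fallow_Last3
counts <- matrix(c(13, 11,
                   26, 14),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(ethnicity = c("Luo", "other"),
                                 Fallow_Last3 = c("No", "Yes")))

row_props <- prop.table(counts, margin = 1)  # each row sums to 1
col_props <- prop.table(counts, margin = 2)  # each column sums to 1

round(row_props, 3)  # fallow rate within each ethnicity
round(col_props, 3)  # ethnic make-up within each fallow group
```

Row proportions answer "of the households in this ethnic group, what share left a plot fallow?", which is the comparison this exercise asks for.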
Question 2.3
Question 2.3: Conduct a chi-square test to determine if there is a significant relationship between whether the farmer has left their plot fallow and the ethnicity of the head of household
Solution
A very simple mistake you may have made at first is to pipe directly from the previous step into the `chisq.test` function.
But this means the test will be performed on the proportions and not the frequencies. You will not receive an error for this; instead you will receive a result with a p-value of 1.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
tabyl(ethnicity,Fallow_Last3) %>%
adorn_percentages()%>%
chisq.test()
Instead we need to go back one step, remove the `adorn_percentages` line, and then pipe into the test.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
tabyl(ethnicity,Fallow_Last3) %>%
chisq.test()
Again, we have a very small p-value of 0.000099. Therefore, we can reject the null hypothesis that there is no association between our two variables: there is strong evidence of a relationship between rates of fallow and the ethnicity of the household head.
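The reason a test on proportions is uninformative is that the chi-square statistic depends on the sample size: a table of row proportions looks to the test like a sample with a single observation per row. A minimal base R sketch of the contrast, using invented counts (not the survey frequencies) in a plain matrix rather than a `tabyl`:

```r
# Invented counts (not survey data): rows are ethnicity, columns are Fallow_Last3
counts <- matrix(c(65, 55,
                   129, 71),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(ethnicity = c("Luo", "other"),
                                 Fallow_Last3 = c("No", "Yes")))

props <- prop.table(counts, margin = 1)  # row proportions, each row sums to 1

# Test on the frequencies: this is the one we want
p_counts <- chisq.test(counts)$p.value

# Test on the proportions: runs without error, but is meaningless.
# R warns that the approximation may be incorrect, so the warning is silenced here.
p_props <- suppressWarnings(chisq.test(props))$p.value

p_counts
p_props
```

The same pattern of rates gives a far larger p-value once the sample size information has been thrown away.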
Exercise 3
Questions 3.1 / 3.2
Now consider whether there is a relationship between education status of the head of household and whether the plot was left fallow, using the variable you calculated in question 2.1. Produce some summary statistics and then conduct an appropriate hypothesis test.
3.1: First produce the summary statistics
3.2: Now conduct an appropriate hypothesis test
You might notice a problem this time with conducting the chi-square test! Consider what would be an appropriate way of remedying this problem using one of the three options discussed in the workbook: `filter` to remove categories, `mutate` and `recode` to combine categories, or `fisher.test` to use an exact test.
Solutions
3.1: First produce the summary statistics
First let's take a look at our proportions using the same code as question 2.2, but with `hheducation` replacing `ethnicity`.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
tabyl(hheducation,Fallow_Last3) %>%
adorn_percentages()
There is some slight variation, with those who completed secondary education having a higher rate of fallow than those with less education. The exception is the "other" group, with a 50/50 split.
3.2: Now conduct an appropriate hypothesis test
Let's put this into our chi-square test and see what happens. We are given a warning that the approximation may be incorrect.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
tabyl(hheducation,Fallow_Last3) %>%
chisq.test()
If we look at our frequencies rather than proportions, we can see that our "other" category is vastly smaller than the other groups in the data, and this could affect the accuracy of our chi-square test.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
tabyl(hheducation,Fallow_Last3)
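The warning is tied to the expected counts under the null hypothesis, which `chisq.test` stores in the `expected` element of its result; a common rule of thumb is that they should all be at least 5. The sketch below checks this on a made-up table with one very small category. The counts and the "none" and "primary" labels are invented for illustration.

```r
# Invented counts (not survey data): one education category is very small
counts <- matrix(c(40, 30,
                   55, 35,
                   45, 40,
                    2,  2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(hheducation = c("none", "primary", "secondary", "other"),
                                 Fallow_Last3 = c("No", "Yes")))

# Expected counts if education and fallow were unrelated
# (the approximation warning is silenced here, since it is what we are investigating)
expected <- suppressWarnings(chisq.test(counts))$expected

round(expected, 1)
expected < 5   # TRUE marks the cells that are too small
```

In this toy table only the "other" row falls below the threshold, which mirrors the situation described above.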
As mentioned in the hint we have three valid options for dealing with this.
First we could use a filter and remove the "other" group.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
filter(hheducation !="other") %>%
tabyl(hheducation,Fallow_Last3) %>%
chisq.test()
We no longer receive that warning, though our result is fairly similar. We fail to reject the null hypothesis: there is no evidence of an association between education and fallow rates.
Alternatively, we could use `mutate` to merge categories. Personally I think it makes the most sense to merge "other" into secondary education, as "other" likely consists of those with vocational training, apprenticeships or university education. This creates a "secondary or higher" category.
We can use `ifelse` to edit this variable. As a tip, if you want the values of a variable to stay the same when the condition is false, you can supply the variable itself as the final argument. Here it is wrapped in `as.character` in case `hheducation` is stored as a factor, because `ifelse` would otherwise return the factor's integer codes rather than its labels.
fallowsurvey %>%
  mutate(Fallow_Last3 = ifelse(Fallow_yr1 == "Yes" | Fallow_yr2 == "Yes" | Fallow_yr3 == "Yes", "Yes", "No")) %>%
  mutate(hheducation = ifelse(hheducation == "other" | hheducation == "secondary", "secondary_higher", as.character(hheducation))) %>%
  tabyl(hheducation, Fallow_Last3) %>%
  chisq.test()
Again we reach the same non-significant conclusion.
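One caveat when recoding with `ifelse`: if the variable is stored as a factor, returning it unchanged in the "false" branch gives the factor's integer codes rather than its labels. Whether `hheducation` is a factor or a character column in `fallowsurvey` is not stated here, so the sketch below uses a made-up factor (`edu`, with invented levels) to show the behaviour and a simple safeguard.

```r
# Invented education levels stored as a factor (hypothetical; not survey data)
edu <- factor(c("primary", "other", "secondary", "none"))

# Returning the factor itself in the "no" branch yields its integer codes
broken <- ifelse(edu == "other" | edu == "secondary", "secondary_higher", edu)

# Converting to character first keeps the original labels
fixed <- ifelse(edu == "other" | edu == "secondary", "secondary_higher", as.character(edu))

broken
fixed
# "primary" "secondary_higher" "secondary_higher" "none"
```

If the column is already character, `as.character` changes nothing, so it is a harmless precaution.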
Finally, we can keep the data the same but instead use the `fisher.test` function, which performs Fisher's exact test. This is useful when dealing with small frequencies in some groups, because it does not rely on the chi-square approximation.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
tabyl(hheducation,Fallow_Last3) %>%
fisher.test()
Though we reach the same conclusion in this example, this will not always be the case.
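To compare the two tests side by side, both can be run on the same table of counts. The sketch below uses a small invented table (not the survey frequencies) in a plain base R matrix, so the p-values it prints say nothing about the real data.

```r
# Invented counts (not survey data) with one sparse row
counts <- matrix(c(12,  8,
                   15, 10,
                    9, 11,
                    1,  1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(hheducation = c("none", "primary", "secondary", "other"),
                                 Fallow_Last3 = c("No", "Yes")))

# Chi-square test: relies on a large-sample approximation (warning silenced)
chi_p <- suppressWarnings(chisq.test(counts))$p.value

# Fisher's exact test: computes the p-value exactly from the table margins
fisher_p <- fisher.test(counts)$p.value

c(chi_square = chi_p, fisher = fisher_p)
```

With sparse tables the two p-values can differ noticeably, and in borderline cases that difference can change the conclusion.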
Exercise 4
Question 4.1
Finally, let's consider the relationship between wealth and the fallow status of the plot, again using the "Fallow_Last3" variable you created in exercise 2. Produce summary statistics and plots, and then choose an appropriate statistical test to investigate the relationship and interpret the results.
4.1: Produce some summary statistics
Solution
First let's look at some summary statistics, using code similar to what we have already used in these exercises.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
group_by(Fallow_Last3) %>%
summarise(mean(wealth),sd(wealth))
So it would appear that those who practise fallow have a higher wealth status on average: a mean of 2.2 compared to 1.9.
Question 4.2
4.2: Create some summary plots
Solution
Let's make this into a boxplot to visualise this difference and the spread of the data.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
ggplot(aes(y=wealth,x=Fallow_Last3))+
geom_boxplot()
Question 4.3
4.3: Investigate and interpret the results using an appropriate statistical test. Consider if what you have done makes sense
Hint: you may want to think about transforming one of the variables using mutate
Solution
Now let's put this into a t-test. Remember we can pipe into `t.test`, but we need to specify the data argument with `data = .`
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
t.test(wealth~Fallow_Last3,data=.)
Following the t-test we have a p-value of 0.000025. This is a highly significant result, but does it make sense that a farm's wealth would be dependent on its practice of fallow?
We can imagine there being some association, but it might make more sense the other way around: that a farm's fallow practices are dependent on its wealth. A poorer family may not wish to leave a plot fallow, as they may see it as a loss of much-needed income, while a wealthier family can afford the financial hit.
Therefore, let's look for just a general association between wealth and fallow without making an assumption about the direction of dependence. We can do this by turning wealth into a categorical variable.
We can use the `cut` function within a `mutate` to achieve this. I am going to set the breaks so that values from -1 to 1.5 form our low group, 1.5 to 3 the medium group, and anything above 3 (up to 99, although the maximum is 5.25) the high group. I have also supplied some labels for these groups to make the output a bit clearer.
fallowsurvey %>%
mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
mutate(wealth_cat=cut(wealth,breaks=c(-1,1.5,3,99), labels = c("Low", "Medium", "High"))) %>%
tabyl(wealth_cat,Fallow_Last3) %>%
adorn_percentages()
There appears to be an association whereby rates of fallow increase as wealth increases.
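It is worth checking how `cut` treats values that sit exactly on a break. By default each interval is open on the left and closed on the right, so with breaks of -1, 1.5, 3 and 99 a value of exactly 1.5 falls in the low group and a value of exactly -1 is not assigned to any group. A minimal sketch on a made-up vector (`toy_wealth`, hypothetical values chosen to sit on and around the breaks):

```r
# Invented wealth scores chosen to land on and around the breaks (not survey data)
toy_wealth <- c(-1, -0.5, 1.5, 1.6, 3, 5.25)

wealth_cat <- cut(toy_wealth,
                  breaks = c(-1, 1.5, 3, 99),
                  labels = c("Low", "Medium", "High"))

data.frame(toy_wealth, wealth_cat)
# -1 -> NA, -0.5 -> Low, 1.5 -> Low, 1.6 -> Medium, 3 -> Medium, 5.25 -> High
```

If the smallest value in the data could equal the lowest break, adding `include.lowest = TRUE` keeps it in the first group instead of turning it into `NA`.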
Let's feed this into a chi-square test.
fallowsurvey %>%
  mutate(Fallow_Last3 = ifelse(Fallow_yr1 == "Yes" | Fallow_yr2 == "Yes" | Fallow_yr3 == "Yes", "Yes", "No")) %>%
  mutate(wealth_cat = cut(wealth, breaks = c(-1, 1.5, 3, 99), labels = c("Low", "Medium", "High"))) %>%
  tabyl(wealth_cat, Fallow_Last3) %>%
  chisq.test()
We have a very small p-value of 0.00011; therefore, we can reject the null hypothesis and conclude there is strong evidence of an association between wealth and the practice of fallow.