Statistical Analysis in R: Hypothesis Testing

Hypothesis Testing: Exercises

In these exercises we are going to look at conducting some statistical hypothesis tests in R, using a new dataset. We have looked enough at the BeanSurvey data for now!

You can review the first tutorial at any time.

For this set of exercises you will also need to use skills from our data manipulation and visualisation courses.

We have explored a number of questions graphically, and with summary statistics, and now we can conduct hypothesis tests to tell us if we are able to conclude that there are statistically significant differences.

Once done here you can move onto the next module!

Data

This data comes from a much larger survey of Kenyan farmers, looking at whether they kept their fields fallow in any of the three previous years. The dataset is called 'fallowsurvey'. The variables included are:

Column	Description
ID	An ID for the farmer
Fallow_yr1	Whether the farmer left their plot fallow in the past season
Fallow_yr2	Whether the farmer left their plot fallow two seasons previously
Fallow_yr3	Whether the farmer left their plot fallow three seasons previously
wealth	An index of wealth based on what assets the farmer owns - larger scores equal more wealth
hheducation	Education level of the head of household
adults	Number of adults in the household
ethnicity	Ethnicity of the head of household
hhgender	Gender of the head of household

Exercise 1

Question 1.1

Firstly we are going to consider whether there is a relationship between the farmer's wealth status and the ethnicity of the head of household.

Question 1.1: Produce summary statistics showing the mean and standard deviation of wealth for each of the two ethnicities, using group_by and summarise

Solution

If you have gone through our modules on data manipulation, you should be able to use the skills you developed.

First group the data by the ethnicity of the household head (ethnicity) with group_by, then use summarise to get the mean and the standard deviation of wealth.

fallowsurvey %>%
  group_by(ethnicity) %>%
    summarise(mean = mean(wealth),
              standard_deviation = sd(wealth))

It appears that households whose head is of Luo ethnicity have a lower mean wealth than households of other ethnicities. The standard deviations are fairly similar, though slightly lower for Luo households.

Question 1.2

Question 1.2: Produce boxplots showing the distribution of wealth by ethnicity

Solution

If you have followed our data visualisation tutorials you should be familiar with this code structure.

First supply ggplot with our data (fallowsurvey). As we are making a boxplot, we set the x axis to our categorical variable (ethnicity) and put our continuous variable (wealth) on the y axis. We use the geom_boxplot function to create the plot.

ggplot(fallowsurvey,aes(y=wealth,x=ethnicity))+
  geom_boxplot()

Here we can see the gap between the two groups a little more clearly. Note that the line inside each box marks the median rather than the mean.

Question 1.3

Question 1.3: Carry out a t-test to investigate if there is a significant difference between wealth scores of the two ethnicities. Interpret the results

Solution

For a basic t-test, remember that the data argument does not come first, as it does in tidyverse functions such as ggplot. Instead it comes at the end. So we start with our dependent variable, wealth. We then use a tilde (~) to denote a formula; this is a very common notation in R that you will come across again when moving on to linear modelling. We then supply our independent variable, ethnicity.

t.test(wealth~ethnicity,data=fallowsurvey)

We have a p-value of 0.000034, which is very small and far below the usual threshold of 0.05. Therefore, we can conclude that there is strong evidence of a difference in mean wealth between Luo and other ethnicities.
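The output of t.test can also be stored and inspected rather than only read off the console. The sketch below uses a small made-up data frame (toy, with invented score and group columns, not the fallowsurvey data) to show the same formula interface and how to pull out the p-value and the group means. Note that t.test runs the Welch version of the test by default, which does not assume equal variances in the two groups.

```r
# Toy data invented for illustration; this is not the fallowsurvey data.
toy <- data.frame(
  score = c(1.2, 1.8, 2.1, 1.5, 1.9, 3.0, 3.4, 2.8, 3.1, 3.6),
  group = rep(c("a", "b"), each = 5)
)

# Same formula interface as above: outcome ~ grouping variable.
result <- t.test(score ~ group, data = toy)

result$p.value   # the p-value as a plain number
result$estimate  # the two group means
result$conf.int  # confidence interval for the difference in means
```

Storing the result this way is handy when you want to report the exact p-value or reuse the group means later in a script.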

Exercise 2

Question 2.1

Now let us consider the relationship between ethnicity and whether the plot had been left fallow.

The recommendation is to leave a plot fallow in at least one out of every three seasons. So we are less interested in whether the plot was left fallow in the last season, or two seasons ago, than in whether it has been left fallow at all. So we will need to create a new variable first!

Question 2.1: Use mutate to calculate a new variable called "Fallow_Last3" which takes the value "Yes" if the plot has been left fallow in any of the last three years and takes the value "No" if the plot was not left fallow in any of the last three years

Solution

For this question it is recommended to use the ifelse function.

This works by providing a set of conditions which R will check, then we provide a value for if the condition is true, and one for if the condition is false.

Now we have 3 fallow variables, one for each year. In this question however, we want to create a new fallow variable which will be equal to "Yes" if the plot has been left to fallow in ANY of the last 3 years. Therefore, we need to supply 3 conditions, one for each year, and separate them using | to denote OR.

Then we write in "Yes", the value for if any of the conditions are true, and "No", the value for if all 3 are false.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No"))
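To see how the OR conditions and ifelse combine, here is a minimal sketch on three short made-up vectors (yr1, yr2 and yr3 are invented stand-ins for the three fallow columns). It works outside of mutate in exactly the same way.

```r
# Made-up answers for four plots over three seasons.
yr1 <- c("Yes", "No", "No", "No")
yr2 <- c("No", "No", "Yes", "No")
yr3 <- c("No", "No", "Yes", "No")

# | is evaluated element by element: TRUE if any of the three conditions is TRUE.
any_fallow <- yr1 == "Yes" | yr2 == "Yes" | yr3 == "Yes"

# ifelse() returns "Yes" where the condition is TRUE and "No" otherwise.
fallow_last3 <- ifelse(any_fallow, "Yes", "No")
fallow_last3
```

Only the second and fourth plots get "No", because they were not left fallow in any of the three seasons.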

Question 2.2

Question 2.2: Piping from the last step, and using the variable you created in the last step, calculate the percentages of farmers within each ethnicity who left their plot fallow in at least one of the last three seasons using the tabyl and adorn_percentages functions

Remember you can use the denominator = argument to switch between row and column (col) percentages

As a reminder, you can copy and paste the solutions from previous exercises to continue adding on from the last answer.

Solution

From here we can pipe into our functions from the janitor package, starting with tabyl, which gives us the frequencies.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    tabyl(ethnicity,Fallow_Last3)

Then we can add adorn_percentages to look at the proportions. By default this gives row proportions, which is what we want here because we are comparing fallow rates between the ethnicities (our rows).

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    tabyl(ethnicity,Fallow_Last3) %>%
      adorn_percentages()

It seems that Luo-headed households are leaving plots fallow at higher rates than other groups, with 45.8% leaving the plot fallow in the last 3 years compared to 35.5%.
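The denominator argument mentioned in the question can be tried out on a small table. This is a sketch on made-up data (toy, with the placeholder labels group_a and group_b), assuming the janitor package is installed. It also shows adorn_pct_formatting, which is one way to display the proportions as percentages.

```r
library(janitor)

# Made-up data; group_a and group_b are placeholder labels.
toy <- data.frame(
  ethnicity    = c("group_a", "group_a", "group_a",
                   "group_b", "group_b", "group_b", "group_b"),
  Fallow_Last3 = c("Yes", "No", "No", "Yes", "Yes", "Yes", "No")
)

counts <- tabyl(toy, ethnicity, Fallow_Last3)

# Row proportions: each ethnicity's row sums to 1.
row_props <- adorn_percentages(counts, denominator = "row")

# Column proportions: each Fallow_Last3 column sums to 1.
col_props <- adorn_percentages(counts, denominator = "col")

# Optional: display the row proportions as percentages with one decimal place.
adorn_pct_formatting(row_props, digits = 1)
```

Row proportions answer "what share of each ethnicity left a plot fallow?", while column proportions answer "what share of the fallow plots belong to each ethnicity?".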

Question 2.3

Question 2.3: Conduct a chi-square test to determine if there is a significant relationship between whether the farmer has left their plot fallow and the ethnicity of the head of household

Solution

A very simple mistake you may have made at first is to pipe the output of adorn_percentages directly into the chisq.test function.

This means the test is performed on the proportions rather than the frequencies. You will not receive an error; instead you will receive a result with a p-value of 1.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    tabyl(ethnicity,Fallow_Last3) %>%
      adorn_percentages()%>%
      chisq.test()

Instead we need to go back one step and remove the adorn_percentages line, and then move into the test.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    tabyl(ethnicity,Fallow_Last3) %>%
      chisq.test()

Again, we have a very small p-value of 0.000099. Therefore, we can reject the null hypothesis that there is no association between our two variables, as there is strong evidence to suggest a relationship between rates of fallow and the ethnicity of the household head.
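The proportions pitfall described in this solution can be reproduced with base R alone. The sketch below uses an invented 2 x 2 table of counts (not the survey results) and compares the test on the counts with the test on the row proportions of the same table.

```r
# Made-up 2 x 2 table of counts (rows = groups, columns = No / Yes).
counts <- matrix(c(30, 20, 10, 40), nrow = 2,
                 dimnames = list(group = c("a", "b"), fallow = c("No", "Yes")))

# Row proportions of the same table.
props <- prop.table(counts, margin = 1)

p_counts <- chisq.test(counts)$p.value

# R still runs the test on proportions (with a warning, silenced here),
# but it treats them as if they were counts from a tiny sample.
p_props <- suppressWarnings(chisq.test(props)$p.value)

p_counts  # small: the counts differ clearly between the groups
p_props   # large: the sample size information has been thrown away
```

The chi-square test needs the actual frequencies because the strength of evidence depends on how many observations sit behind each proportion.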

Exercise 3

Questions 3.1 / 3.2

Now consider whether there is a relationship between education status of the head of household and whether the plot was left fallow, using the variable you calculated in question 2.1. Produce some summary statistics and then conduct an appropriate hypothesis test.

3.1: First Produce the summary statistics

3.2: Now conduct an appropriate hypothesis test

You might notice a problem this time with conducting the chi-square test! Consider an appropriate way of remedying this problem using one of the three options discussed in the workbook: filter to remove categories, mutate and recode to combine categories, or fisher.test to use a non-parametric test (Fisher's exact test).

Solutions

3.1: First Produce the summary statistics

First let's take a look at our proportions using the same code as exercise 2.2 but with hheducation replacing ethnicity.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    tabyl(hheducation,Fallow_Last3) %>%
      adorn_percentages()

There is some slight variation: those who completed secondary education have a higher rate of fallow than those with a lower level of education. The exception is the "other" group, with a 50/50 split.

3.2: Now conduct an appropriate hypothesis test

Let's put this into our chi-square test and see what happens. We are given a warning that the approximation could be incorrect.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    tabyl(hheducation,Fallow_Last3) %>%
      chisq.test()

You may have noticed that if we look at our frequencies rather than proportions, our "other" category is vastly smaller than our other groups in the data and this could affect the accuracy of our chi-square test.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    tabyl(hheducation,Fallow_Last3)
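The warning is driven by the expected counts rather than the observed ones: R's chisq.test warns when any expected count is below 5. The sketch below uses an invented table with a deliberately tiny "other" row (these are not the survey frequencies) to show how the expected counts can be inspected.

```r
# Made-up counts; the "other" row is deliberately tiny.
counts <- matrix(c(40, 35, 2, 30, 45, 2), nrow = 3,
                 dimnames = list(education = c("primary", "secondary", "other"),
                                 fallow = c("No", "Yes")))

# The warning is silenced here so that the result can be stored and examined.
res <- suppressWarnings(chisq.test(counts))

# Expected counts under the null hypothesis of no association.
res$expected

# TRUE if any cell falls below the usual rule-of-thumb threshold of 5.
any(res$expected < 5)
```

Looking at the expected counts tells you which category is causing the problem, which helps when deciding whether to drop it or merge it with another.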

As mentioned in the hint we have three valid options for dealing with this.

First we could use a filter and remove the "other" group.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
  filter(hheducation !="other") %>%
    tabyl(hheducation,Fallow_Last3) %>%
      chisq.test()

We no longer receive that warning, though our result is fairly similar. We fail to reject the null hypothesis: there is no evidence of an association between education and fallow rates.

Alternatively, we could use mutate to merge categories. Personally, I think it makes the most sense to merge "other" into secondary education, as "other" likely consists of those with vocational training, apprenticeships or university education. This gives us a "secondary or higher" category.

We can use ifelse to edit this variable. As a tip, if you want the values of a variable to stay the same when the condition is false, you can type in the name of the variable as the final argument.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
  mutate(hheducation = ifelse(hheducation == "other" | hheducation == "secondary", "secondary_higher", hheducation)) %>%
    tabyl(hheducation,Fallow_Last3) %>%
      chisq.test()

Again we reach the same non-significant conclusion.
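One caveat with the ifelse approach to merging categories: it relies on the variable being stored as text. Whether hheducation is a character column or a factor in fallowsurvey is not stated here, so treat the following as a precaution rather than a known problem. If the column were a factor, ifelse would return the underlying level codes for the unchanged values; wrapping the variable in as.character avoids that. A minimal sketch with a made-up factor (edu is an invented name):

```r
# Made-up factor standing in for an education column.
edu <- factor(c("primary", "secondary", "other"))

# Risky with a factor: the unchanged values fall back to integer level codes.
risky <- ifelse(edu %in% c("other", "secondary"), "secondary_higher", edu)

# Safer: convert to character first so the original labels are kept.
merged <- ifelse(edu %in% c("other", "secondary"), "secondary_higher",
                 as.character(edu))

risky
merged
```

Running class(fallowsurvey$hheducation) is a quick way to check which case applies before recoding.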

Finally, we can keep the data the same but instead use the non-parametric fisher.test function. This is useful when dealing with small frequencies in some groups.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    tabyl(hheducation,Fallow_Last3) %>%
      fisher.test()

Though we reach the same conclusion in this example, this will not always be the case.

Exercise 4

Question 4.1

Finally, let's consider the relationship between wealth and the fallow status of the plot, again using the "Fallow_Last3" variable you created in exercise 2. Produce summary statistics and plots, and then choose an appropriate statistical test to investigate the relationship and interpret the results.

4.1: Produce some summary statistics

Solution

First let's look at some summary statistics, using code similar to what we have already used in these exercises.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    group_by(Fallow_Last3) %>%
      summarise(mean(wealth),sd(wealth))

So it would appear that those who are practising fallow have a higher wealth status on average: a mean of 2.2 compared to 1.9.

Question 4.2

4.2: Create some summary plots

Solution

Let's make this into a boxplot to visualise this difference and the spread of the data.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    ggplot(aes(y=wealth,x=Fallow_Last3))+
      geom_boxplot()

Question 4.3

4.3: Investigate and interpret the results using an appropriate statistical test. Consider if what you have done makes sense

Hint: you may want to think about transforming one of the variables using mutate

Solution

Now let's put this into a t-test. Remember we can pipe into t.test, but we need to specify the data argument explicitly with data = . (the dot stands for the data being piped in).

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    t.test(wealth~Fallow_Last3,data=.)

Following the t-test we have a p-value of 0.000025. This is a highly significant result, but does it make sense that a farm's wealth would depend on its practice of fallow?

We can imagine there being some association but it might make more sense the other way around, that a farm's fallow practices are dependent on their wealth. A poorer family may not wish to leave a plot to fallow as they may see it as a loss of much needed income while a wealthier family can afford the financial hit.

Therefore, let's look for just a general association between wealth and fallow without making an assumption about the direction of dependence. We can do this by turning wealth into a categorical variable.

We can use the cut function within a mutate to achieve this. I am going to set the breaks so that -1 to 1.5 is our low group, 1.5 to 3 is medium, and anything above 3 (up to 99, though the max is 5.25) is our high group. I have also supplied some labels for these groups to make it a bit clearer.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    mutate(wealth_cat=cut(wealth,breaks=c(-1,1.5,3,99), labels = c("Low", "Medium", "High"))) %>%
      tabyl(wealth_cat,Fallow_Last3) %>%
        adorn_percentages()
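To check how cut assigns values to the groups, it can help to run it on a handful of numbers first. The sketch below uses made-up wealth scores (not taken from the survey) with the same breaks and labels as above, including values that sit exactly on a break.

```r
# Made-up wealth scores, including values that sit exactly on a break.
wealth <- c(-0.5, 1.2, 1.5, 2.0, 3.0, 3.5)

# Intervals are right-closed by default: (-1, 1.5], (1.5, 3], (3, 99].
wealth_cat <- cut(wealth, breaks = c(-1, 1.5, 3, 99),
                  labels = c("Low", "Medium", "High"))

# Show each score next to the group it was assigned to.
data.frame(wealth, wealth_cat)
```

A score of exactly 1.5 lands in "Low" and a score of exactly 3 lands in "Medium". Any value at or below the lowest break (-1 here) would become NA with these settings, so it is worth checking the minimum of the variable before choosing the breaks.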

There appears to be an association whereby rates of fallow increase as wealth increases.

Let's feed this into a chi-square test.

fallowsurvey %>%
  mutate(Fallow_Last3=ifelse(Fallow_yr1=="Yes"|Fallow_yr2=="Yes"|Fallow_yr3=="Yes","Yes","No")) %>%
    mutate(wealth_cat=cut(wealth,breaks=c(-1,1.5,3,99))) %>%
      tabyl(wealth_cat,Fallow_Last3) %>%
        chisq.test()

We have a very small p-value of 0.00011, so we can reject the null hypothesis and conclude there is strong evidence of an association between wealth and the practice of fallow.
