Statistical Analysis: Exercises and Solutions 2

Linear Modelling: Exercises

In these exercises we are going to look at conducting some linear models tests in R, using a new dataset.

We have explored a number of questions graphically, with summary statistics, and with simple hypothesis tests. Now we are going to move onto linear models so we can learn more about the relationships within our data and their statistical significance.

Datasets

We are going to start with a seemingly simple example from a data set which is built into R from the agridat library. The dataset is called beaven.barley

This dataset is from an experiment of 8 different barley varieties which compares the yields. Each variety is replicated 20 times in this experiment on the same field.

The grid layout of the field is captured by the row and column fields, the different barley varieties are in the gen column and the yields are in the yield column.

DT::datatable(beaven.barley)

Exercise 1

Question 1

To begin with let us assume our observations in beaven.barley are independent - they are all on the same field after all!

Let's start with some simple summary plotting ...

Question 1 : Make a boxplot showing yields (yield) by genotype (gen) using ggplot. Interpret what this is showing

Please press "Continue" to reveal the solution

Solution 1

Starting with a basic boxplot using the lessons learned from other modules. Plotting the categorical genotype (gen) along the x axis and the continuous yield along the y axis.

ggplot(beaven.barley,aes(y=yield,x=gen))+
  geom_boxplot()

Our plot shows some considerable variation in the means and data spreads across our genotypes. c seems to have the highest yield on average and also the smallest spread. While h has the lowest yield on average though with one of the highest ranges of values.

Please press "Continue" to move to the next question

Question 2

Now let's add some summary statistics ...

Question 2 : Calculate the mean and standard deviations of yield by genotype using group_by and summarise

Please press "Continue" to reveal the solution

Solution 2

beaven.barley %>%
  group_by(gen) %>%
    summarise(mean = mean(yield),
              standard_deviation = sd(yield))

Our summary statistics reflect what we could see in our box plot with genotypes c,d and g having higher on average yields while f and h are relatively low on average.

Please press "Continue" to move to the next question

Question 3

Now we have explored the data, it is time to start fitting our model

Question 3: Fit an ANOVA model of yield (yield) by genotype (gen) using the lm function. Store it as an object called barley1`

barley1<-lm()

Please press "Continue" to reveal the solution

Solution 3

First let's set up how we will store this model. Remember that whenever saving something as an object to use the <- assign operator.

As for the model code, for a simple ANOVA model, we use the lm (linear model) function. Within this we start with our model formula. We put our dependent variable (y) on the left, followed with a tilda ~ to denote a formula, we then put in our categorical independent variable (x). So in this case our formula becomes yield~gen.

Finally, unlike in other functions we have seen throughout these modules, we end the function with our data argument.

As we are storing this as a object there will be no output. But you can type the model name by itself to show a quick breakdown of the model coefficients.

barley1<-lm(yield~gen,data=beaven.barley)

barley1

Please press "Continue" to move to the next question

Question 4

But what does this model show us?

Question 4: Use the summary and anova functions on the model (barley1) to provide some statistics about the results. What do these tell you?

Please press "Continue" to reveal the solution

Solution 4

First, let's look at summary.

summary(barley1)

This provides a general summary of our model including the coefficients, their significance, data about the spread of residuals, the model R-squared and the model's significance.

In this case we can make a few inferences from this table. Firstly, it appears none of our individual coefficients are significant, this may seem bad but if you remember all coefficients are the differences in average yield relative to our reference level.

In this case our reference level is genotype a, since by default R will order them alphabetically. If you think back to our boxplots, genotype a looked somewhat reflective of the overall average. So this not being statistically different at the 5% level to any other genotype is not too surprising. It would be better to consider the significance of the variable as a whole.

We can do this by looking at the anova function instead.

anova(barley1)

Here we are considering how genotype as a whole is explaining variation, not just each genotype individually. We have a p-value of less than 0.05 suggesting that genotype is explaining a significant amount of variation in yield.

Please press "Continue" to move to the next question

Question 5

Question 5: Use the plot function to check the residuals. Does anything look problematic?

Please press "Continue" to reveal the solution

Solution 5

plot(barley1)

Let's scroll through each plot individually.

In our first plot we are plotting the residuals against the fitted values from the model to test for linearity of the errors (residuals). As with many residual plots, we are looking for no pattern and a straight line across the middle. We seem to be satisfying this assumption pretty well, we only have 3 observations being tagged as perhaps being slightly off compared to the rest of the data but these should not being having a significant effect.

Secondly, we are testing that the residuals are normally distributed. To satisfy this we are looking for the points to match closely along the dotted line running diagonally across the space. We have some slight deviation at the bottom and lower ends but the majority of points are fitting fairly nicely along the diagonal. Again, we are not in violation of this assumption.

Next we are looking for the residuals to have a constant variance. Once again, we want a straight line across the middle. No increasing patterns or funnel shapes. There is a slight dip at the higher end but this is not a big enough deviation to be problematic.

Finally we are looking at the leverage of observations. This will tell us if there are any outlying observations that have an inordinate amount of influence over our results. A few observations have been flagged for having higher residuals, but they are within expected ranges.

So overall, we are not in violation of any of the model assumptions.

Please press "Continue" to move to the next question

Question 6

Question 6: Use the emmeans() function to obtain the estimated means and confidence intervals for each genotype.

Please press "Continue" to reveal the solution

Solution 6

First let's look at emmeans, this will provide the means and confidence intervals for the means as they are estimated from the model.

We first put in the name of our model, then we use a tilda ~ and name our genotype variable as we wish to see the means split by this variable.

emmeans(barley1,~gen)

You may notice that the means match up with the means we found earlier when using group_by and summarise. This is what should happen when calculating a simple model with only one predictor variable. But now we importantly have confidence intervals attached to these estimates.

Please press "Continue" to move to the next question

Question 7

Question 7: Take your answer from Question 6 and pipe this into the cld() function to conduct a mean separation analysis

Please press "Continue" to reveal the solution

Solution 7

Now let's use cld to compare the individual means, statistically.

emmeans(barley1,~gen) %>%
  cld()

Remember that we interpret the .group column such that any pairs which do not have a matching group number as being statistically different from one another. In our case, we have one significant difference, genotype h (272) is significantly lower than genotype c (319). Every other pairwise comparison results in a non-significant result.

Please press "Continue" to move to the next exercise

Exercise 2

Question 1

When we completed exercise 1 we assumed all points were independent and the layout factors were not relevant. Let's investigate this further, it may not have been the case.

Question 1 : Using the tabyl function produce two tables to show the frequencies of how many times each genotype appears in each row (table 1) and each column col (table 2)

#Table 1
beaven.barley %>%
  tabyl()

#Table 2
beaven.barley %>%
  tabyl()

You should see that each genotype appears 4 times per row. Not all genotypes appear in all columns but there is clearly an attempt to make a balanced design.

Please press "Continue" to reveal the solution

Solution 1

Question 1 : Using the tabyl function produce two tables to show the frequencies of how many times each genotype appears in each row (table 1) and each column (table 2)

First let's look at the row variable.

beaven.barley%>%
  tabyl(gen,row)

As stated in the exercises you should now see that each genotype is present in all five rows, four times.

But let's look at the columns.

If every column also contained an instance of each genotype then we would have all 1s in each cell. But we have many 0s, meaning while every row contains 4 instances of each genotype, no column contains every genotype (this shouldn't be surprising given we have 5 plots per column but 8 genotypes). Therefore, the trial is not fully balanced but an attempt was made to create something close.

But this means we should start to consider how this could have impacted our assumptions about independence.

beaven.barley%>%
  tabyl(gen,col)

Please press "Continue" to move to the next question

Question 2

Question 2: Make two boxplots showing the yield distributions in each row (1) and then in each column col (2).What do you notice? Hint - the warning message you may see is trying to help you! Pay attention to what it says.

#Plot 1

#Plot 2

Please press "Continue" to reveal the solution

Solution 2

Question 2: Make two boxplots showing the yield distributions in each row (1) and then in each column (2).What do you notice? Hint - the warning message you may see is trying to help you! Pay attention to what it says.

If you received a warning message, this will be because if you only specified the x and y aesthetics R will be unsure what to do about x as row is a continuous variable. The warning message should help you by suggesting the addition of the group aesthetic. So we can add this to our aesthetics, and set it to be the row variable. You should then see individual boxes instead of one large one.

ggplot(beaven.barley,aes(y=yield,x=row,group = row))+
  geom_boxplot()

So along our rows, we can see a bit of variation though without any massive deviations. But of course we have equal numbers of each genotype in the rows so we should expect minimal variation since genotype is technically controlled for. Therefore, the variation we are seeing is associated with the row placement instead.

Now let's consider our columns where we know that the genotype selections are not consistent.

ggplot(data=beaven.barley,aes(y=yield,x=col,group=col))+
  geom_boxplot()

There is a LOT of variation here, we even see a seemingly increasing trend of the location of our boxes whereby those in the higher number columns generally have a higher yield than those in the lower numbered columns. Some of this variation will be due to the differing genotype compositions but this increasing trend does seem curious as looking back at our table of genotype by col, there does not seem to be an obvious pattern in the genotype allocation by column.

These boxplots suggest we cannot conclude that results are completely independent of the plot's location and we should consider this in our modelling approach.

Please press "Continue" to move to the next question

Question 3

Question 3: Let's now fit a model using lmer and save it an object called barley2. The random effects for this row column layout should be specified as one random effect for row, and one for column (col). So the random part of the model code should look like (1|row)+(1|col)

barley2<-lmer(????????+(1|row)+(1|col),??????)

Please press "Continue" to reveal the solution

Solution 3

Question 3: Let's now fit a model using lmer() and save it an object called barley2. The random effects for this row column layout should be specified as one random effect for row, and one for column. So the random part of the model code should look like "(1|row)+(1|col)"

Most of the code remains the same as the model we created in the first exercise but we now add the random effects using the code in the question.

barley2<-lmer(yield~gen+(1|row)+(1|col),data=beaven.barley)

Please press "Continue" to move to the next question

Question 4

Question 4: Use the summary function to investigate this model barley2, and compare it to the results from exercise 1. What do you notice? Can you explain the differences?

Please press "Continue" to reveal the solution

Solution 4

Question 4: Use the summary function to investigate this model barley2, and compare it to the results from exercise 1. What do you notice? Can you explain the differences?

summary(barley2)

We have plenty of information provided so let's take this slowly. We have our fixed effects for genotype in the middle, presented in the same fashion as our first model back in exercise 1.

Compared to exercise 1, where we found no significant differences at the 5% level between genotype a and the others, we now find some significant results. Namely that genotypes f and h seem to be significantly lower than genotype a once accounting for the variation related to the row and column placements.

Looking at the random effects table we can see breakdown of variation that can be explained by our random effects of col and row. Here we have standard deviations of 35.36 and 14.42 respectively, suggesting that the col part of our design is having a larger impact on the variance of yield than row. Our residual error term of 28.61 sits roughly in the middle of these values, suggesting our design is having quite an important influence on yield.

Please press "Continue" to move to the next question

Question 5

Question 5: Retrieve the model R-squared using r.squaredGLMM. What is your interpretation of the result?

Please press "Continue" to reveal the solution

Solution 5

Question 5: Retrieve the model R-squared using r.squaredGLMM. What is your interpretation of the result?

Let's look at our r2 for some more information

r.squaredGLMM(barley2)

So remember that R2m tells us the amount of variation explained by our fixed effects gen, this is 9.6% for this model.

R2c then tells us the amount explained by both our random effects and the fixed effects. This rises to 67.5% percent, a rise by nearly 60 percentage points. Suggesting that a large amount of our variation is actually explainable by the plot arrangement and therefore our experimental design.

Please press "Continue" to move to the next question

Question 6

Question 6: Use the anova function to investigate this model, and compare it to the results from exercise 1

Please press "Continue" to reveal the solution

Solution 6

Question 6: Use the anova function to investigate this model, and compare it to the results from exercise 1

We can now plug this into an anova table

anova(barley2)

Remember that we will only see the significance of our fixed effects. As with the exercise 1 model, our genotype is significant overall.

Please press "Continue" to move to the next question

Question 7

Question 7: Now check the assumptions of barley2 using the plot. Are we happy with the model?

Please press "Continue" to reveal the solution

Solution 7

Question 7: Now check the assumptions of barley2 using the plot. Are we happy with the model?

We can now take a look at our assumptions plot, while only 1 plot is produced unlike the 4 we got with our simple ANOVA model, we can make inferences about most of our assumptions.

We can see no changing shape or spread to our points, so we can assume no violation of equal variance. There is perhaps one observation out of line with the rest of the data with a much higher residual but i think this is not problematic in terms of the leverage of certain observations on the model results.

plot(barley2)

Please press "Continue" to move to the next question

Question 8

Question 8: Carry out the mean separation analysis on the new model barley2, to obtain adjusted means, confidence intervals, and the letter display. Compare again to the results from exercise 1 - what do you notice?

Please press "Continue" to reveal the solution

Solution 8

Question 8: Carry out the mean separation analysis on the new model, to obtain adjusted means, confidence intervals, and the letter display. Compare again to the results from exercise 1 - what do you notice?

emmeans(barley2,~gen) %>%
  cld()

There are quite a few differences to our exercise 1 results.

There we only had one significant difference, genotype h and c. That difference is still true but now genotype h is also different from genotypes e,b,g and a. Only not being significantly different from genotypes f and d which were also below 300 on average. Additionally, genotype f is significantly different from genotypes e and c. While genotype d is not statistically different from any other genotype.