Analysis of Variance
In the previous tutorial we looked at how to start with a simple linear regression in R using the lm function.
Simple linear regression can be used when we have a continuous outcome variable (like yield) and a continuous predictor variable (like striga). Often when being taught statistics we are then taught separately about Analysis of Variance (ANOVA), for when we have a continuous outcome variable and a categorical predictor variable, as if this were a completely different method. However, mathematically, theoretically and in terms of how R treats them, these are identical methods. There is a nice explanation of this here.
Both of these methods are examples of 'general linear models', and to fit an ANOVA model we use nearly all the same functions as we used in the previous section. We will use some more functions to explore our model results this time, and these will use some additional packages to help make interpretations across the groups. We will use the emmeans and multcomp packages to estimate marginal means and then conduct a mean separation analysis.
So let's begin and conduct a one-way analysis of variance to compare whether there is any evidence of differences in average yield across the different treatment groups.
Data
The data used in this example is from a study conducted in Eastern Zambia with the main objective of improving the efficiency of natural fallows by using appropriate trees that may have relevance in soil fertility regeneration within permissible fallow periods.
The data has been read into R and assigned to an object called fallow.
The design was a randomised complete block design (RCBD) with 4 blocks and 9 treatments. If you take a look at the data below you will see that we have measurements of the yield (the column yield within the data) and the striga count (striga) of a plot (the column plot), within a block (the column rep), after a specific treatment (the column treat).
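If you want a quick look at the data on your own computer, base R functions such as head() and str() will do this (assuming the fallow object has been created as described above):
head(fallow)   # first six rows of the data
str(fallow)    # column names and types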
Packages used
In this tutorial we will use four new packages: emmeans, multcomp (together with its companion package multcompView), lmerTest and MuMIn; in addition we will use some of the packages we have used regularly so far in this course, dplyr and ggplot2. Make sure these are all installed and loaded before running any of the code on your own computer.
library(ggplot2)
library(dplyr)
library(lmerTest)
library(multcomp)
library(multcompView)
library(emmeans)
library(MuMIn)
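If any of these are missing on your machine, they can be installed from CRAN first; for example:
install.packages(c("ggplot2", "dplyr", "lmerTest", "multcomp",
                   "multcompView", "emmeans", "MuMIn"))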
Summary Statistics
When we have a grouping factor like treat, the summary statistics and plots we might want to make are very similar to what we have been doing so far. Try to replicate the table below showing the mean and standard deviation of yield by treatment. If you have not taken our module on data manipulation, feel free to check the solution.
## # A tibble: 9 × 3
## treat `mean(yield)` `sd(yield)`
## <chr> <dbl> <dbl>
## 1 1 S.sesban 5.55 0.763
## 2 2 G.sepium 3.84 0.496
## 3 3 L.leuco 3.68 0.369
## 4 4 F.congesta 3.46 0.241
## 5 5 C.siamea 2.10 0.271
## 6 6 C.calo 2.59 0.173
## 7 7 groundnut 3.08 0.741
## 8 8 maize 1.98 0.632
## 9 9 nat fallow 2.81 0.403
fallow %>%
group_by(treat) %>%
summarise(mean(yield),sd(yield))
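As an optional refinement, naming the summaries gives tidier column headers than mean(yield) and sd(yield), and n() adds a count of observations per group; a minimal sketch:
fallow %>%
  group_by(treat) %>%
  summarise(mean_yield = mean(yield),   # named columns print more cleanly
            sd_yield = sd(yield),
            n_plots = n())              # number of observations per treatment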
Summary plots
Given we only have four observations per treatment, it may not be appropriate to use a boxplot here, as the number of observations per group is very small. So in this case plotting using a point geometry may be a better way of visualising the data. You can review our tutorials on data visualisation if needed.
ggplot(fallow,aes(y=yield,x=treat))+
geom_point()
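If you would also like to mark the treatment means on the same plot, one option (purely a sketch, not needed for this tutorial) is ggplot2's stat_summary():
ggplot(fallow, aes(y = yield, x = treat)) +
  geom_point() +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3)   # red points show group means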
Both the summary statistics and the plot above show us pretty clearly that treatment "1 S.sesban" provides higher yields than the other treatments. The lowest yield for this treatment is still higher than any of the yields from any of the other treatments. Among the other 8 treatments we also see some which are performing less well - such as treatments "5 C.siamea" and "8 maize", which have consistently low yields.
Specify a model for data
The syntax and function for creating an ANOVA model are identical to a regression model. The only difference is that the predictor variable must be a categorical variable (like treat) rather than a continuous variable (like striga). So - let's use the same lm() function. Like before, when we run this by itself we don't get much useful output.
lm(yield~treat,data=fallow)
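One quick check worth making: lm() will only fit this as an ANOVA if treat is a character or factor column. If treat had instead been coded as plain numbers, lm() would fit a regression slope. You can check, and convert if needed:
class(fallow$treat)                      # should be "character" or "factor"
# fallow$treat <- factor(fallow$treat)   # convert if it turns out to be numeric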
So, exactly as in the previous section, we nearly always want to assign this to an object and then use various functions to investigate the model further. summary() again provides us with the model summary information.
anova_mod<-lm(yield~treat,data=fallow)
summary(anova_mod)
##
## Call:
## lm(formula = yield ~ treat, data = fallow)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02250 -0.25000 0.02625 0.21125 0.99250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5475 0.2492 22.262 < 2e-16 ***
## treat2 G.sepium -1.7025 0.3524 -4.831 4.80e-05 ***
## treat3 L.leuco -1.8725 0.3524 -5.313 1.31e-05 ***
## treat4 F.congesta -2.0875 0.3524 -5.923 2.59e-06 ***
## treat5 C.siamea -3.4425 0.3524 -9.768 2.34e-10 ***
## treat6 C.calo -2.9600 0.3524 -8.399 5.20e-09 ***
## treat7 groundnut -2.4650 0.3524 -6.995 1.61e-07 ***
## treat8 maize -3.5725 0.3524 -10.137 1.06e-10 ***
## treat9 nat fallow -2.7375 0.3524 -7.768 2.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4984 on 27 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8047
## F-statistic: 19.03 on 8 and 27 DF, p-value: 3.144e-09
The output looks a little different to the simple linear regression output. Many things are the same, and have the same interpretation. The R-squared value (84.9%) and residual standard error (0.4984) are in the same positions and have the same interpretations as before.
However we now have a lot more coefficients - and these have a slightly different meaning when considering a categorical variable in an ANOVA model. These coefficients represent the difference between the mean values in each treatment and the mean value of the reference level. The reference level, by default, is the first level of our categorical variable when sorted alphabetically - in this case treatment "1 S.sesban". The intercept value represents the mean value within this reference level.
Therefore the p-values shown in the summary output represent significance tests of hypotheses we may not be directly interested in. The test for the intercept is testing the hypothesis that the mean value of treatment 1 is 0. This is probably not a useful hypothesis. The tests for each of the coefficients are comparing whether each treatment is significantly different from treatment 1. This is somewhat useful, but it only tells us about the treatments in relation to the reference level. It cannot tell us about say, treatment 2 vs treatment 7, and it cannot tell us whether there is an overall effect of treatment. Although in this case the answer to that should be pretty clear!
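To make the link between coefficients and group means concrete: the mean for treatment "2 G.sepium" is the intercept plus its coefficient, 5.5475 - 1.7025 = 3.845, matching the summary table earlier. As a small sketch, we can recover all nine group means from the coefficients:
coefs <- coef(anova_mod)
coefs[1] + c(0, coefs[-1])   # reference-level mean first, then each of the other group means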
This is why we also often see an ANOVA table to show whether there is an overall effect of a categorical variable. The null hypothesis in this ANOVA table is that all of the treatments have the same mean yield. If there is a significant p-value then this suggests that at least one of the treatments has a different mean yield from the others. We can access the ANOVA table for a model by using the anova() function.
anova(anova_mod)
Unsurprisingly, this tells us that we have an overall significant treatment effect (p = 3.144e-09). Remember that this is in scientific notation; another way of writing this number would be 0.000000003144.
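As an aside, if you ever want R to print a p-value like this as a plain decimal rather than in scientific notation, the format() function can do it:
format(3.144e-09, scientific = FALSE)
## [1] "0.000000003144"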
Checking Assumptions
We have all of the same assumptions to check as when we ran the linear regression earlier, and we can assess the model checking plots using plot(). The only difference is that we are no longer checking for linearity, since we have 9 separate groups rather than a linear trend. But otherwise we still want to check whether there are issues with heterogeneity of variance, high-leverage values or outliers, and whether there are severe violations of normality.
plot(anova_mod)
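A small practical note: in the console, plot() shows the four diagnostic plots one at a time (press Enter to step through them). To see them all in a single panel you can temporarily change the plotting layout:
par(mfrow = c(2, 2))   # arrange plots in a 2x2 grid
plot(anova_mod)
par(mfrow = c(1, 1))   # reset to one plot per page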
Again - there are no clear violations of the assumptions being shown. However, as with the linear model, we may consider the possibility that we have a violation of the independence assumption. This would depend on exactly how the trial was designed:
If we had one large field and each of the 36 plots was allocated randomly to the 9 treatments, a completely randomised design, then there would be no violation of independence; all of the plots would come from the same population.
If we had 4 'blocks' in different places across our station, and each block had one plot for each treatment, a completely randomised block design, then we would have a violation of independence. Since there would be four physical 'blocks' - we may expect plots in Block 1 to be similar to each other, and different to plots in Block 2. Whether or not this is likely to be an issue may depend on how variable the conditions are across the different blocks.
If we had 4 farmers, and each farmer planted one plot of each treatment, this is definitely a violation of the independence assumption. We cannot consider farmers to be replicates!
In a later tutorial we will introduce mixed models to show you how to extend these models to allow for this.
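As a preview only, and assuming the column rep identifies the block, a minimal sketch of such a model using the lmerTest package we loaded earlier would look like this:
# block (rep) fitted as a random effect; treat remains a fixed effect
mixed_mod <- lmer(yield ~ treat + (1 | rep), data = fallow)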
Model Outputs and Post-Hoc tests
In agricultural analyses it is very common to conduct post-hoc analyses from ANOVA models, often referred to as 'mean separation'. We can do this using the emmeans and multcomp libraries.
First though, we may want to make some plots - showing the mean values of each of our treatments and including some confidence intervals around these estimates. We can do this using the emmip() function.
This function needs first the name of the model (in this case anova_mod), then a tilde ~, followed by the name of the grouping variable (in this case ~treat).
emmip(anova_mod,~treat)
By default this plot does not include confidence intervals around the means; we can add them by setting the argument CI=TRUE.
emmip(anova_mod,~treat,CI=TRUE)
This is a really nice function for making plots from models when we increase the complexity of our models, so it is worth learning about now!
The data in this plot can be extracted using the emmeans() function. We don't need to ask for confidence intervals this time - they appear anyway!
emmeans(anova_mod,~treat)
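A useful aside: emmeans can also return all the pairwise treatment comparisons directly, which answers questions like treatment 2 vs treatment 7 that the summary output could not. A minimal sketch:
emmeans(anova_mod, pairwise ~ treat)   # estimated means plus all pairwise contrasts (Tukey-adjusted)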
If you wanted to customise the output of emmip so that it looks nicer for your presentations, then you can obtain the numbers underlying the plot from this call to emmeans and then pipe it into ggplot. However we first need to convert the output into a data frame. This is a good opportunity to show you geom_errorbar. This geom requires two additional aesthetics we have not used in our data visualisation tutorials - ymax to represent the top of the error bars and ymin for the bottom of the error bars.
emmeans(anova_mod,~treat) %>%
data.frame() %>%
ggplot(aes(y=emmean,x=treat,ymin=lower.CL,ymax=upper.CL))+
geom_point()+
geom_errorbar()
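By default the error bars span the full width of each category on the x axis. If you prefer narrower caps, geom_errorbar has a width argument; for example:
emmeans(anova_mod,~treat) %>%
  data.frame() %>%
  ggplot(aes(y=emmean,x=treat,ymin=lower.CL,ymax=upper.CL))+
  geom_point()+
  geom_errorbar(width=0.3)   # narrower caps on the error bars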
It is from this output that we can then conduct the mean separation analysis, by piping from this output into the cld() function. "cld" stands for "compact letter display", which is an alternative name for mean separation.
emmeans(anova_mod,~treat) %>%
cld()
By default this compares groups using the Tukey method, although others are available if you want them. The idea is that any treatments which do not have any overlapping entries in the .group column can be concluded to be significantly different. So here we can see that treatment 1 is significantly higher yielding than all other treatments; treatment 2 is significantly higher yielding than treatments 8, 5 and 6; and so on.
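If you would rather use a different multiple-comparison adjustment than Tukey, cld() accepts an adjust argument; for example, with a Bonferroni correction (shown as a sketch):
emmeans(anova_mod,~treat) %>%
  cld(adjust="bonferroni")   # Bonferroni rather than the default Tukey adjustment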
I tend to find agricultural researchers don't like seeing numbers to denote the groups - and much prefer letters. We can achieve this by adding the option Letters=letters. Note the case of this option!
emmeans(anova_mod,~treat) %>%
cld(Letters=letters)
You can pipe these letters onto your plot with errorbars similarly to what we saw in the previous section. And this is now a good opportunity to show you another geom - geom_text, for adding text onto a plot. This requires another aesthetic - label, to represent the column containing the text labels to be plotted.
emmeans(anova_mod,~treat) %>%
cld(Letters=letters) %>%
data.frame() %>%
ggplot(aes(y=emmean,x=treat,ymin=lower.CL,ymax=upper.CL,label=.group))+
geom_point()+
geom_errorbar()+
geom_text()
To make the labels sit a little to the left of the points I am going to use an additional argument within geom_text to move the x position of the label. At the moment it is plotted over the x and y co-ordinates as defined in the aesthetics, but we can use nudge_x and/or nudge_y to move this a little.
emmeans(anova_mod,~treat) %>%
cld(Letters=letters) %>%
data.frame() %>%
ggplot(aes(y=emmean,x=treat,ymin=lower.CL,ymax=upper.CL,label=.group))+
geom_point()+
geom_errorbar()+
geom_text(nudge_x=-0.2)
In the next tutorial we will introduce multiple linear regression, where we will start adding more than one explanatory variable.