
Generalised Linear Model

In a previous module, we showed you how to create a simple ANOVA model using R. Using anova() we created a model to assess how yields varied amongst treatments within an experimental trial. However, we identified that the assumption that all of our observations are independent may not be reasonable within our design.

This is very common in agricultural analyses, which often have specific layouts or involve working with multiple farmers - either of these will invalidate the assumption that all observations are independent. In these sorts of cases we need to move from a "general linear model" to a "generalised linear model" which can account for these structures. This is also known as a mixed model, since it contains "fixed effects" (variables we are interested in) and "random effects" (variables outlining the structure or clustering of observations in the data). You can find some materials on mixed models in the following two links; we recommend the latter if you wish to see a more agriculturally focused example.

UCLA

E-book

Mathematically this is quite a bit more complicated - but thankfully in R moving to this model is not very different at all; you will see many of the same functions as in the previous sections. We will use the same libraries we have seen so far in other parts of this module, but now also add the lmerTest library to be able to fit a "linear mixed effects regression" (lmer()) model. This approach is incredibly flexible and lets us take into account any of the standard agricultural experimental designs, as long as we are able to code the random effects appropriately.

But it is also the exact same approach we need for data which comes from less structured designs - from farm trials and surveys - where we have different forms of clustering (farmers within the same village) or pseudo-replication (the same farmer with multiple plots). It can be tricky to work out exactly how to structure the code for your model, so don't be afraid to ask for help if you are unsure! There are resources at the end of this document which cover many of the more common agricultural designs - but the functions used will all be the same. This even applies to cases where we start to use non-normal models - such as Poisson regression for count data, or logistic regression for binary data. In these cases we use more or less the same syntax, and all the same functions, but now with the glmer() function, which can account for these changes to the type of variable we use as a response.
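As a sketch of what this looks like, a Poisson mixed model for the striga counts in the data used later in this tutorial might be specified as follows (assuming the fallow data frame has been loaded, with columns striga, treat and rep):

```r
# lmerTest loads lme4, which provides glmer()
library(lmerTest)

# Poisson mixed model for count data, keeping the same
# random block effect as the yield model shown later
striga_mod <- glmer(striga ~ treat + (1 | rep),
                    family = poisson, data = fallow)
summary(striga_mod)
```

Note that only the function name and the family argument change; the formula syntax for the fixed and random effects is identical to lmer().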

It does not matter at all if we have continuous predictor variables, or categorical predictor variables, or a mixture of the two. All these models are fitted and interpreted in R in more or less exactly the same way.

You can check the next two pages if you wish to remind yourself of the data and packages we have been using in this statistical analysis course.

Once done with this tutorial, you can move on to our second set of statistical analysis exercises.

Data

The data used in this example is from a study conducted in Eastern Zambia with the main objective of improving the efficiency of natural fallows by using appropriate trees that may have relevance in soil fertility regeneration within permissible fallow periods.

The data has been read into R and assigned to an object called fallow.

The design was an RCBD (randomised complete block design) experiment with 4 blocks and 9 treatments. If you take a look at the data below you will see that we have measurements of the yield (the column yield) and the striga count (striga) of each plot (the column plot), within a block (the column rep), after a specific treatment (the column treat).
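As a quick sanity check of the design (a sketch, assuming fallow has been loaded with the columns described above), we can look at the first few rows and cross-tabulate treatments against blocks - an RCBD with 9 treatments and 4 blocks should give one plot per combination, 36 in total:

```r
# First few rows of the data
head(fallow)

# One plot for each treatment x block combination in an RCBD
table(fallow$treat, fallow$rep)
```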

Packages used

In this tutorial we will use five new packages: emmeans, multcomp, multcompView, lmerTest and MuMIn; in addition we will use some of the packages we have used regularly so far in this course, dplyr and ggplot2. Make sure these are all installed and loaded before running any of the code on your own computer.

library(ggplot2)
library(dplyr)

library(lmerTest)
library(multcomp)
library(multcompView)
library(emmeans)
library(MuMIn)

Fitting the model

In this design, an RCBD (randomised complete block design), we have one treatment factor, "treat", and one layout factor, "rep". If we want to take into account this blocked layout structure, we can't use the lm() function anymore; instead we need to include the "rep" blocking factor in the specification of a linear mixed effects model. To include blocking factors, or "random effects" as they are known in statistics, the lmer() function uses a slightly different style of code: within brackets we write 1|variablename, so in this case (1|rep). The rest of the model code is the same as before - this time we will store the result as an object called gen_lin_mod.

gen_lin_mod <- lmer(yield ~ treat + (1|rep), data = fallow)

Again - to get anything useful out of the model we need to use additional functions - we can start with summary().

summary(gen_lin_mod)

The output looks a little different after we summarise a mixed effects model, although most things remain very similar. The table of coefficients is the same as before and has the same interpretation. The additional output showing the correlation between fixed effects is not very useful for us here, as in this designed experiment all treatments are replicated the same number of times.

The residual error term is still presented, but in a different place - this value, 0.4843, can be found in the "Random effects" table. This sort of model breaks down the variance into that which can be explained by our design factors (the rep term) and that which is left over as residual variance. Comparing the amount of variance explained by each component can give us an understanding of how much our design factors are influencing our outcome variable. In this case the amount of variance explained by rep (standard deviation = 0.1175) is fairly small in relation to the residual variance (standard deviation = 0.4843).
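One way to make this comparison concrete (a sketch, assuming gen_lin_mod has been fitted as above) is to extract the variance components with VarCorr() from lme4 and express each as a proportion of the total:

```r
# Extract the variance components from the fitted model
vc <- as.data.frame(VarCorr(gen_lin_mod))

# Proportion of total variance attributable to rep and to the residual;
# note these are variances, not the standard deviations shown by summary()
vc$vcov / sum(vc$vcov)
```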

One thing which is missing, compared to the output of running summary() after lm(), is the R-squared value. This can be obtained using the function r.squaredGLMM(), which comes from the MuMIn library.

r.squaredGLMM(gen_lin_mod)

Two versions of R-squared are shown - the marginal R-squared (R2m), which shows only the variance explained by the "fixed effects", in this case the treatment term; and the conditional R-squared (R2c), which shows the effect of both the fixed and random effects, so includes both treat and rep. So we can see that 81% of the variability in yield is explained by the different treatments, and this increases to 82.3% if we include rep as well.

We can obtain an ANOVA table, as before, using anova(model).

anova(gen_lin_mod)

The ANOVA table only includes the fixed effects - showing us a highly significant p-value for treatment, providing evidence of some overall differences between our treatments. We could also test whether our random effects are significant using ranova().

ranova(gen_lin_mod)

This does not show a significant effect of the different reps we have in the data. However, because this variable describes the structure of our data, this is usually not something we are particularly interested in. With structural variables, we generally want to account for any impact they may have on the analysis, whether or not they could be considered statistically significant. But to help your understanding, note that this non-significance suggests that the results we got from the last section using the lm() function were probably somewhat reliable, despite not having taken into account the layout structure in blocks.

Check the model

We have many of the same assumptions to check with this model as well, although the functions here work a little differently to before.

Applied to an lmer model, the function plot() will only produce one plot: the residuals against the fitted values.

plot(gen_lin_mod)

We can check most of the key assumptions from this one plot. The residuals vs fitted plot is a scatter plot of the residuals on the y-axis against the fitted values on the x-axis, and the aim of this plot is to check the assumption of equal variance of the residuals across the range of fitted values. Since the residuals do not funnel out (to form a triangular/diamond shape), the assumption of equal variance is met.

We can also see that there are no extreme values in the residuals which might be causing problems with the validity of our conclusions (leverage).

To assess the assumption of approximate normality we can produce the same plot as before, by extracting the residuals and using the function qqnorm(). This shows us how closely the residuals follow a normal distribution - if there are severe and systematic deviations from the line then we may want to consider an alternative distribution. The most elegant way of doing this is to use pipes - first to obtain the residuals and then to pipe those residuals into the qqnorm() function.

gen_lin_mod %>%
  resid() %>%
  qqnorm()

In this case the residuals seem to fit the assumption required for normality.
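To make deviations from normality easier to judge, base R's qqline() can add a reference line through the quartiles of the same plot (a sketch, assuming gen_lin_mod has been fitted as above):

```r
# Q-Q plot of the residuals, with a reference line added
res <- resid(gen_lin_mod)
qqnorm(res)
qqline(res)
```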

Post-hoc tests

The functions we saw in the previous section - emmip(), emmeans() and cld() - all work in exactly the same way. All we need to change is the input model object, and we obtain the plots with confidence intervals and the mean separation analysis.

emmip(gen_lin_mod,~treat,CIs = TRUE)
emmeans(gen_lin_mod, ~treat) %>%
  cld(Letters=letters)

In the output, as before, groups sharing a letter in the .group column are not statistically different from each other.

Using the example from the previous section, see if you can write the code to make a plot showing the results from the lmer model, including the letter displays and the error bars.

You can see the data visualisation modules here for a reminder.

emmeans(gen_lin_mod, ~treat) %>%
  cld(Letters = letters) %>%
  data.frame() %>%
  ggplot(aes(y = emmean, x = treat, ymin = lower.CL, ymax = upper.CL, label = .group)) +
    geom_point() +
    geom_errorbar() +
    geom_text(nudge_x = -0.2)

Model choices

Ultimately our mixed model fitted using lmer, with a random effect to account for block, provided almost identical results to the standard linear model fitted using lm, as seen below.

summary(gen_lin_mod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: yield ~ treat + (1 | rep)
##    Data: fallow
## 
## REML criterion at convergence: 51.2
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.98578 -0.40272 -0.00098  0.47492  1.88450 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  rep      (Intercept) 0.01382  0.1175  
##  Residual             0.23458  0.4843  
## Number of obs: 36, groups:  rep, 4
## 
## Fixed effects:
##                   Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)         5.5475     0.2492 26.3478  22.262  < 2e-16 ***
## treat2 G.sepium    -1.7025     0.3425 24.0000  -4.971 4.47e-05 ***
## treat3 L.leuco     -1.8725     0.3425 24.0000  -5.468 1.28e-05 ***
## treat4 F.congesta  -2.0875     0.3425 24.0000  -6.095 2.70e-06 ***
## treat5 C.siamea    -3.4425     0.3425 24.0000 -10.052 4.44e-10 ***
## treat6 C.calo      -2.9600     0.3425 24.0000  -8.643 7.82e-09 ***
## treat7 groundnut   -2.4650     0.3425 24.0000  -7.198 1.94e-07 ***
## treat8 maize       -3.5725     0.3425 24.0000 -10.431 2.14e-10 ***
## treat9 nat fallow  -2.7375     0.3425 24.0000  -7.993 3.20e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) trt2G. trt3L. trt4F. trt5C. trt6C. trt7gr trt8mz
## treat2G.spm -0.687                                                 
## treat3L.lec -0.687  0.500                                          
## trt4F.cngst -0.687  0.500  0.500                                   
## treat5C.sim -0.687  0.500  0.500  0.500                            
## treat6C.cal -0.687  0.500  0.500  0.500  0.500                     
## tret7grndnt -0.687  0.500  0.500  0.500  0.500  0.500              
## treat8maize -0.687  0.500  0.500  0.500  0.500  0.500  0.500       
## tret9ntfllw -0.687  0.500  0.500  0.500  0.500  0.500  0.500  0.500
summary(anova_mod)
## 
## Call:
## lm(formula = yield ~ treat, data = fallow)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02250 -0.25000  0.02625  0.21125  0.99250 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.5475     0.2492  22.262  < 2e-16 ***
## treat2 G.sepium    -1.7025     0.3524  -4.831 4.80e-05 ***
## treat3 L.leuco     -1.8725     0.3524  -5.313 1.31e-05 ***
## treat4 F.congesta  -2.0875     0.3524  -5.923 2.59e-06 ***
## treat5 C.siamea    -3.4425     0.3524  -9.768 2.34e-10 ***
## treat6 C.calo      -2.9600     0.3524  -8.399 5.20e-09 ***
## treat7 groundnut   -2.4650     0.3524  -6.995 1.61e-07 ***
## treat8 maize       -3.5725     0.3524 -10.137 1.06e-10 ***
## treat9 nat fallow  -2.7375     0.3524  -7.768 2.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4984 on 27 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8047 
## F-statistic: 19.03 on 8 and 27 DF,  p-value: 3.144e-09

However, this is because the blocking in this experiment had very little impact on the results - it was an on-station trial with very controlled conditions and similar environments across the four blocks.

In increasingly complex situations - particularly those with farmers conducting multiple treatments, or with clusters of farmers in very different situations - very different results would be obtained from these two modelling approaches. Plots on one farmer's field will be similar to other plots on the same field, but likely extremely different to another farmer's plots.
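In lmer() syntax, this kind of clustering is handled by extending the random effects term. For example (a hypothetical sketch - the farm_trial data set and the village and farmer variables are illustrative, not part of this tutorial's data):

```r
library(lmerTest)

# Plots nested within farmers, farmers nested within villages:
# village/farmer expands to (1 | village) + (1 | village:farmer)
farm_mod <- lmer(yield ~ treat + (1 | village/farmer), data = farm_trial)
```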

In cases where very different results are found, it is usually worth trying to understand why! You will see an example of this in the exercises.

Methodological Principles

There are always many different ways of doing all that we have done here in R. If you look for guidance online about how to analyse particular designs you may find lots of different options for analysing "split plot" designs or "survey" data. It is always a good idea to ask for help when starting to analyse data if you are not sure how it should be analysed - however you analyse the data, make sure it is done in a way which lets you address the questions that are most relevant to you. This is the big advantage of the approach shown in this tutorial, using the lmer and lm functions - they are incredibly flexible and can be used in many different situations. For instance, we demonstrated fitting our model as a linear mixed effects model rather than a traditional ANOVA. Although the analysis in this instance was not affected too much by this change, the lmer model has a number of other advantages as well:

  1. They are very flexible, especially where we have repeated measures; for instance, you don't need to have the same number of observations per subject/treatment.
  2. They can account for a series of random effects. Not only are farms/farmers/plots… different from each other, but things within farms/plots… also differ. Not taking these sources of variation into account will lead to underestimating the uncertainty in our results.
  3. They allow for generalisation to non-normal data.
  4. Handling missing data: if the percentage of missing data is small, and the data that are missing are a random sample of the data set, the remaining observations can still be analysed with lmer (unlike other approaches that would do list-wise deletion).
  5. They take into account both the variation that is explained by the predictor variables of interest (the fixed effects) and the variation that is not explained by these predictors (the random effects).

Statistical Analysis: Part 6 - Generalized Linear Models