Generalised Linear Model
In a previous module, we showed you how to create a simple ANOVA
model using R. Using anova we created a model to assess
how yields varied amongst treatments within an experimental trial.
However, we have identified that the assumption that all of our
observations are independent may not be reasonable within our design.
This is very common in agricultural analyses, which often have specific layouts or involve working with multiple farmers - both of which invalidate the assumption that all observations are independent. In these sorts of cases we need to move from a “general linear model” to a mixed model, which can account for these structures. It is called a mixed model because it contains “fixed effects” (variables we are interested in) and “random effects” (variables outlining the structure or clustering of observations in the data). You can find some materials on mixed models in the following two links; we recommend the latter if you wish to see a more agriculture-focused example.
Mathematically this is quite a bit more complicated - but thankfully
in R it is not very different at all to move towards this model; you
will see many of the same functions as in the previous sections. We will
use the same libraries we have seen so far in other parts of this
module, but now also add the lmerTest library to be able to
fit a “linear mixed effects regression” (lmer()) model.
This approach is incredibly flexible and can let us take into account
any of the standard agricultural experimental designs as long as we are
able to code the random effects appropriately.
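As an illustrative sketch of how the random effects might be coded for some common layouts (note that the data frame mydata and its columns below are hypothetical examples, not objects from this tutorial):

```r
# Hypothetical sketches - 'mydata' and its columns are made-up examples

# RCBD: plots grouped into blocks
lmer(yield ~ treat + (1|block), data = mydata)

# Split-plot: main plots nested within blocks
lmer(yield ~ variety * irrigation + (1|block/mainplot), data = mydata)

# Farmer-managed trial: multiple plots per farmer, farmers within villages
lmer(yield ~ treat + (1|village/farmer), data = mydata)
```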
But it also is the exact same approach we need for data which comes
from less structured designs - from farm trials and surveys - where we
have different forms of clustering (farmers within the same village) or
pseudo-replication (the same farmer with multiple plots). It can be
tricky to work out exactly how to structure the code for your model, so
don’t be afraid to ask for help if you are unsure! There are resources
at the end of this document which cover many of the more common
agricultural designs - but the functions used will all be the same. This
even applies to cases where we start to use non-Normal models - such as
Poisson regression, for count data, or logistic regression, for binary
data. In these cases we again use more or less the same syntax, and all
the same functions, but now using the glmer() function,
which can account for these modifications to the type of variable we use
as a response.
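As a hedged sketch (assuming the striga variable really is a count, as described in the Data section below), a Poisson mixed model for this same design could be fitted with glmer():

```r
# Poisson mixed model for the striga counts - same model structure as for
# yield, but with a family argument to handle the count response
striga_mod <- glmer(striga ~ treat + (1|rep),
                    family = poisson, data = fallow)
summary(striga_mod)
```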
It does not matter at all if we have continuous predictor variables, or categorical predictor variables, or a mixture of the two. All these models are fitted and interpreted in R in more or less exactly the same way.
You can check the next two pages if you wish to remind yourself of the data and packages we have been using in this statistical analysis course.
Once done with this tutorial, you can move on to our second set of statistical analysis exercises.
Data
The data used in this example is from a study conducted in Eastern Zambia, with the main objective of improving the efficiency of natural fallows by using appropriate trees that may have relevance in soil fertility regeneration within permissible fallow periods.
The data has been read into R and assigned to an object called
fallow.
The design was a RCBD experiment (randomised complete block design)
with 4 blocks and 9 treatments. If you take a look at the data below you
will see that we have data concerning the measurement of the yield (the
column yield within the data) and the striga count
(striga) of a plot (the column plot), within a
block (the column rep), after a specific treatment (the
column treat).
Packages used
In this tutorial we will use five new packages: emmeans,
multcomp, multcompView, lmerTest and MuMIn; in
addition we will use some of the packages we have used regularly so far
in this course dplyr and ggplot2. Make sure
these are all installed and loaded before running any of the code on
your own computer.
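If any of these packages are not yet installed on your computer, they can be installed first using install.packages():

```r
# Install the packages used in this tutorial (only needs doing once)
install.packages(c("ggplot2", "dplyr", "lmerTest", "multcomp",
                   "multcompView", "emmeans", "MuMIn"))
```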
library(ggplot2)
library(dplyr)
library(lmerTest)
library(multcomp)
library(multcompView)
library(emmeans)
library(MuMIn)
Fitting the model
In this design, an RCBD (randomised complete block design), we have
one treatment factor, “treat”, and one layout factor “rep”. If we want
to take into account this repeated blocks layout structure, we can’t use
the “lm” function anymore and instead we need to include the “rep”
blocking factor into the specification of a linear mixed effect model.
Using the lmer() function to include a blocking factor, or
“random effect” as it is known in statistics, we use a slightly
different style of code. We use brackets, and within the brackets we
write 1|variablename, so in this case
(1|rep). The rest of the model code is the same as
before - this time we will store the model as an object called
gen_lin_mod.
gen_lin_mod <- lmer(yield ~ treat + (1|rep), data = fallow)
Again - to get anything useful out of the model we need to use
additional functions - we can start with summary().
summary(gen_lin_mod)
The output looks a little different after we summarise a mixed effects model, although most things remain very similar. The table of coefficients is the same as before and has the same interpretation. The additional output showing the correlation between fixed effects is not very useful for us here, as in this designed experiment all treatments are replicated the same number of times.
The residual error term is still presented but in a different place -
this value is 0.4843 which can be found in the “Random Effects” table.
This sort of model breaks down the variance into that which can be
explained by our design factors (the rep term) and that
which is left over as residual variance. Comparing the amount of the
variance explained by each component can be useful to give us an
understanding of how much our design factors are influencing our outcome
variable. In this case the amount of variance explained by the
rep (standard deviation = 0.1175) is fairly small in
relation to the residual variance (standard deviation = 0.4843).
One thing which is missing, when compared to the output of running
summary after an lm, is the R-squared value.
This can be obtained by using the function r.squaredGLMM
which comes from the MuMIn library.
r.squaredGLMM(gen_lin_mod)
Two versions of R-squared are shown. The marginal R-squared,
R2m, shows only the variance explained by the “fixed
effects” - in this case the treatment term. The conditional R-squared,
R2c, shows the effect of both the fixed and random effects -
so it includes both treat and rep. So we can see
that 81% of the variability in yield is explained by the different
treatments, and this increases to 82.3% if we include rep as well.
We can obtain an ANOVA table, as before, using
anova(model)
anova(gen_lin_mod)
The ANOVA table only includes the fixed effects - showing us a highly
significant p-value for treatment, providing evidence of some overall
differences between our treatments. We could also test whether our
random effects are significant using ranova.
ranova(gen_lin_mod)
This does not show us a significant effect of the different
reps we have in the data. However because this variable is
describing the structure of our data, this is usually not something we
are particularly interested in. With structural variables, we generally
want to account for any impact they may have on the analysis whether or
not they could be considered to be statistically significant. But to
help your understanding, note that this non-significance suggests
that the results we obtained in the last section using the “lm” function
were probably fairly reliable, despite not having taken into account
the layout structure in blocks.
Check the model
We have many of the same assumptions to check with this model as well, although the functions here work a little differently to before.
The function plot() will plot the residuals
from the model against the fitted values.
plot(gen_lin_mod)
We can check most of the key assumptions from this one plot. The residuals vs fitted plot is a scatter plot with the residuals on the y-axis and the fitted values on the x-axis; the aim of this plot is to check the assumption of equal variance of the residuals across the range of fitted values. Since the residuals do not funnel out (to form a triangular/diamond shape), the assumption of equal variance is met.
We can also see that there are no extreme values in the residuals which might potentially be causing problems with the validity of our conclusions (leverage).
To assess the assumption of approximate normality we can produce the
same plot as before if we extract the residuals and use the function
qqnorm. This shows us how closely the residuals follow a
normal distribution - if there are severe and systematic deviations from
the line then we may want to consider an alternative distribution. The
most elegant way of doing this is to use pipes - first obtaining the
residuals and then piping those residuals into the qqnorm
function.
gen_lin_mod %>%
resid() %>%
qqnorm()
In this case the residuals seem to fit the assumption required for normality.
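If you would like a numerical check to sit alongside the plot, one optional supplement (not used elsewhere in this course) is a Shapiro-Wilk test on the residuals:

```r
# Optional extra check: Shapiro-Wilk test on the model residuals.
# A large p-value indicates no evidence against normality.
shapiro.test(resid(gen_lin_mod))
```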
Post-hoc tests
The functions we saw in the previous section - emmip,
emmeans and cld - all work in exactly the same
way. All we need to change is the input model object, and we obtain the
plots with confidence intervals and the mean separation analysis.
emmip(gen_lin_mod,~treat,CIs = TRUE)
emmeans(gen_lin_mod, ~treat) %>%
cld(Letters=letters)
In the output, groups sharing a letter in the .group column are not significantly different from each other, as before.
Using the example from the previous section, see if you can write the
code to make a plot showing the results from the lmer model,
including the letter displays and the error bars.
You can see the data visualisation modules here for a reminder.
emmeans(gen_lin_mod, ~treat) %>%
cld(Letters=letters) %>%
data.frame() %>%
ggplot(aes(y=emmean,x=treat,ymin=lower.CL,ymax=upper.CL,label=.group))+
geom_point()+
geom_errorbar()+
geom_text(nudge_x=-0.2)
Model choices
Ultimately our model fitted as a linear mixed model using
lmer, with a random effect to account for block,
provided almost identical results to the standard linear model fitted
using lm, as seen below.
summary(gen_lin_mod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: yield ~ treat + (1 | rep)
## Data: fallow
##
## REML criterion at convergence: 51.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.98578 -0.40272 -0.00098 0.47492 1.88450
##
## Random effects:
## Groups Name Variance Std.Dev.
## rep (Intercept) 0.01382 0.1175
## Residual 0.23458 0.4843
## Number of obs: 36, groups: rep, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 5.5475 0.2492 26.3478 22.262 < 2e-16 ***
## treat2 G.sepium -1.7025 0.3425 24.0000 -4.971 4.47e-05 ***
## treat3 L.leuco -1.8725 0.3425 24.0000 -5.468 1.28e-05 ***
## treat4 F.congesta -2.0875 0.3425 24.0000 -6.095 2.70e-06 ***
## treat5 C.siamea -3.4425 0.3425 24.0000 -10.052 4.44e-10 ***
## treat6 C.calo -2.9600 0.3425 24.0000 -8.643 7.82e-09 ***
## treat7 groundnut -2.4650 0.3425 24.0000 -7.198 1.94e-07 ***
## treat8 maize -3.5725 0.3425 24.0000 -10.431 2.14e-10 ***
## treat9 nat fallow -2.7375 0.3425 24.0000 -7.993 3.20e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) trt2G. trt3L. trt4F. trt5C. trt6C. trt7gr trt8mz
## treat2G.spm -0.687
## treat3L.lec -0.687 0.500
## trt4F.cngst -0.687 0.500 0.500
## treat5C.sim -0.687 0.500 0.500 0.500
## treat6C.cal -0.687 0.500 0.500 0.500 0.500
## tret7grndnt -0.687 0.500 0.500 0.500 0.500 0.500
## treat8maize -0.687 0.500 0.500 0.500 0.500 0.500 0.500
## tret9ntfllw -0.687 0.500 0.500 0.500 0.500 0.500 0.500 0.500
summary(anova_mod)
##
## Call:
## lm(formula = yield ~ treat, data = fallow)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02250 -0.25000 0.02625 0.21125 0.99250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5475 0.2492 22.262 < 2e-16 ***
## treat2 G.sepium -1.7025 0.3524 -4.831 4.80e-05 ***
## treat3 L.leuco -1.8725 0.3524 -5.313 1.31e-05 ***
## treat4 F.congesta -2.0875 0.3524 -5.923 2.59e-06 ***
## treat5 C.siamea -3.4425 0.3524 -9.768 2.34e-10 ***
## treat6 C.calo -2.9600 0.3524 -8.399 5.20e-09 ***
## treat7 groundnut -2.4650 0.3524 -6.995 1.61e-07 ***
## treat8 maize -3.5725 0.3524 -10.137 1.06e-10 ***
## treat9 nat fallow -2.7375 0.3524 -7.768 2.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4984 on 27 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8047
## F-statistic: 19.03 on 8 and 27 DF, p-value: 3.144e-09
However, this is because the blocking in this experiment had very little impact on the results - it was an on-station trial with very controlled conditions and similar environments across the four blocks.
In increasingly complex situations, particularly those with farmers conducting multiple treatments or with clusters of farmers in very different situations, very different results would be obtained from these two modelling approaches. Plots on one farmer’s field will be similar to other plots on the same field, but likely very different to another farmer’s plots.
In cases where very different results are found then it is usually worth trying to understand why! You will see an example of this in the exercises.
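One quick way to compare the two approaches directly is to place the estimated treatment effects and standard errors side by side; a sketch, assuming the anova_mod object from the previous section is still available:

```r
# Estimates and standard errors from the two models - the estimates match,
# but the standard errors differ slightly
coef(summary(anova_mod))[, 1:2]    # lm fit
coef(summary(gen_lin_mod))[, 1:2]  # lmer fit
```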
Methodological Principles
There are always many different ways of doing all that we have done
here in R. If you look for guidance online about how to analyse
particular designs you may find lots of different options for analysing
“split plot” designs or “survey” data. It is always a good idea to ask
for help when starting to analyse data if you are not sure how it
should be analysed - however you analyse the data, make sure it is done
in a way which lets you address the questions that are most relevant to
you. This is the big advantage of the approach shown in this tutorial
using the lmer and lm functions - they are
incredibly flexible and can be used in many different situations. For
instance, we demonstrated fitting our model as a linear mixed effect
model rather than a traditional ANOVA. Although the analysis in this
instance was not affected too much by this change, the
lmer model has a number of other advantages as well:
- They are very flexible, especially where we have repeated measures; for instance, you don’t need to have the same number of observations per subject/treatment.
- They can account for a series of random effects. Not only are farms/farmers/plots different from each other, but things within farms/plots also differ. Not taking these sources of variation into account will lead to standard errors which are too small, overstating the precision of our results.
- They allow for generalisation to non-normal data.
- They handle missing data: if the percentage of missing data is small, and the missing data are a random sample of the data set, the remaining observations can still be analysed with lmer (unlike other approaches which would do list-wise deletion).
- They separate the variation that is explained by the predictor variables of interest (fixed effects) from the variation that is not explained by these predictors (random effects).