
Introduction

Within this tutorial we will first look at the basics of statistical modelling in R with a straightforward simple linear regression.

In subsequent workbooks you will be able to explore extensions into other common modelling approaches often used in agricultural analysis, including:

  1. An Analysis of Variance (ANOVA) model
  2. Multiple Linear Regression
  3. Modelling non-continuous data (logistic)
  4. Extending the ANOVA model to a generalized linear model

You will quickly see that the code required in R for all of these modelling approaches is very similar. In fact, all of these models are part of the same general modelling framework - this extends to nearly all of the common models that you could ever want to fit.

In the video above I explain more generally about models, beyond just the simple straight line linear regression model, which is nice and has lots of useful applications.

But in many cases it won't be sufficient to build a comprehensive model to understand our questions and data in full. Although the statistics underlying more advanced methodologies can become increasingly complicated, the R code to produce these models does not increase in complexity to anywhere near the same extent! This means that with R we should be able to learn how to analyse our data according to our objectives, rather than using approaches simply because we have already been taught them or because they are 'easy'.

It is important to match our questions, and our data, to the model that we are using. A model is useless if it does not let us answer the questions we have, or if it is not appropriate for the data that we have. So do not be limited to models you 'know' or have been taught about - try to find the right sort of model for your problem. If you are stuck with this process, reach out to someone for advice! The nice thing is that, with R, actually fitting the model is often very straightforward once you know what you are looking for.

Data

The data used in this example comes from a study conducted in Eastern Zambia, with the main objective of improving the efficiency of natural fallows by using appropriate trees that may have relevance in soil fertility regeneration within permissible fallow periods.

The data has been read into R and assigned to an object called fallow.

The design was a randomised complete block design (RCBD) experiment with 4 blocks and 9 treatments. If you take a look at the data below you will see that we have measurements of the yield (the column yield within the data) and the striga count (striga) of a plot (the column plot), within a block (the column rep), after a specific treatment (the column treat).

Packages used

Across these next few tutorials we will use four new packages: emmeans, multcomp, lmerTest and MuMIn. In addition we will use some of the packages we have used regularly in other available modules: dplyr (for data manipulation) and ggplot2 (for data visualisation).

Linear regression model

The first type of model we are going to investigate is a simple linear regression model, exploring whether there is a relationship between yield and striga from our data.

We are assuming that you have come across the topic of linear models before. Even if you haven't, there is a video from Crash Course Statistics which provides a nice refresher of the methodology.

Remember the first thing we should do is to explore our data and look at some summary statistics and plots!

Exploratory data analysis

When considering summary statistics there are two things we could do. Firstly we could calculate a correlation, using the cor function. To use this function we need to specify the two variables with which we want to calculate the correlation.

cor(fallow$yield,fallow$striga)

A value of -0.339 suggests a small-to-moderate negative correlation between these variables. But correlation is not an especially informative value on its own - fitting a linear model provides us with the same, and much more, information!
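If you do want a formal test attached to the correlation itself, base R's cor.test() function provides one. A minimal sketch is below - the data is simulated as a stand-in for fallow, so with the real data you would simply call cor.test(fallow$yield, fallow$striga).

```r
# cor.test() returns the correlation estimate along with a test of
# the null hypothesis that the true correlation is zero
# (simulated stand-in data - substitute the real fallow columns)
set.seed(42)
striga <- rpois(36, lambda = 100)
yield  <- 3.4 - 0.0006 * striga + rnorm(36, sd = 0.5)
cor.test(yield, striga)
```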

We could also calculate summary statistics of yield based on categorising the striga variable and then calculating means within these groups. We could use mutate to create a new variable, and within mutate() use the cut() function to break striga into multiple groups. Using this new variable we can then group_by() and summarise() the yield values. The cut function requires us to set the break points for our new categories - note that the lower limit is exclusive by default, so we need to make sure we include the option include.lowest=TRUE to make sure the 0 striga counts are included in this group.

To learn more about data manipulation, please refer to our guide here

fallow %>%
  mutate(striga_grp=cut(striga,breaks=c(0,25,250,9999),include.lowest = TRUE))%>%
    group_by(striga_grp) %>%
      summarise(mean(yield),n())

Exactly where to put the break points in cut, and how many groups to choose, is a little arbitrary. I chose some fairly round numbers and went with three new categories - from 0 to 25 striga counts, from 26 to 250, and then more than 250. As we move through the three categories we can see a reduction in the mean yield - from 3.6 t/ha in the lowest striga count group to 2.88 t/ha in the highest striga count group.

Plots

Let's look at the 'obvious' plot when considering this sort of relationship between two continuous variables - a scatter plot of striga against yield. We will put yield on the y axis and striga on the x axis. This is because we are considering yield to be our response variable and striga to be our predictor variable. We think it is more likely that an increased striga infestation will result in reduced yields rather than the other way around (although there is some debate over this!).

See if you can write the code to produce the plot below using the appropriate geom from ggplot2. Feel free to check the solution if you have not taken the modules on data visualisation or are unfamiliar with ggplot2.

ggplot(data=fallow,aes(y=yield,x=striga))+
  geom_point()

To help us understand this relationship better we can also add an additional layer using geom_smooth, which will fit a trend line. By default this fits a smooth curve (a loess smoother for a dataset of this size), which is similar in idea to a moving average. However, we can also use the function to provide the simple linear regression model of a straight line by using the option method="lm".

ggplot(data=fallow,aes(y=yield,x=striga))+
  geom_point()+
    geom_smooth(method="lm")

It looks like a straight line is probably a reasonably sensible way of fitting the relationship here. We do have lots of points with low striga counts - but as we increase the striga count the trend in yield seems roughly linear and decreasing. While most observations have low striga counts, and among those observations the variability in yield is high, those few observations with higher striga counts all seem to have relatively low yields.

Fitting the model

To examine more about the model that we have plotted using geom_smooth we need to formally fit the model in R. We do this using the lm() command. Note the function we are using is written with the letter l, not the number 1. "lm" stands for "linear model". The syntax is very similar to what we saw in the previous module for t-tests: a response variable, a tilde, then the explanatory variable, followed by the name of the dataset.

lm(yield~striga,data=fallow)

The output here only tells us two things:

  • "Call" - simply repeating back the model we have specified
  • "Coefficients" - telling us the values of the parameters

A linear regression follows the equation of a straight line \(y = \beta_0 + \beta_1 x\). You may have learnt this same equation as \(y = a + bx\) or \(y = mx + c\), depending on where and when you went to school. There is then an additional term, \(\epsilon\), which represents the residual variability in this relationship. If all the points were to sit perfectly on the line then there would be no need for an \(\epsilon\) term, but in reality we always have underlying variability that needs to be accounted for.

The coefficients give us the value of our intercept (\(\beta_0\)): 3.43, and the value of our slope (\(\beta_1\)): -0.00059.

So the overall model would be:

yield = 3.43 - 0.00059 * striga count + \(\epsilon\)

The value of the intercept is interpreted as the expected value of the response variable where all explanatory variables are equal to zero. So for a plot with no striga, we would expect to see an average yield of 3.43 t/ha.

The value of the slope represents the change we expect to see in the response variable for a one unit increase in the explanatory variable. So for each additional striga plant observed, we expect an average reduction in the yield of 0.00059 t/ha.
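One way to see the intercept and slope working together is to ask the fitted model for predictions with predict(). The sketch below simulates a hypothetical stand-in dataset, since fallow is only available inside the tutorial; with the real data you would call predict(linreg_mod, ...) directly.

```r
# simulate a stand-in for the fallow data (hypothetical values)
set.seed(1)
sim <- data.frame(striga = rpois(36, lambda = 100))
sim$yield <- 3.43 - 0.00059 * sim$striga + rnorm(36, sd = 0.5)
mod <- lm(yield ~ striga, data = sim)

# expected yield at striga counts of 0, 100 and 500
predict(mod, newdata = data.frame(striga = c(0, 100, 500)))

# which is exactly intercept + slope * striga:
coef(mod)[1] + coef(mod)[2] * c(0, 100, 500)
```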

The output at this stage has not told us anything about the error term, \(\epsilon\).

However we can get much more information from our model than this! R is a little different from many other software packages, which will produce a lot of different outputs when you create a model. In R you have to ask specifically for what you want out of a model.

But first we usually need to save the model to an object. I am choosing to give it the name linreg_mod.

linreg_mod<-lm(yield~striga,data=fallow)

Summarising the model

summary() provides us with a lot of useful information in a customised output format containing model fit statistics, standard errors and p-values for the coefficients.

summary(linreg_mod)

One of the outputs from summary() allows us to complete the full linear regression model, as it provides the value of sigma, also known as the residual standard error. Within our error term, \(\epsilon\), this is the standard deviation of the residual values. Because we are doing a simple linear model we assume our residuals are normally distributed with a mean of 0. We will check if that assumption makes sense in a short while.
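If you want that value on its own, rather than reading it from the printed output, the sigma() helper extracts it directly. This is sketched on a simulated stand-in model; with the real model it would be sigma(linreg_mod).

```r
# fit a model on simulated stand-in data (hypothetical values)
set.seed(1)
sim <- data.frame(striga = rpois(36, lambda = 100))
sim$yield <- 3.43 - 0.00059 * sim$striga + rnorm(36, sd = 0.5)
mod <- lm(yield ~ striga, data = sim)

sigma(mod)           # the residual standard error
summary(mod)$sigma   # the same value, stored in the summary object
```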

Overall the information tells us that the relationship between striga count and yield is significant, p=0.043.

The intercept is also significant, but this is almost certainly not of interest. The null hypothesis which generates this p-value is that the intercept is equal to 0 - in other words, that the expected yield when there is no striga is 0 t/ha. It is not really surprising to see very strong evidence against this nonsensical null hypothesis!

If we look back at the output of summary() we can also see that 11.5% of the variability in yield can be explained by the linear relationship with striga (the Multiple R-squared). You can read a little bit about r-squared, and why the adjusted r-squared is also a useful metric to consider, here.
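The r-squared values can also be pulled out of the summary object by name, which is handy if you want to use them in later code. Again this is sketched on simulated stand-in data; with the real model it would be summary(linreg_mod)$r.squared.

```r
# fit a model on simulated stand-in data (hypothetical values)
set.seed(1)
sim <- data.frame(striga = rpois(36, lambda = 100))
sim$yield <- 3.43 - 0.00059 * sim$striga + rnorm(36, sd = 0.5)
mod <- lm(yield ~ striga, data = sim)

summary(mod)$r.squared       # Multiple R-squared
summary(mod)$adj.r.squared   # Adjusted R-squared
```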

Like with the t.test function, the data argument comes after the formula, so if we were to pipe data into lm() we would need to specify data=. within the lm function. We can also pipe out of the model into later functions, which can be pretty useful. However, we usually want to save the model to an object instead: fitting the model can take a little time with a large dataset or a complicated model, and we often want to do lots of different things with the model rather than work in one continuous pipe. If everything sits in a single pipe, the model has to be recomputed every time we run the code; a stored object is much faster.

fallow %>%
  lm(yield~striga,data=.) %>%
    summary()

Checking Model

We should also check our residual plots to assess model validity. It is worth recapping, or learning, how to interpret these plots, as it can take some practice to know what you are looking for. "Statistics by Jim" has a nice overview: https://statisticsbyjim.com/regression/ols-linear-regression-assumptions/

plot(linreg_mod)

These plots allow us to check a number of different key assumptions: the linearity of the trend (is there any pattern in the residuals vs. fitted plot?); the homogeneity of variance (is there any pattern in the scale-location plot?); the approximate normality of the residuals (do the points in the normal Q-Q plot roughly follow the line?); and the existence of outlying or high leverage points (are any standardised residuals >3, or are any points outside the funnel in the residuals vs. leverage plot?).

In this case all four of those checks seem to be satisfactory.
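By default plot() draws the four diagnostic plots one at a time. If you prefer to see them all together, a common trick is to set a 2x2 plotting grid with par() first. This is sketched on a simulated stand-in model; with the real model it would be plot(linreg_mod).

```r
# fit a model on simulated stand-in data (hypothetical values)
set.seed(1)
sim <- data.frame(striga = rpois(36, lambda = 100))
sim$yield <- 3.43 - 0.00059 * sim$striga + rnorm(36, sd = 0.5)
mod <- lm(yield ~ striga, data = sim)

op <- par(mfrow = c(2, 2))   # 2x2 grid, saving the old settings
plot(mod)                    # all four diagnostic plots at once
par(op)                      # restore the previous settings
```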

However, there is a much more important consideration which cannot be assessed from these plots: are my points all independent? In this case the answer is certainly "No", which makes the model invalid. This is because in producing the model here we have completely ignored the structure of the experiment - we have not considered the different treatments (which form one source of dependence) and we have not considered the blocks used in this experiment (which form another source of dependence).

So the model we have produced is not going to be especially valid for a comparison of yield and striga counts! In upcoming tutorials we will look at improving this model. But the code we have used will be useful for when you do have to produce linear regressions!

You can move onto the next tutorial to have a first look at an ANOVA model.

Statistical Analysis: Part 2 - Simple Linear Regression