Statistical Analysis: Part 4 - Multiple Linear Regression

Introduction

In another tutorial we had a first look at an ANVOA model. This tutorial will build upon those lessons as well as extending simple linear regression into multiple linear regression.

Multiple linear regression allows us to extend our models to include multiple explanatory variables, both continuous and categorical. The coding building blocks are all the same with simple additions that allow us to inspect our data with a wider scope.

Dataset

We are first going to use a dataset which has been modified from a larger dataset available on Kaggle, a data science repository with datasets, courses and forums.

The data in particular we are using comes from a survey of over 9,500 households across 11 African countries (however for data purposes we have cut this down to 3). The countries included in our analysis dataset are Burkina Faso, Niger and Senegal totalling 1117 households. This survey was conducted across the growing seasons from 2002 to 2004 in an attempt to learn more about the resilience and adaptability of African farming households to short and long term climate changes and extreme weather events.

This is quite a large dataset so we have selected only a few columns to help with this demonstration.

We will be using yield as our response variable, this is the expected millet yield in an average year as reported by the farmers. Along with this we have their country, the total number of plots they use and the primary farming system utilised by the farmer. This farming_system variable concerns the fallowing practice of the farmer and is defined as either no fallow, short fallow or long fallow.

Column	Description
hhcode	Household ID Code
country	Country (Burkina Faso, Niger, Senegal)
yield	Yield in KG/HA
nplots	Number of Plots
farming_system	Primary Farming System (No Fallow, Short Fallow, Long Fallow)

Feel free to use the table below to explore the data more closely.

Exploratory Analysis

First let's start by conducting some exploratory analysis. Now were are interested in the impact on yield that a few different variables in our data may be asserting. Our key variable of interest is farming system. This variable details the farmers preferred method of fallow (none, short or long). Farming practices are likely to have an important role in determining yield of a crop so it is key to our research objectives.

But we have decided we also need to consider other variables as this is unlikely to be the sole determinant of a farmer's yield. So we have also kept variables such as the country of the farm and the number of plots the farmer totally owns.

Farming System

First, lets look at farming system on its own.

ggplot(millet_yields, aes(x = farming_system, y = yield))+
  geom_boxplot()

Would seem that short fallow results in the highest yields on average, followed by long fallow and then no fallow at all is the lowest. But we should calculate some summary statistics to accompany this plot

Question: Calculate means and standard deviations for farming systems

millet_yields%>%
  group_by(farming_system)???

millet_yields%>%
  group_by(farming_system)%>%
  summarise(mean = mean(yield),
            sd = sd(yield))

This further shows that short fallow is related to highest yields, averaging a yield roughly 180kg/ha higher than no fallow and 160kg/ha higher than long fallow.

However, we are dealing with data across 3 different countries and location can have a BIG role in determining outcomes such as yield. On top of this the rates of fallow systems are unlikely to be consistent either, e.g. Senegalese farmers may be more likely to practice long fallow than their counterparts in Niger or Burkina Faso.

Question: Produce a boxplot of yield by farming system, splitting them on country using facet_wrap

ggplot(millet_yields, aes(x = farming_system, y = yield))+
  geom_boxplot()+
  facet_wrap(~country)

So there seems to be quite some differences across countries. While Burkina Faso seems to maintain the overall pattern we saw. In Niger, long fallow is actually lower than no fallow and the gap up to short fallow is much smaller.

Now as mentioned, the rates of fallow are likely very different in each country too and this simple fact may be related to the patterns as we observed.

millet_yields%>%
  group_by(country,farming_system)%>%
  summarise(mean = mean(yield),
            sd = sd(yield),
            n = n())

Let's plot those frequencies to maybe get a clearer picture

millet_yields%>%
  ggplot(aes(x = farming_system))+
  geom_bar()+
  facet_wrap(~country)

So the frequencies of the farming systems are seemingly heavily related to country. No fallow is incredibly common in Niger and is also the most frequent category in Senegal though to a lesser extreme. While in Burkina Faso, short fallow is most common.

Country

As already alluded to, location is likely to have a significant impact on yields due to many different reasons, climate, resources etc.

Therefore we should take a closer look at how this varies with a boxplot.

ggplot(millet_yields, aes(x = country, y = yield))+
  geom_boxplot()

Senegal and Burkina Faso actually have very similar variabilities and medians it would appear. Niger however has a far lower median than both other countries. Also the 75th percentile (top of the box), is lower than the 25th percentiles (bottom of the boxes) of both Burkina Faso and Senegal.

Exercise: Calculate the mean and standard deviations of yield by country

millet_yields%>%
  group_by(country)%>%
  summarise(mean = mean(yield),
            sd = sd(yield))

So this fairly matches up with what we may have expected based on our boxplot. Burkina Faso and Senegal are near identical while Niger has a far lower average yield but much higher variability.

Niger being so low may also help us explain what we were seeing with our fallow practices. If you remember from the first boxplot, short fallow had the highest average yield but we also saw that short fallow was a rare practice in Niger. In other words, short fallow has highest yields overall – but short fallow is rare in Niger, which has low yields – so a large part of the reason why the short fallow mean is so much higher is because there is less data from Niger.

These are the sort of relationships we can model when using multiple linear regression as we can account for multiple influences that may or may not be independent of one another. If we were to use numerous simple linear regression model, than the variables can only be treated as being independent of one another.

Number of plots

One such variable is the number of plots the farmer has, it's possible that this could impact farmers in different ways. For instance, more plots may result on greater strain on available resources (i.e. use less fertiliser as it needs to cover a greater area), or conversely it could mean availability of more resources due to the financial impact of being a larger farm.

ggplot(millet_yields, aes(x = nplots, y = yield))+
  geom_jitter()

There doesn't seem to be much of a consistent pattern based on this plot. However, note that a jitter plot may distort the patterns slightly, we have to use a jitter plot here however due to nplots only taking values of 1 to 4.

we can try to make a little bit more sense of this graph by adding a trend line using geom_smooth using the method = 'lm' argument. This will fit a simple linear regression line using our two variables.

ggplot(millet_yields, aes(x = nplots, y = yield))+
  geom_jitter()+
    geom_smooth(method = 'lm')

So hidden within our plot is a slight positive trend in which yield increases as the number of plots owned increases.

Perhaps, we can see something else if we were to look at the patterns within countries. When we come to modelling we will be using both variables, so it makes sense to see how these variables all interact with one another.

Exercise: Take the previous code and add split the plots by country as well as a linear regression line

ggplot(millet_yields, aes(x = nplots, y = yield))+
  geom_jitter()

ggplot(millet_yields, aes(x = nplots, y = yield))+
  geom_jitter()+
  facet_wrap(~country)+
  geom_smooth(method = 'lm')

So we have quite a different pattern emerging within as opposed to across countries. In Burkina Faso, there is a clear downward trend instead. while there is virtually a straight line relationship in Senegal and Niger. How can this be?

Well if once again we take a look at those the number of observations in each grouping. We can see that those farmers with 4 plots were predominately based in Senegal (148) instead of Burkina Faso (30) and Niger (22). Yields are generally higher in Senegal so the country composition of our plot groups was dragging that line upwards at the 4 plot mark.

millet_yields%>%
  group_by(country,nplots)%>%
  summarise(n = n())

Now we have finished exploring we can move onto creating our model.

Fitting the Model

Let's build our model step by step. Starting with the nplots variable.

You should be familiar with this code set up if you followed the previous tutorial. We use the lm or "linear model" function using our outcome and explanatory variable separated by a tilde ~. Then specify our dataset.

model1 <- lm(yield ~ nplots , data = millet_yields)

When saving the model, you will see no output. Rather we can then use summary to take a look at the details of our model.

summary(model1)

So number of plots by itself does have a significant effect on yield. For every additional plot, we can expect an average increase in millet yield of 40.5 kg/ha. There are other statistics available in our summary such as our R-squared which is 0.037, this is interpreted as our model (nplots only) explaining 3.7% of the variation in yield.

We can ignore the intercept in this case as this is not theoretically useful (the average yield when you have 0 plots?)

This trend however is of course what we saw in our exploratory analysis, remember that geom_smooth line that showed a positive relationship. Well this is essentially the model output for that line. We know from our additional analysis that this trend is largely driven by Senegal however, so let's add country to our model next and see what happens to the term for nplots

Adding Categorical Variables

When adding in further variables we simply add in them with addition signs +. This is regardless of the variable type.

model2 <- lm(yield ~ nplots + country , data = millet_yields)

summary(model2)

When looking at the summary you will see a term for each level of the categorical variable, except for the reference level. R will automatically pick the first group alphabetically as the reference group. This can be changed prior to modelling by using functions such as factor or relevel to specify a specific order to your categories or a specific reference level. For now, we will keep Burkina Faso as our reference group.

The reference level is the group to which the other groups are being compared. In other words, our estimates for Niger and Senegal are the relative differences compared to Burkina Faso. Based on our model we would expect yields in Niger to be 269 kg/ha lower than in Burkina Faso, when holding number of plots to be the same. For Senegal we would expect them to be 4 kg/ha higher though this is not a statistically significant difference.

Something you will notice is that the term for number of plots term has actually become negative compared to the first model, reflecting what we saw in the exploratory analysis, whereby we only really saw a positive trend when the data was for all countries. Once we split the data by country, this trend was gone. It is also now non-significant. Suggesting once we account for country, the number of plots is no longer substantive in explaining variation in yield.

Note that we have increased our R-squared substantially up to 33.5%

Finally, let's add farming system in the same manner. BUT, first actually I think we should use a better reference level than what R is going to pick. Because it will go alphabetically, 'Long Fallow' will be the default. This is not intrinsically a bad thing, but logically it would make more sense for 'No Fallow' to be the reference. We are then logically comparing the additional levels of fallow to its absence.

We can do this using factor to reorder these groups. To do this we use the levels = argument and supply this with a list using c() of the new order of all the levels.

millet_yields <- millet_yields%>%
  mutate(farming_system = factor(farming_system, levels = c("No Fallow", "Short Fallow", "Long Fallow")))

You will not see any output as usual when making this change, you could use a function such as levels(millet_yields$farming_system to check. This will print out the levels of the factor variable, in order.

If your variable has far more levels than just the three we have here, you can use relevel to explicitly just make one group the reference.

Now we have that sorted, let's add this term to the model.

model_full <- lm(yield ~ nplots + country + farming_system , data = millet_yields)

summary(model_full)

As it was the first group alphabetically long fallow has been chosen as our reference level. When having already accounted for country and number of plots, long fallow is actually 32 kg/ha lower than no fallow on average. This is the reverse of what we saw when looking at this variable in isolation. However, though approaching statistical significance, it does not quite reach this point.

To consider why this might be the case, remember back to our exploratory analysis that no fallow is far more common in Niger where yields are consistently much lower than in Senegal or Burkina Faso. Therefore, the yields of the no fallow group are being weighed down due to the country composition. Once we separate out that country effect, a new pattern emerges. This is a good example of why we use multiple linear regression rather than various simple linear regressions, the latter cannot account for the ways in which variables work with one another and therefore can supply misleading results.

Short fallow on the other hand is associated with a yield 59 kg/ha higher on average, holding country and number of plots constant. This is a statistically significant term to our model.

We can go one step further however, when looking at our summaries we considered how farming system and number of plots may vary in their patterns across countries. We can check this in a model using interactions.

Adding Interactions

Interactions help us to allow the effect of one variable to differ according to the values of another. The two variables become intertwined within our model. These can be very useful to statistical modelling to help out draw patterns that would otherwise be missed if we assumed total independence of our effects.

If we think back to our exploratory analysis we saw a few different connections between our variables at play. We saw how the number of plots had a negative relationship with yield in Burkina Faso, while there was an approximately straight line in Senegal and Niger. While short fallow seemed to be related to higher yields in Burkina Faso and Niger but there was virtually no variation by fallow in Senegal. The patterns are rarely consistent and we need to consider this.

We can fit an interaction by replacing the + with an asterisk *. Let's first fit an interaction between country and the number of plots.

interaction_model_1 <- lm(yield ~ country * nplots +  farming_system , data = millet_yields)

summary(interaction_model_1)

We now have an additional two terms within our model summary. One for each possible interaction combination (N plots X Niger and N plots X Senegal). At this point our interpretations become a little more complex as we have to start considering how our coefficients work with each other.

For instance, in our reference group Burkina Faso, we focus on the main effect of nplots. So in Burkina Faso, yields decrease by 28 kg/ha for every additional plot. while in Niger, we add the additional impact of the interaction so for every additional plot in Niger there is a "-28.32 + 25.49" change in yield, or in other words a 2.83 kg/ha decrease in yield. Finally, in Senegal we have a "-28.32 + 28.60" or 0.28 kg/ha increase in yield for each additional plot. All of these comparisons are assuming the same farming system. We will see how we can visualise this pattern towards the end of the tutorial.

Alternatively let's fit a categorical X categorical interaction between numbers of plots and country. We can do this using the same basic code just swapping where we put the asterisk.

interaction_model_2 <- lm(yield ~ nplots + country * farming_system , data = millet_yields)

summary(interaction_model_2)

We now see an additional 4 terms, one each for the possible combinations of fallow and country excluding our reference levels of Burkina Faso and long fallow as these are represented by our intercept.

So now, when holding the number of plots at a constant, farmers in Niger using long fallow have a yield "-208.483 - 17.693 - 39.543" higher than a farmer in Burkina Faso using no fallow, or 265.72kg/ha lower on average. We can do the same for Senegalese farmers using short fallow, they have a yield "51.814 + 112.041 - 88.554" or 75.3kg/ha higher on average than farmers from Burkina Faso using no fallow.

We can go further and include both interactions as well, though we want to be careful as the more interaction added, the more complex the model. We should always aim for a "parsimonious" model, one that is useful and informative without being unnecessarily complicated.

We can do this by separating our two interactions with a + but we will need to write the variable we are using in both twice. If we had taken the previous code and instead wrote nplots * country * farming_system we would have fitted a three-way interaction which, while potentially useful to explore, would greatly add to the complexity. There will be cases where such an interaction may be appropriate but for now, we can leave this aside.

interaction_model_3 <- lm(yield ~ country * nplots + country * farming_system , data = millet_yields)

summary(interaction_model_3)

We now have a model with both interaction terms, again we would need to work out the maths in order to make specific comparisons. For now let's focus on the significance of these interactions. Only one term of our interactions is significant (Short fallow in Senegal), so we know this is an important term and suggests this interaction should be in our model.

However, this does not tell us if the interactions are significant collectively, likewise even our categorical variables fail to do the same. We can see that Niger is significant, but this doesn't tell us if country is significant as a whole. We can though use some techniques to check the model a little further.

Testing the interactions

In order to test for the significance of our interactions and categorical variables, we can actually use the anova function that we have used in other tutorials. This will output an analysis of variance table using our model as an input, detailing the significance of each variable.

anova(interaction_model_3)

So according to our ANOVA table, country, farming system and their interaction are all statistically significant additions to our model. However, the interaction between country and number of plots is not.

anova is a very useful function as we can actually supply it with two models to test if one is significantly better than the other using an F-test. We need to make sure that the first model is "nested" within the second model. In other words, every term in model 1 is also in model 2. Thereby we are testing if the additional terms in model 2 are explaining a significantly greater amount of variation than model 1.

anova(interaction_model_2,interaction_model_3)

We do not have a statistically significant result, suggesting the best model would actually be interaction_model2 which drops the nplots * country interaction.

Checking the model

So finally, let's check our assumptions using the second interaction model.

plot(interaction_model_2)

In our first plot we are plotting the residuals against the fitted values as they are calculated according to our model. This to check for a linear trend, so we want to see a straight line through the middle of the plot. We satisfy this need.

Our second plot fits the standardised residuals against their theoretical quantiles, basically we are testing are the residuals normally distributed. If they are, then the points will fit approximately along the diagonal line. There are some slight deviations at the ends of the line but these are not enough of a problem to reconsider the model.

In the third plot, we are testing homoscedasticity or in other words that variance in the residuals is constant. Again we want to see a straight line through the middle. There is a slight decreasing curve at the lower end of the fitted values though this tapers off back to a straight line. Again, not a big enough violation to be concerned about.

Finally, the last plot checks for any individual observations having a large effect on the results that could be a source of bias. We don't want to see high leverage values or standardized residuals. We have a few points that have been specifically marked by the plot due to the standardised residuals but ultimately 0.02 is a fairly low leverage. So again, I think we can be satisfied that the assumptions of the model have not been violated.

Estimating from the Model

We can use the same emmeans function to calculate and plot our estimated marginal means from the model. Note that we can use more than one variable now in our specification by chaining them together using +. We do not need to use an asterisk despite the interaction, though this would produce the same result. The interaction is accounted for either way.

emmeans(interaction_model_2,~farming_system+country)

Here we can see those patterns of little variation in Senegal, but higher yields for short fallow in Burkina Faso and Niger.

We can also specify different levels of a continuous variable for estimations by using at = list() argument. Within list we can specify values of one or more continuous variables to estimate. In this case we can use c(1,2,3,4) since these are the numbers in our data.

In this case however we are going to use interaction_model1 to so we can visualise that interaction.

emmeans(interaction_model_1,~nplots+country, at = list(nplots = c(1,2,3,4)))%>% data.frame()

We are actually able to move into our usual ggplot code from these emmeans functions as we are able to create a standard dataset from them using data.frame()

emmeans(interaction_model_1,~nplots+country, at = list(nplots = c(1,2,3,4)))%>%  data.frame() %>%
  ggplot(aes(y=emmean,x=nplots,ymin=lower.CL,ymax=upper.CL))+
    geom_point()+
      geom_errorbar()+
        facet_wrap(~country)

We can see that pattern in gradients we detailed in our interpretation, Burkina Faso decreases with each additional plot on average while there are much more minimal changes in Niger and Senegal.

Exercise: Using the code from the first chunk in this section, pipe this output into a data frame and use this to also plot the means and confidence intervals for the relationship between farming system and country.

emmeans(interaction_model_2,~farming_system+country)

emmeans(interaction_model_2,~farming_system+country)%>%
  data.frame() %>%
  ggplot(aes(y=emmean,x=farming_system,ymin=lower.CL,ymax=upper.CL))+
    geom_point()+
      geom_errorbar()+
        facet_wrap(~country)