Introduction
In this module we will talk about statistical modelling, by referring to one of the simplest statistical models as our main reference point - looking at how one numeric variable x relates to a different variable y.
This model, commonly known as a linear regression, is part of a much larger family of models which share a lot of common DNA. In terms of the underlying formulae, the practicalities of applying and interpreting the results, and the assumptions required of your data, simple linear regression, t-tests, ANOVA, ANCOVA, multiple linear regression (and many others beyond these) can all be considered to be the same thing - a "general linear model". I will explain more in the video:
So this week we will be looking into simple linear regression models and extending them into "general linear models"; then in the next module we will look at the many ways we can extend this even further, for the situations where the assumptions underlying the general linear model no longer hold.
Data used in this session
The data used for this session is the same one that we saw back in the Exercises for the Exploratory Data analysis module concerning earthquakes near Fiji.
The data contains the co-ordinates, the depth of earthquake, the magnitude of the earthquake on the Richter scale, and the number of stations reporting the earthquake. Although we do not have the date of the earthquakes we do have the order in which they occurred.
This is one of the Harvard PRIM-H project data sets. They in turn obtained it from Dr. John Woodhouse, Dept. of Geophysics, Harvard University.
Why do we need a model anyway?
The hypothesis tests covered in Module 3 work under extremely specific circumstances; where we have data from one (or maybe two) variables collected and we want to test an extremely specific hypothesis - e.g. "the mean magnitude of earthquakes in the area close to Fiji is the same as the mean magnitude of earthquakes in the area close to New Zealand".
But reality is rarely that simple! We will probably have collected or obtained a lot more data than just the variables that can fit into a simple hypothesis. And it is likely you will want to try to understand the overall system, rather than looking only at a very specific comparison.
When thinking in terms of a model, the first step will be to separate out our variables into two categories. This is not based on what "sort" of variables they are (numeric/categorical/date etc.) but on the role they will play in our model:
- "Outcome"/"Output"/"Dependent"/"Response"/"y" variables
- "Predictor"/"Input/"Independent"/"Explanatory"/"x" variables
(I have given a few different options above for terminology, as this is an area where seemingly everyone has different preferred terminology to represent exactly the same thing!)
We want to know to what extent our x variables influence our y variable. For example, suppose you have soil quality measurements from different treatment groups, collected from different locations, along with data about the altitude and climate of those locations. In this case the soil quality measurements are clearly the y variable - we would want to know how much the altitude, climate, treatment, and location combine to influence the quality of soil. It would not make so much intuitive sense to think of how the quality of soil may affect the climate; or how soil quality may have affected the choice of treatment.
A general linear model uses the combination of our x variable(s) to construct some sort of a mathematical equation which can best explain the variability in the y variable(s).
And once we have a model we will be able to:
1. Quantify the extent that our independent variable(s) are related to our response variable(s), and understand the way in which these independent variable(s) are linked to the response.
2. Summarise the extent to which our model cannot explain the outcome variable - through exploration of the residual error term.
3. Test key hypotheses of interest, which account for multiple sources of variation, & can adjust for structural elements of the research design
4. Predict results for future observations, and quantify the level of uncertainty in these predictions
However our model will only be able to do this if it meets the assumptions required for it to be valid.
Simple linear regression
For a "simple linear regression" we are interested in how changing the value of one "independent" numeric variable will influence the values of our "dependent" numeric variable.
There is a nice video from Crash Course Statistics here which provides a refresher of the methodology specific to linear regression
When we explored the quakes dataset before we were interested in the relationship between how many monitoring stations reported the earthquake, and how large the magnitude of the earthquake was.
One of the most important steps now is to decide which of these variables should be the "dependent" variable and which should be the "independent" variable. Do we want a model which says:
- "As the number of stations reporting an earthquake increases; we would expect there to be a change in the magnitude of an earthquake"
- "As the magnitude of an earthquake increases; we would expect there to be a change in the number of stations reporting the earthquake"
There is a clear temporal relationship between these two variables: the earthquake happens and then each of the stations either does or does not report it. An increased number of stations reporting the earthquake cannot influence the magnitude of the earthquake, as the earthquake will have already happened! But an increased magnitude of earthquake will plausibly influence the number of stations reporting.
Let's first of all explore our data, or remind ourselves of our previous explorations, with magnitude, our "predictor"/"independent"/"input" variable, on the "x" axis; and stations, our "response"/"dependent"/"outcome" variable, on the "y" axis.
Looking at this graph, we can see a pretty clear trend - as magnitude is increasing the number of stations is also increasing.
If you followed through the exercises in Module 1, you will perhaps remember that we already noted some fairly clear deviations away from a simple straight line - you can see from the interactive graph above the difference between a straight line and a smoothed line drawn on this plot.
But for now - let us temporarily ignore those concerns, and see what happens when we work through an example of fitting a linear regression onto this data. In the next module we will explore what to do when the assumptions regarding a general linear model do not hold!
Correlation
As a quick reminder, we can also calculate a correlation coefficient as part of our exploration process; and we can also construct a statistical hypothesis test around this value, asking whether the level of correlation is statistically significant. Our null hypothesis would be "the correlation coefficient is zero", i.e. "there is no correlation between magnitude and the number of stations reporting".
cor.test(quakes$mag,quakes$stations)
##
## Pearson's product-moment correlation
##
## data: quakes$mag and quakes$stations
## t = 51.231, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8331527 0.8674048
## sample estimates:
## cor
## 0.8511824
The (Pearson) correlation is pretty strong, 0.85. So, unsurprisingly, the hypothesis test tells us that we have some very strong evidence against our null hypothesis, with a t statistic of 51.23 and a p-value so small that when we run the test using R it does not even try to report it in full.
Remember those numbers, as you may see them crop up again later!
But simply looking at the correlation coefficient doesn't tell us that much, as compared to what we can learn even from a simple statistical model like a linear regression.
Fitting a line to the plot
You probably learnt something at school about the idea of 'putting a line of best fit' through your points when you see a scatter plot.
But if we draw scatter plots by hand and use a ruler to draw a line, we will end up with a subjective decision on exactly where to place the line. The linear regression model uses a technique called "ordinary least squares", or OLS, to determine the "best" possible 'line of best fit' through the scatter plot. This is under the assumption that the "best possible" line of best fit is a single straight line.
One of the most important concepts in statistical modelling is the concept of the residual value from the model.
This is the difference between the observed value of y and the expected, or predicted, value of y. Every data point will have its own individual residual value: the observed value of y minus the expected value of y. If the line passes exactly through a point then its residual value will be (approximately) equal to zero, as in this case the observed and expected values are almost identical.
If a point is a long way above the line, the residual value will be positive. On the plot below we can clearly see one earthquake with a magnitude of 5.0 and 100 stations reporting. For an earthquake of magnitude 5.0 we would expect to see only around 50 stations reporting, therefore this point has a high residual value of around +50. Similarly, points a long way below the line will have large negative residual values.
The "ordinary least squares" method works by considering every possible straight line which could be drawn onto the scatter plot, and then determining which of those lines would minimise the sum of the squared 'residual' values from the data.
We "square" the residual values - because there will be a mixture of positive and negative values that we want to account for equally. The algorithm determines the line which gives the sum of these squared residual values that is the "least". The "ordinary" part of the name was added a bit later on, once enough people had started to modify this algorithm in different ways to fit slightly different purposes!
Turning a line into an equation
So let's dive in to what this model is numerically, instead of just visually. And in fact it will look like a formula you have probably seen before in a different context.
You will almost certainly have been taught the "equation of a straight line" when you were at school. Depending on when and where you went to school this may have been in either the form y = a + bx or y = mx + c.
In either case there is an equation with two variables, y and x, and two parameters - an "intercept" (a/c) and a slope (b/m).
The ordinary least squares algorithm determines the values of these two parameters for the 'best' straight line that links together x with y. But when dealing with statistical models we change the notation slightly.
So that our model notation can be consistent for a simple straight line, with just two parameters to be estimated, as well as for a much more complex multi-dimensional model with hundreds of estimated parameters, we need to allow for a bit more flexibility - otherwise we may run out of letters to use algebraically. So we instead consider the totality of all of our estimated parameters to be a vector called β. Within β, the individual parameters are β0, β1, β2 and so on.
For a simple linear regression we have two parameters - β0 is used to denote the 'intercept' and β1 to denote the 'slope', so the estimated equation becomes
y = β0 + β1x
Which is the same as the equation of the straight line, after doing a little substitution of letters!
There is one additional component to add in - the error term. The model will never be 'perfect', so we also need to account for and quantify the deviations from the trend; the residual values that we mentioned earlier.
We will use \(\epsilon\) to denote this term.
y = β0 + β1x + \(\epsilon\)
This error term is where three of the most important assumptions of the linear regression model come in - we assume that:
1. The individual errors are all independent of each other
2. The error is not dependent on the value of x or y; i.e. it is the same throughout the range of observed values
3. The errors follow a normal distribution
We will come back to explore these assumptions later. For now, let's assume we have no problems with the assumptions, and take a look at the computer output for the model.
It will look a little different in whatever software you use, but the model itself will be the same.
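In R the model is fitted with the lm() function. A minimal sketch follows (the object name quakes_lm1 is just an illustrative choice); printing the fitted object gives the basic output shown below.

# Fit the simple linear regression of stations on magnitude
quakes_lm1 <- lm(stations ~ mag, data = quakes)

# Printing the fitted object shows the call and the estimated coefficients
quakes_lm1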
##
## Call:
## lm(formula = stations ~ mag, data = quakes)
##
## Coefficients:
## (Intercept) mag
## -180.42 46.28
The basic output from fitting the model in R shows us our estimated parameters, which it refers to as coefficients:
β0 = -180.42 β1 = 46.28
So we can map these up into a formula that represents the overall model:
y = -180.42 + 46.28 * x + \(\epsilon\)
Or alternatively:
Number of Stations = -180.42 + 46.28 * Magnitude + \(\epsilon\)
The value of the mag coefficient (β1) represents the "slope": the change we expect to see in y for a one unit increase in x. So comparing two earthquakes, where one has a Richter scale magnitude 1 point higher than the other, we would expect around 46 more stations to report the larger earthquake. Because we are dealing with a simple linear regression the slope remains the same throughout the whole range of values and is not impacted by any other variables - it does not matter whether we are comparing two earthquakes of magnitude 4.0 and 5.0; or 4.7 and 5.7; or indeed 8.1 and 9.1.
The value of the intercept is interpreted as the expected value of y where x is equal to zero. You may have been taught something like "the value of the intercept is where the line hits the y axis". But take a look back at the scatter plot with the line fitted, and you will see that "the line hits the y axis" somewhere just above a value of 0; definitely not at -180.42.
So our interpretation of the intercept is that for an earthquake recording 0.0 on the Richter scale we would expect a total of -180 stations to report it. For obvious reasons this is not an especially sensible interpretation! This in itself is not a problem with the model - extrapolating outside of the observed range of values is extremely dangerous, and generally not recommended. It is more important that the model provides coherent results within the observed range of values (the "domain" of the data).
Summarising the model
The output at this stage has not told us anything about the error term, \(\epsilon\), so let's look at a more detailed summary to explore it further.
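In R this more detailed output comes from calling summary() on the fitted model object - again assuming the model has been stored, here under the illustrative name quakes_lm1.

quakes_lm1 <- lm(stations ~ mag, data = quakes)

# The summary() method adds residual summaries, standard errors, t-tests,
# the residual standard error and the R-squared values
summary(quakes_lm1)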
##
## Call:
## lm(formula = stations ~ mag, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.871 -7.102 -0.474 6.783 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -180.4243 4.1899 -43.06 <2e-16 ***
## mag 46.2822 0.9034 51.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared: 0.7245, Adjusted R-squared: 0.7242
## F-statistic: 2625 on 1 and 998 DF, p-value: < 2.2e-16
R breaks down its summary into three main components:
"Residuals"
This component produces summary statistics around the residual values. This can help us in two main ways: i) to identify if there are any particularly badly predicted points and ii) to assess the shape of the distribution of the residuals
As we saw in the scatter plots there is one earthquake that was reported by 50 stations more than would have been expected; and another that was reported by 49 stations less than expected. These correspond to our "min" and "max" residual values. If we see any particularly extreme looking values in this output, that will be an indication of some points that really do not fit the overall trend being fitted.
These summaries also help us to assess whether there is any skew in the distribution of our residuals - ideally we would want to see a symmetrical distribution with a median close to zero. The mean value of the residuals will always be equal to zero by definition; this is part of the way in which the OLS algorithm calculates the model. So if we see big differences between the absolute values of the min and max, or between the absolute values of the upper and lower quartiles (1Q and 3Q), or a median which is a long way from 0, then this is a strong indication that there may be some issues with the model validity, as the shape of the distribution is not symmetrical.
In this case, there are no immediate red flags from this output - the summaries of the residuals look to be approximately symmetrical, and the median value is pretty close to zero.
"Coefficients"
We have already seen the values of the coefficients that make up the equation of the straight line; but these coefficients are now combined with a standard error value. Based on the residual variability we can calculate the uncertainty in these coefficients, through the standard error, and conduct a hypothesis test around these estimates.
There are two hypothesis tests in this part of the output:
The hypothesis test for the "mag" coefficient is based around the null hypothesis that the value of the coefficient is equal to 0. If this were true this would indicate that the line on the plot would be perfectly horizontal, and as mag
increased the value of stations
would not change.
The hypothesis test around the "slope" parameter is usually of some interest, as this helps us to determine whether or not our x
variable has a significant relationship with y
. But the hypothesis test around the "intercept" parameter is rarely of any interest.
This is testing a much more specific hypothesis - whether or not the intercept value is significantly different from 0. In most cases we are not going to be interested in whether or not the intercept is equal to zero. So, although we can usually ignore its direct interpretation and the associated hypothesis test, an intercept term remains an essential mechanistic component of any model.
We do always have the choice to exclude it from the model if we wanted to - this would force the intercept to be equal to 0, meaning in a linear regression model that the line would go through the origin (x=0; y=0).
There may be some specific cases where this would be a good idea: if it is a known scientific property that when x is equal to zero then y must also be equal to zero, and if x=0 is within the observed range of the data.
While the first part of that is true here - a non-existent earthquake must by definition be reported by no stations - the second part is not. Looking at the plot below, I don't think it would make any sense within this dataset, where x is always greater than or equal to 4!
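For completeness, this is how such a no-intercept model could be specified in R - shown purely as a sketch of the syntax, not as a recommendation for this dataset.

# Removing the intercept forces the fitted line through the origin (x = 0, y = 0).
# Either of these formula specifications works in R:
lm(stations ~ mag - 1, data = quakes)
lm(stations ~ 0 + mag, data = quakes)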
"R-Squared & F statistics"
The final component of the output tells us more about the residuals: how much of the total variation in the y variable is explained by the model, along with a hypothesis test of whether the amount of variation in y that the model can explain is greater than 0.
##
## Call:
## lm(formula = stations ~ mag, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.871 -7.102 -0.474 6.783 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -180.4243 4.1899 -43.06 <2e-16 ***
## mag 46.2822 0.9034 51.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared: 0.7245, Adjusted R-squared: 0.7242
## F-statistic: 2625 on 1 and 998 DF, p-value: < 2.2e-16
The residual standard error is equivalent to the standard deviation of the residuals. So this gives us some idea of how much our points deviate away from the line.
The R-squared values quantify what percentage of the variability in the y variable is explained by the model. The multiple R-squared value of 72% tells us that 72% of the total variability in the number of stations can be explained by the linear increase in the magnitude of the earthquake. You can read a little bit about R-squared, and why the adjusted R-squared is also a useful metric to consider, here.
And we have actually seen this number before! There is a clue in the name "R-squared": the Pearson correlation coefficient is usually denoted by the letter r (or by the Greek letter rho for the population value)...
cor(quakes$stations,quakes$mag)**2
## [1] 0.7245115
The R-squared value is indeed the Pearson correlation coefficient squared. So there is nothing that "correlation" can tell us that cannot also be explained through a modelling approach. But "correlation" tells us one extremely specific thing - whereas building a model can provide us with many different insights, and a framework to build and expand upon.
If this R-squared value is equal to 0% then this means that a) our β1 slope coefficient would also be equal to 0 and b) the "residual standard error" would be equal to the standard deviation of y.
If this R-squared value is equal to 100% then this means that a) our residual standard error would be equal to 0 and b) we have probably done something stupid!
It is very possible to get an R-squared value of 100% for one of two reasons:
- If you perform a regression model where there is a known perfect relationship. e.g. "Temperature in Celsius" as the y variable and "Temperature in Fahrenheit" as the x variable. A slope and an intercept are needed to convert between the two (C= -17.8 + 0.56F), but by definition there is no variability in that relationship.
Although this seems like an obvious mistake to make, it is a surprisingly common one, albeit usually in slightly more subtle ways - for example, if some of the variables being treated as "independent" x variables are by definition sub-components of the variable being treated as the "dependent" y variable.
- You have "over-fitted" your model - and included more parameters than observations.
Let's imagine that we exist in a universe where there have only ever been two earthquakes, so we just have two data points to work with. Try pressing the button to "select two points" multiple times, to see what happens to the results each time.
Whichever two points we select we can always draw a straight line connecting the two points; thus causing all of our residuals to be 0, all of our standard errors to be incalculable, and our R-squared value to be 1.
We have created a model that perfectly explains all of our observations! But it is likely of no use at all when trying to make any interpretations, conclusions or predictions outside of those two individual observations.
The model output here will always look extremely odd - so an error like this is very easily spotted. But it would be much more common to accidentally create a situation that is very similar to this, but not quite so extreme. And that is where the "Adjusted" R Squared comes in very useful.
Let's now imagine that we have three earthquakes, rather than two. Again try multiple simulations to see what happens each time:
With just 3 observations we will always be able to fit a line that is a pretty good fit for at least 2 of the 3 observations.
When you run through different simulations you should spot that the "Multiple R-squared" value usually ends up being very high. The "Adjusted R-squared" value however will jump around a lot more: sometimes it will be pretty close to the multiple R-squared value, where all 3 points fall close to the line; but often it will look extremely different, and can even be negative, where the three points cannot be so coherently joined by a line.
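If you want to mimic those simulations yourself, here is a rough sketch in R that repeatedly samples three earthquakes and compares the two R-squared values (the seed and object names are arbitrary choices):

set.seed(42)  # arbitrary seed so the simulation is repeatable

# Repeat the "three random earthquakes" simulation a few times
for (i in 1:5) {
  idx <- sample(nrow(quakes), 3)                      # pick 3 earthquakes at random
  fit <- summary(lm(stations ~ mag, data = quakes[idx, ]))
  cat("Multiple R-squared:", round(fit$r.squared, 3),
      " Adjusted R-squared:", round(fit$adj.r.squared, 3), "\n")
}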
The adjusted R-squared formula tries to account for over-fitting and will penalise models that have too much complexity for the amount of data available.
Where you see that these two R-squared values are very different from each other, that is a strong indication that you have tried to fit a model which is too complicated for your data to support. When the difference between the two values is more than 1-2%, it is likely that there are potential issues of over-fitting within your model.
Although as you may have seen in the simulations above, the reverse is not quite true - a similar value of the two "r-squared" results, within 2% of each other, does not always guarantee that there is no over-fitting, particularly in extremely small datasets like those we were simulating.
Extending our model
The real beauty of starting with a "general" linear model is how extendible the model framework is. When we constrain ourselves to thinking about "linear regression" or "t-tests" or "ANOVA" we have very tight guidelines on what our x variables need to look like. As long as our y variable is a numeric variable, and our model meets the assumptions, it doesn't matter what our x variables look like. And to build up a comprehensive picture of the overall system we will need more than one variable to be included.
If you have explored this dataset before, you may remember we had two quite clear geographic clusters. Let's refer to them as "West" and "East"
cluster | mean(stations) | sd(stations) |
---|---|---|
East | 32.68931 | 21.64926 |
West | 36.24390 | 22.68149 |
The general distribution of the number of stations seems pretty similar, but it looks like there is a small difference with the box for the "West" cluster shifted up slightly - an average of 36 stations reporting compared to an average of 33 in the Eastern Cluster.
We could do a t-test for this difference; although as I explained in the video for this module - a t-test is just a very specific form of a linear model. So let's assess this difference through a linear model:
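A sketch of how this model could be fitted in R follows. Note that the cluster variable is not part of the built-in quakes dataset, so here we assume a factor column called cluster (with levels "East" and "West", derived earlier from the coordinates) has already been added to the data frame.

# cluster is assumed to be a factor with levels "East" and "West",
# derived earlier from the latitude/longitude of each earthquake
quakes_cluster_lm <- lm(stations ~ cluster, data = quakes)
summary(quakes_cluster_lm)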
##
## Call:
## lm(formula = stations ~ cluster, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.244 -15.689 -6.689 8.311 96.311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.6893 0.7754 42.156 <2e-16 ***
## clusterWest 3.5546 1.7127 2.075 0.0382 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.86 on 998 degrees of freedom
## Multiple R-squared: 0.004298, Adjusted R-squared: 0.0033
## F-statistic: 4.308 on 1 and 998 DF, p-value: 0.0382
We do see a small but significant effect (p = 0.038), therefore we can conclude that the difference is significant at the 5% level. The number of stations reporting in the "West" is on average 3.55 higher than in the East. The R-squared value of 0.4%, however, tells us that this is indeed a very small difference.
So rather than this one small effect - what we should be thinking about more is how this interacts with the magnitude of the earthquake, to help us predict the number of stations.
We can plot this onto our scatter plot, using different symbols for the two clusters:
There are two big questions to be asked here when adding in a variable like this:
1 - Are the lines for the groups distinct? If yes: then it looks like the cluster variable has an additional effect on the outcome on top of the magnitude variable, and the intercept of the line needs to be estimated separately for each group
2 - If yes: are the distinct lines parallel? If no: then there is an interaction between the cluster variable and the mag variable, and the slope of the line needs to be estimated separately for each group.
So in this case it looks like the answer to question one is "yes"; the line for the East cluster is shifted up from the West cluster. And the answer to question two is "yes" - the slopes of the two lines are almost indistinguishable.
These questions are in addition to all of the previous questions we would have considered about the single line regression model; again note the smoothed lines deviate quite noticeably from being straight.
The nice thing about working with a general linear model framework is that adding in terms works very intuitively. This is an additive model, so adding a new variable to the formula works in a very similar way:
y = β0 + β1x1 +β2x2 + \(\epsilon\)
Where x1 is the magnitude, and x2 is a binary variable indicating which cluster an earthquake is in. This binary variable will effectively shift the intercept of the line according to the value of β2.
When thinking of categorical variables, we generally do not think of the coefficients being "slopes" in the same way we would with a numeric variable. Instead we usually think of this parameter as giving "difference in the expected value compared to the reference level". In this case our "reference level" will be the "East" cluster, and the β2 coefficient will give the difference between the expected value for the "West" and the "East" clusters. This is why we tend to more generically refer to these as "coefficients" rather than "slopes".
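Behind the scenes, R converts the factor into exactly this kind of 0/1 dummy variable. Here is a small sketch of how you could inspect that coding and fit the additive model (again assuming the cluster column described earlier exists):

# Inspect the 0/1 dummy coding R creates for the cluster factor:
# the clusterWest column is 1 for "West" earthquakes and 0 for "East" (the reference level)
head(model.matrix(~ mag + cluster, data = quakes))

# Fit the additive model with both magnitude and cluster
quakes_lm2 <- lm(stations ~ mag + cluster, data = quakes)
summary(quakes_lm2)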
##
## Call:
## lm(formula = stations ~ mag + cluster, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.046 -6.895 -0.214 6.334 47.932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -184.5410 4.1573 -44.390 < 2e-16 ***
## mag 47.4302 0.9035 52.496 < 2e-16 ***
## clusterWest -5.7926 0.9010 -6.429 1.98e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.28 on 997 degrees of freedom
## Multiple R-squared: 0.7355, Adjusted R-squared: 0.7349
## F-statistic: 1386 on 2 and 997 DF, p-value: < 2.2e-16
So our new model is:
Number of stations = -184.54 + 47.43 * magnitude - 5.79 * Cluster="West"
Or in terms of the straight line being fitted:
For the East Cluster: Number of stations = -184.54 + 47.43 * magnitude
For the West Cluster: Number of stations = -190.33 + 47.43 * magnitude
We can also interpret our coefficients directly - but this time, and for any model with more than one x variable, we have to add one important caveat: that we hold all other variables constant:
The "intercept" is -184.5 - this is the number of stations we would expect for an earthquake with 0 magnitude AND located in the East cluster.
The coefficient for magnitude is 47.43 - comparing two earthquakes with a one unit difference in magnitude we expect the number of stations reporting to be 47 higher for the larger earthquake WHERE the earthquakes occurred in the same cluster.
The coefficient for cluster is -5.79 - we expect the number of stations reporting an earthquake in the West cluster to be 5.8 lower than in the East cluster WHERE the earthquakes have the same magnitude.
Other components of the output work in a similar way - the residual values represent the difference between the expected value of y, which is now derived from both of the x variables, and the observed value of y. And the R-squared value represents the proportion of the total variability in y explained by the two x variables together. Something you might have spotted is that the R-squared value for this new model is 73.6%. The R-squared value for the model with just magnitude was 72.4%, and for the model with just cluster it was 0.4%. So the R-squared value of the combined model is greater than the sum of the two individual R-squared values. This is not an uncommon occurrence - and again shows how important it is to be able to capture the whole system within a single model.
Simpson's Paradox
But there is something else you might have noticed in comparing the different outputs we have seen so far:
The average number of stations reporting within the West cluster was higher than within the East cluster. But in our combined model we conclude that the "East" cluster has a higher number of stations reporting than the "West" cluster: the model coefficient is negative and the "West" line is lower on the plot.
##
## Call:
## lm(formula = stations ~ mag + cluster, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.046 -6.895 -0.214 6.334 47.932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -184.5410 4.1573 -44.390 < 2e-16 ***
## mag 47.4302 0.9035 52.496 < 2e-16 ***
## clusterWest -5.7926 0.9010 -6.429 1.98e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.28 on 997 degrees of freedom
## Multiple R-squared: 0.7355, Adjusted R-squared: 0.7349
## F-statistic: 1386 on 2 and 997 DF, p-value: < 2.2e-16
So it looks like our conclusion has reversed!
This is because there is also a hidden correlation between magnitude and cluster:
cluster | magnitude | mean(stations) | sd(stations) | n | percent
---|---|---|---|---|---
East | 4-4.49 | 18.8 | 6.5 | 336 | 42.3%
East | 4.5-4.99 | 31.3 | 12.0 | 315 | 39.6%
East | 5+ | 68.2 | 21.4 | 144 | 18.1%
West | 4-4.49 | 19.3 | 7.4 | 41 | 20.0%
West | 4.5-4.99 | 29.8 | 10.6 | 110 | 53.7%
West | 5+ | 62.4 | 26.5 | 54 | 26.3%
- Earthquakes in the Western cluster have, on average, a higher magnitude. In particular, there are proportionally far fewer Western earthquakes in the lowest range of 4-4.5 magnitude.
- As magnitude increases, so does the number of stations reporting.
- Therefore earthquakes in the Western cluster have, overall, a higher number of stations reporting them.
- As magnitude increases, so does the number of stations reporting.
- BUT when comparing two earthquakes of the same magnitude, earthquakes in the Eastern cluster have, on average, a higher number of stations reporting them than those in the Western cluster.
We have hit on an example here of "Simpson's paradox" - when the observed findings without adjusting for a key source of variation are seemingly in contradiction to those obtained after adjusting for that source of variation.
https://en.wikipedia.org/wiki/Simpson%27s_paradox
It is for exactly this sort of reason that models are so powerful: they help us understand how the whole system fits together. Looking at each relationship through a series of separate hypothesis tests would not have been able to tell us this fairly key piece of information.
Interactions
When we consider that the lines on the scatter plot may not be parallel to each other then we start to consider the "interaction" between variables:
y = β0 + β1x1 +β2x2 + β3x1x2 + \(\epsilon\)
The additional β3 term modifies the effect of magnitude. So we now have two straight lines being drawn with different slopes and different intercepts.
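In R's formula syntax the * operator expands to the main effects plus their interaction, so the two specifications below are equivalent (cluster is again the assumed factor column, and the object name is illustrative):

# mag * cluster is shorthand for mag + cluster + mag:cluster
quakes_lm3 <- lm(stations ~ mag * cluster, data = quakes)
# equivalent: lm(stations ~ mag + cluster + mag:cluster, data = quakes)
summary(quakes_lm3)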
So our new model is:
##
## Call:
## lm(formula = stations ~ mag * cluster, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.348 -6.907 -0.144 6.276 48.087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -183.8459 4.6707 -39.362 <2e-16 ***
## mag 47.2784 1.0160 46.532 <2e-16 ***
## clusterWest -9.2390 10.5749 -0.874 0.383
## mag:clusterWest 0.7277 2.2248 0.327 0.744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.28 on 996 degrees of freedom
## Multiple R-squared: 0.7355, Adjusted R-squared: 0.7347
## F-statistic: 923.2 on 3 and 996 DF, p-value: < 2.2e-16
Which is becoming a little complicated to write out in full; but simplifies down again when we consider the lines being drawn on the scatter plot for the east and the west.
For the East Cluster: Number of stations = -183.85 + 47.28 * magnitude
For the West Cluster: Number of stations = -193.08 + 48.01 * magnitude
How we interpret the other results in the summary table also becomes important - notice that in this final model the effect of cluster does not appear to be significant, whereas in the previous model it did. That is because we cannot interpret the coefficients OR the standard errors/hypothesis tests without holding everything else constant.
So where we have a model including this sort of interaction term, the only hypothesis test we can always reliably interpret from the summary table is the one in the final line, concerning whether or not the interaction term is significantly different from zero. And in this model we do not have sufficient evidence to reject this null hypothesis, as the p-value is 0.744.
Interpreting the hypothesis test for "cluster" would require us to hold the interaction term constant - which can only be done where the magnitude is equal to zero. i.e. this hypothesis test with a p-value of 0.383 is considering whether there is a significant difference between the west and the east cluster for an earthquake of magnitude zero. And that is not a sensible thing to be considering when all of our observations are of earthquakes with magnitude 4 or above!
So do not be tempted to dismiss the importance of variables where you see non-significant results in the presence of interactions. When there is more than one variable different software packages may present different hypothesis tests and p-values around the effects for each variable.
The hypothesis test for "mag" could be interpreted though. As the cluster variable is explicitly coded as a 0/1 binary variable therefore forcing it to be zero for purpose of interpretation is actually a useful constraint. So this term being interpreted as the slope parameter for the "East" cluster; so the hypothesis test is whether there is a significant trend between mag
and stations
within the East cluster.
When building a model it can be tempting to include every possible variable we can think of in it.
As you are starting to see, increasing the number of variables is mathematically possible (as long as you have more observations than variables), but it:
- increases the complexity of being able to interpret what the model is actually telling us.
- can make the model over-fitted, and less stable, as we saw in the simulations earlier.
There is a statistical principle of "parsimony" - that we do not want to make things more complicated than they need to be. Given the lack of a significant interaction term, then the 'best' model is probably the one without this term included.
There are many ways of making that determination - looking at statistical significance, looking at the change in the R-squared values, or looking at other model fit metrics which are a little more nuanced than statistical significance. If you are interested in these you can explore more about the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) (https://medium.com/@jshaik2452/choosing-the-best-model-a-friendly-guide-to-aic-and-bic-af220b33255f)
Making predictions
As our model takes the form of an equation we can plug in any value of x and receive a predicted value of y. Remember that our data for magnitude appears in 0.1 jumps - but we could, for example, predict the expected number of stations for a 4.75 magnitude earthquake in the "West" cluster.
Expected Number of Stations = -184.54 + 47.43 * magnitude - 5.79 * Cluster="West"
= -184.54 + 47.43 * 4.75 - 5.79
= 34.96
But beyond this predicted value there are two useful intervals we can calculate to help link the error term to this prediction: a confidence interval, and a prediction interval.
Interval | mag | cluster | .fitted | .lower | .upper |
---|---|---|---|---|---|
95% Confidence Interval | 4.75 | West | 34.96 | 33.41 | 36.51 |
95% Prediction Interval | 4.75 | West | 34.96 | 12.78 | 57.14 |
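Rather than plugging numbers into the equation by hand, both intervals can be obtained from predict() in R. A sketch follows, assuming the additive model has been stored as quakes_lm2 and that the assumed cluster column exists:

quakes_lm2 <- lm(stations ~ mag + cluster, data = quakes)
new_quake <- data.frame(mag = 4.75, cluster = "West")

# Confidence interval for the mean number of stations at this magnitude and cluster
predict(quakes_lm2, newdata = new_quake, interval = "confidence", level = 0.95)

# Prediction interval for an individual new earthquake
predict(quakes_lm2, newdata = new_quake, interval = "prediction", level = 0.95)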
We have already come across the idea of a confidence interval - this represents the uncertainty in our estimate and gives the range of values in which we think the 'true' mean value lies. The confidence interval will decrease in size as the sample size used to build the model increases, as our understanding of the trend increases. For a 95% confidence interval, the interpretation is that if we took 100 hypothetical samples of earthquakes and constructed the interval each time, then in around 95 of these samples the interval would contain the true mean number of stations for an earthquake of 4.75 magnitude.
The prediction interval represents the variability in our observations. Out of 100 theoretical earthquakes of magnitude 4.75, we would expect 95 of them to fall inside the prediction interval. This interval will not decrease with an increase in the sample size - the variability in the events is not determined by how many events we have data on. Although we may end up with a more accurate estimate of the variability with a larger sample size, that estimate is equally likely to go up as it is to go down.
You should notice a big difference in the two intervals above! The confidence interval is very narrow. From the data available we are 95% confident that the mean number of stations reporting earthquakes with 4.75 magnitude in the West Cluster would be between 33.4 and 36.5. But for any individual earthquake, we are much less certain of exactly how many stations would report it - the prediction interval tells us we would expect it to be between 12.8 and 57.1.
The other difference to note is that the width of the confidence interval varies according to the value of x. It is at its narrowest where x is equal to the mean of x, and gets wider the further away we move from this value.
This is linked again to the relationship between the standard error and the sample size, but in a slightly more subtle way. We have the maximum amount of information available to us around the mean value of x. The more extreme we move in x, the fewer observations we have which are relevant for helping us to understand y at that value of x.
The prediction interval does not vary throughout the range of values of x - this is due to one of the assumptions that we make: the variation in our residuals is constant throughout the range of x, also known as "homogeneity".
The prediction interval will also not be at all robust to deviations from the assumption we make about the residuals being normally distributed; it will systematically miss plausible values from the interval if the distribution of residuals is in some way skewed.
The confidence interval is actually very robust to the assumptions we make about the error term regarding the underlying normality of the residuals, due to the central limit theorem.
But the validity of both of these intervals, along with everything else we have interpreted from this model so far, depends on all of the assumptions required for the linear regression model to be valid.
So let's focus on these assumptions in a little bit more detail, and look at how we can assess them so that in the future we can make our assessments on which sort of modelling approach may be needed without diving into lots of interpretation first.
Assumptions
It can be a little overwhelming at first to get your head around all of the assumptions underlying the general linear model. But they remain the same regardless of how many x variables we have, whether they are numeric or categorical, or even whether we are transforming them to create non-linear relationships. So it just needs a bit of time and persistence to get used to assessing these assumptions:
- That the relationship between each of the continuous variables, x, and the outcome y can plausibly be explained by a straight line ("linearity")
- That the variance away from the trend is the same throughout the range of predicted values for y ("homogeneity"), and throughout the ranges of each of the x variables ("exogeneity"). These assumptions are strongly linked, and if there is just one x variable then they are equivalent to each other.
- That the error terms of the individual observations are uncorrelated with each other ("independence")
- That the residual values have an approximately normal distribution ("normality")
- That there are no individual observations which are responsible for driving the overall trends ("leverage")
"Statistics by Jim" has a nice overview: https://statisticsbyjim.com/regression/ols-linear-regression-assumptions/ - some of these assumptions will bias the calculations of the slope and intercept, some of these will impact on the validity of inferences made on the basis of the error term - through calculation of confidence intervals/hypothesis tests/prediction intervals.
Let's go through these 5 points one by one:
Linearity
This is often the easiest one to spot when we have a small number of variables - and something we actually already noted in our exploratory analysis. Although a straight line is clearly not a terrible model for this data, there is very clear evidence of curvature in the relationship. The best way to try to identify this is by fitting a smoothed curve and a straight line side by side, and seeing the difference.
And we see a very clear S-shape in the trend.
But what is actually more important than this pairwise trend is to assess it across the whole model. Apparent curvature in one pairwise comparison could be explained by a different variable; e.g. if those points at the top of the curve where the trend flattens were all clustered in one geographic location. So to assess linearity we will usually plot the predicted values from the model on the x axis and the residual values on the y axis. It can often be the case that curvature in one variable's relationship with the outcome variable is explained through the way in which other variables fit into the overall model.
Within this plot we then look for any pattern at all. If all variables combined can provide a plausible linear relationship then there will be no trend within the plot - it will just look like random noise. Instead, what we see is something that looks like a clear curved trend:
The clear curvature in this plot indicates that the addition of the "cluster" variable has not been sufficient to solve the problem of linearity we saw in the original relationship between magnitude and number of stations.
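This kind of residuals-versus-fitted plot can be produced directly in R. A minimal sketch follows, using the additive model (stored here under the illustrative name quakes_lm2, with the assumed cluster column):

quakes_lm2 <- lm(stations ~ mag + cluster, data = quakes)

# Built-in diagnostic: residuals vs fitted values, with a smoothed trend line
plot(quakes_lm2, which = 1)

# Or build it manually from the fitted values and residuals
plot(fitted(quakes_lm2), resid(quakes_lm2),
     xlab = "Predicted number of stations", ylab = "Residual")
abline(h = 0, lty = 2)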
Homogeneity
There are a few different ways to examine this assumption.
The first is to look at exactly the same plot we have just created! But now, rather than looking for a trend, we are looking to see if the points form a triangle or diamond shape. Even if there was no trend:
- a diamond shape would indicate that the variability is larger towards the middle of the range of values; this can often happen if there is both an upper and a lower limit on the possible values of y, which would result in less variability at either extreme end of the scale.
- a triangle or funnel shape would indicate that as y increases the variability increases. This is a very common scenario in many bio-physical relationships - particularly if low values of x represent some form of stable or controlled set of conditions; e.g. where time/temperature/pressure is low then y may be low and stable, but as the time/temperature/pressure increases then y may also increase, and the volatility in y may increase alongside it.
So if we see no trend in this plot; and no patterns in the variation away from that lack of trend then this is a good sign for our model!
In the plot above we can quite clearly see points starting to "fan out" and become more variable as the predicted value increases. This is a very common phenomenon - as values increase so does the variability in those values. The issue does seem mostly constrained to the lowest predicted values though: above a predicted value of 30 stations the variation away from the line seems fairly consistent; below that point the variation seems lower. That would be consistent with the outcome variable having a lower constraint; and indeed we know that the number of stations reporting an earthquake must be at least 1 by definition.
Independence
This assumption is one that we need our own knowledge of how the data was collected to be able to assess:
- Do we have a simple random sample from a population, which we believe to be totally fair and unbiased, or something which we believe to be equivalent to a simple random sample?
If the answer to this is "yes" then we can happily tick this assumption off, and skip straight on to the next one. But if the answer is no then we have some follow up questions to consider:
- Could there be dependencies over time?
- Could there be dependencies over space?
- Was the data collected from a clustered or hierarchical process?
If the answer to any of those questions is "yes" then that does not mean that the model is necessarily wrong. It does mean that we should assess whether the level of dependence caused by those factors can be explained by the model. If the model can explain the dependence, then our model is effectively as good as it would be if we did have a simple random sample; if not, then we will have a potentially biased model.
So we assess this assumption by looking at the relationship between the residuals from the model and the time/space/clustering components. Whether there are relationships between the outcome variable and these factors is not relevant to the assessment of the validity of the model.
Let's look at time. Remember that we don't have the date of each earthquake, only the order in which they occurred.
Although there is a lot of noise, we do now see a slight increasing trend: the residual value is increasing over time. And this may make some pretty coherent sense - over time we would expect to see more stations coming into existence.
It does look like the correlation from one individual earthquake to the next is very low, which is perhaps again not too surprising. Ideally we would be plotting this trend over months, years or possibly decades, to get a better idea of the patterns in relation to a more coherent temporal factor. In the absence of that variable, let's consider the earthquakes in groups of 50.
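A rough sketch of how this grouping could be done in R, assuming the rows of quakes are in the order the earthquakes occurred and the model is stored under the illustrative name quakes_lm2:

quakes_lm2 <- lm(stations ~ mag + cluster, data = quakes)

# Assign each earthquake (in occurrence order) to a group of 50
group_of_50 <- ceiling(seq_len(nrow(quakes)) / 50)

# Mean residual within each group of 50, plotted in order
mean_resid <- tapply(resid(quakes_lm2), group_of_50, mean)
plot(mean_resid, type = "b",
     xlab = "Group of 50 earthquakes (in order)", ylab = "Mean residual")
abline(h = 0, lty = 2)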
From this summary we now see the increasing trend over time a bit more clearly, as a result of smoothing out some of the noise. For the earliest earthquakes the residuals are negative and we were over-estimating the number of stations by an average of around 2; for the final earthquakes we were under-estimating the number of stations by a similar amount.
Although we are not seeing evidence here of much correlation from one group of 50 to the next group of 50 - the trends are looking "spiky" rather than "smooth", and we are not seeing any clear repeating patterns.
It is a relatively small trend however, so this is probably not enough to make us think that a time series approach, or accounting for time explicitly in the model would yield substantially different results. But it does make me wonder how it would look if we did have the actual dates of the earthquakes within this dataset!
Normality
The important point to remember is that the only distribution that matters is the distribution of the residuals, and whether these are approximately normally distributed. Whether any of the x or y variables follow normal distributions is completely irrelevant.
This often surprises people, and on top of this even if the normality assumption does not hold, it is the least important of all the assumptions we have covered. Inferences from general linear models are actually very robust to non-normal distributions!
The coefficients themselves are very robust to non-normality in the residuals; the variance and standard errors are also fairly robust as long as we can invoke the central limit theorem. The only part of the modelling process that we have discussed that is extremely sensitive to deviations in normality is the prediction intervals.
Lack of normality in the residuals is often a side-effect of a bigger issue - in the linearity of the trend, or the homogeneity.
We can use some of the exploratory techniques to consider whether these residuals appear to be non-normally distributed, a histogram or a qq-plot:
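A sketch of those two checks in R (the model object name is the same illustrative choice as before):

quakes_lm2 <- lm(stations ~ mag + cluster, data = quakes)

# Histogram of the residuals
hist(resid(quakes_lm2), breaks = 30, main = "Histogram of residuals", xlab = "Residual")

# Normal quantile-quantile plot of the residuals
qqnorm(resid(quakes_lm2))
qqline(resid(quakes_lm2))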
It looks like there is some deviation from normality at the upper end of the distribution, with a slightly longer upper tail where the points start to fall away from the qq-plot line. But, to be honest, this looks very close to perfect. If this were the only concern we had with the model, we would have no problems at all! While there are fundamental problems with the model we have created, the normality assumption is not one of them.
Outliers & Leverage
Outliers are, in isolation, not a problem for the validity of the model, provided they are valid results rather than errors. In fact they may help us to get a more realistic estimate of the variability.
However they can be more problematic when the outlier points are also "high leverage" points.
Let's consider an example with a smaller number of earthquakes.
There is one observation with a high magnitude and a low number of stations reporting (coloured in red). There is also one earthquake of magnitude 4.5 which has a similar residual value to the larger earthquake (coloured in green). The residual values for both of these points are very similar - they are both a similar distance from the line. But the red point is going to cause a problem, and the green point is not!
To understand why we can conduct a 'sensitivity' analysis by seeing what happens if we exclude our outlying value. The red line below shows what happens if we exclude the "red" outlier; the green line shows what happens when we exclude the "green" outlier. And the blue line is the trend where all points are included.
Removing the "green" outlier has almost no impact on the model fit - the lines are very close to being directly on top of each other. Removing the "red" point has a very large impact on the model fit - taking it out really changes the slope and the intercept.
To identify issues within a larger model, without having to do repeated rounds of additional analysis, we can quantify which points have high leverage, and use a concept called "Cook's Distance" to work out which points have a dangerous combination of leverage and deviation from the trend: https://towardsdatascience.com/identifying-outliers-in-linear-regression-cooks-distance-9e212e9136a/
The plot above shows the relationship between the residuals and the leverage from the same restricted set of data. We can see the point that was "green" is flagged as point number 41, with a high residual value; the point which was red is flagged as point number 42. Cook's distance provides a trade-off curve between the two - where points fall inside the dotted lines there are likely to be no issues; where they fall outside the dotted lines, this shows that the individual point is making a significant impact on the overall model. And this plot is clearly flagging point 42 (the red point) as being problematic.
This plot can be produced as standard from any statistical package. If we go back and reproduce it from our full dataset, we do not see a problem:
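For example, in R the residuals-versus-leverage plot (with Cook's distance contours) is one of the standard lm diagnostic plots, and the Cook's distance values can also be extracted directly; a sketch on the full dataset, with the usual assumed object names:

quakes_lm2 <- lm(stations ~ mag + cluster, data = quakes)

# Residuals vs leverage, with Cook's distance contours overlaid
plot(quakes_lm2, which = 5)

# Or extract the Cook's distance values and inspect the largest ones
cooks <- cooks.distance(quakes_lm2)
head(sort(cooks, decreasing = TRUE))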
The potential for problematic high leverage points is much larger when we have a small data set.
So what do I do when my model assumptions do not hold up?
Depending on the reasons why the assumptions do not hold, there are many choices:
- Add extra variables into the model. Other variables may help to explain why there are non-linear patterns, or why some values appear to be outliers.
- Remove outliers, and conduct a sensitivity analysis. By comparing results with and without individual strange-looking points, this will help us understand whether these outliers have any impact on our conclusions.
- Transform variables. Different transformations may help to model non-linear relationships, particularly when considering different types of curves. Log transformations can also help with heterogeneity, in particular if variance is increasing as the response variable increases. Sometimes transformations might help to deal with issues with normality as well.
Where we still see problems, there are several things we could do to try to improve our model. Our actions would depend on what problems we observed; we will discuss these options more in Module 5 of the course:
- With severe problems of non-normality, non-linearity or heterogeneity of variance, we may consider moving to a generalised linear model which is a better conceptual fit for our data. This will make a different set of distributional assumptions about our residuals.
- With severe violations of independence we would look to use a modelling technique that allows us to have an error term that varies according to the dependency - this could be an ARIMA time-series model; a geo-spatial model; or, for clustered data, a mixed effects model.
With the model we have looked at so far there are 2-3 issues with the general linear model approach that we will need to deal with:
- The clear curvature in the trend
- The clear issue of non-constant variance
- The issue of non-independence of observations; this one is less clear cut but still enough to warrant considering different options
But more on that next time!
Exercises
We are going to take a look at some exercises using the same data you saw in the video - concerning soybeans in Australia.
Reference: Basford, K. E., and Tukey, J. W. (1999). Graphical analysis of multiresponse data illustrated with a plant breeding trial. Chapman and Hall/CRC.
Retrieved from: https://three-mode.leidenuniv.nl/data/soybeaninf.htm
This data contains the year the crop was harvested, the height of the plant in metres, the yield in metric tonnes per hectare, which one of four locations the plant was grown in, the protein content of the beans (in % of total weight), the oil content of the beans (in % of total weight) and whether it was a "local" or "improved" variety of soybean.
Question 1
We are going to start building a model to help us understand what factors relate to the protein content of soybeans. Let's begin by considering whether the oil content is linked to the protein content:
Before we build any formal models - take a look at the relationship, and start to think how it will map up to the linear model.
- Is the relationship likely to be significant?
- Does the trend look linear?
- What might we expect our coefficients to look like?
- Is the variability away from the trend consistent throughout the range of values?
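If you would like to recreate this exploratory plot yourself, here is a minimal sketch - assuming the data is loaded as a data frame called australia.soybean (a version of this dataset is available, for example, in the agridat package) with columns protein and oil:

# install.packages("agridat")   # if needed
library(agridat)

# Scatter plot of protein content against oil content, with a straight line
# and a smoothed curve to compare
plot(protein ~ oil, data = australia.soybean,
     xlab = "Oil content (%)", ylab = "Protein content (%)")
abline(lm(protein ~ oil, data = australia.soybean), col = "blue")
lines(lowess(australia.soybean$oil, australia.soybean$protein), col = "red")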
Question 2
Let's now take a look at the summary of the model:
##
## Call:
## lm(formula = protein ~ oil, data = australia.soybean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1643 -1.3377 0.0085 1.2610 5.0099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.84833 0.66788 85.12 <2e-16 ***
## oil -0.82945 0.03324 -24.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.913 on 462 degrees of freedom
## Multiple R-squared: 0.5741, Adjusted R-squared: 0.5732
## F-statistic: 622.8 on 1 and 462 DF, p-value: < 2.2e-16
Try to interpret as many of these numbers as you can!
- What is the equation which is being formulated by this model?
- How much of the variability in protein content of soybeans is explained by the oil content?
- Is the relationship between oil and protein statistically significant?
- What can be said about the residuals, and the amount of variation away from the trend from this output?
Question 3
Below is a plot of the expected value from the model (on the x axis) against the residual value (on the y axis), to help us assess some of the key model assumptions.
- What assumptions of the model would we be assessing from looking at this plot?
- Does our model meet these assumptions?
Question 4
We may be a little concerned that oil alone will not make a good model. So let's keep going and try something we haven't really seen yet - by adding in a second numeric variable. The heights of the soybean plants also seem to be correlated to the protein content:
So let's consider adding both variables into the same model.
Below you can look at the model summaries, and the residual plot, for 4 different competing models: each of the two variables individually, both variables included as independent variables, and both variables included along with an interaction term.
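A sketch of how these four competing models could be fitted and compared in R (with the same assumptions about the australia.soybean data frame as above, including a height column):

library(agridat)  # assumed source of the australia.soybean data frame

m_oil    <- lm(protein ~ oil, data = australia.soybean)
m_height <- lm(protein ~ height, data = australia.soybean)
m_both   <- lm(protein ~ oil + height, data = australia.soybean)
m_inter  <- lm(protein ~ oil * height, data = australia.soybean)

# Compare the summaries (coefficients, residual standard error, R-squared)
summary(m_both)
summary(m_inter)

# And the residuals-vs-fitted diagnostic plot for a chosen model
plot(m_both, which = 1)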
Which of the models looks like it might be the best fit for the data? Consider the values in the residual summaries, residual plots, the R-squared value, the significance of the coefficients, and the residual standard error of each of the models.
Can you interpret in 'real terms' what the parameters of your chosen model imply about the relationship of the oil content and the height of the soybean plant with the protein content?
Are there still any potential issues in the residual plot from this model?
If you decided upon the model with the interaction term as your preferred model, you may have found it a little confusing to map the parameters against a 'real world' interpretation.
You can use the plot below, based on predictions from the model, to help assess what the interaction term might be suggesting about the relationship. One of the two x variables is broken down into predictions made at specific values from across the observed range, and the other is included on the axis.
More Questions!
If you want to keep exploring this data more, you can download it here:
Perhaps you might want to keep building the model further to see whether the variety of soybean, and year it was harvested have any impacts on the protein content, and how they might fit into an overall model.
References
Simple Linear Regression in R (Marin Stats Lectures): https://www.youtube.com/watch?v=66z_MRwtFJM There are lots of subsequent videos on this channel about other statistical modelling techniques
Linear Regression Tutorial: (UC Business Analytics R Programming Guide) https://uc-r.github.io/linear_regression
Common statistical tests are linear models (Jonas Lindeløv): https://lindeloev.github.io/tests-as-linear/
Everything is a linear model (Daniel Roelfs) https://danielroelfs.com/blog/everything-is-a-linear-model/
When to use regression analysis (Statistics by Jim): https://statisticsbyjim.com/regression/when-use-regression-analysis/
OLS Linear Regression Assumptions (Statistics by Jim): https://statisticsbyjim.com/regression/ols-linear-regression-assumptions
UCLA Data Analysis Examples: https://stats.idre.ucla.edu/other/dae/