
Introduction

In this third workbook, we are going to take a closer look at hypothesis testing.

Hypothesis testing is the process that allows us to answer questions about our data, our population, and our models. It is a well-defined process that is always performed in the same way, although the calculations and the data involved will differ depending on the test performed or the question being answered. It is related to, and can be used alongside, some of the results we learned in Module 2, especially the concept of precision measures such as the confidence interval. We will also use the method of hypothesis testing in Modules 4 and 5, when we discuss model fitting.

This is a basic topic in statistics, because it lays the foundation of statistical inference, the process that allows us to make statements and answer questions about a population with incomplete information, as we only have the data contained in the sample.

Experience tells us that this is one of the hardest topics to learn too. We’ll try to avoid all the mathematical apparatus surrounding it, and concentrate on a few important concepts and on the process of testing statistical hypotheses.

In this session, we will cover:

  • What a statistical hypothesis is, and the concepts of the null and alternative hypotheses.

  • The concept of the p-value, whether to use one side or two sides of the sampling distribution to compute it, and how it relates to the confidence interval covered in Module 2.

  • What statistical significance, α, is.

  • The process of hypothesis testing.

  • Some basic statistical tests that use the hypothesis testing method.

  • The principles that tell us what p-values can and cannot do.

In the video, I go through the process of hypothesis testing, using examples and avoiding formulae.

In these workbooks, all output produced will come from the use of R, and all of the code used in the workbooks can be found on GitHub here: https://github.com/stats4sd/Stats_ReIntro_M3. However, the principles are identical regardless of which data analysis tool is being used.

Whichever tool is being used for data analysis, the user must be familiar enough with it to be able to generate any sort of summary, plot, or transformation that they might think could be useful!

Data being used in this workbook

In this session we are going to use a public dataset produced by the Ocean Biodiversity Information System (OBIS). It contains the historic whale catch records from Australian whaling stations (1925-1978). The full dataset and other details can be found here: https://obis.org/dataset/43e4c2cc-5886-427f-a4a9-d47d2635d246. It contains 38,267 records, but we will use a subset: the data corresponding to the most commonly caught whale, the Humpback (Megaptera novaeangliae), in the years 1949 to 1962.

We have also edited it to contain only the columns we are interested in. Below you can explore the data we are going to use in this workbook.

What is a 'Hypothesis'?

Question: What is the proportion of males among humpback whales?

Without knowing too much about whales, the obvious answer might be to guess that it is 50%. In fact, within our dataset there was one whale labelled as a hermaphrodite (not enough to make much of a difference to the percentage), but we will treat the variable of interest here as the percentage of whales which are male, for the sake of creating an easier-to-analyse binary variable.

But let’s have a look at the data to see the results. The figure below shows the percentage over the period 1949–1962. In each year, we find that at least 56% of the whales were male, with the figure as high as 80% in 1951.

If there were no difference in the underlying proportions of male and female whales, then each year there would be a 50:50 chance of having more male than female catches, or more female than male catches – just like flipping a coin.

However, in this dataset, we see that every year we have observed more males than females. If the gender ratio were indeed 50:50, this would be like flipping a fair coin 14 times in a row and getting heads every time. The probability of this happening is 0.5 raised to the power of 14, which is a very small number: 0.0000610.
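We can check this arithmetic directly in R:

0.5^14
## [1] 6.103516e-05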

If we observed this in a coin-flipping experiment, we could confidently claim that the coin was not fair.

So, we can ask ourselves whether the true proportion of males among humpback whales can really be 50%. The process we follow to answer a question like this is called hypothesis testing.

We can define a hypothesis as a proposed explanation for a phenomenon. It is not the absolute truth, but rather a provisional, working assumption. In a hypothesis test, we weigh the evidence against that hypothesis by assessing how consistent our data are with it.

The Null and Alternative Hypotheses

The null hypothesis (\(H_0\)) is the 'default' hypothesis that we work with until we have sufficient evidence against it. It is generally used as a hypothesis to indicate that ‘nothing’ is happening, i.e. there are no trends, no differences between groups, or the value of interest is equal to 0. Hence the word “null”.

In the example of the whales, we could state the null hypothesis as:

  • The proportion of male whales is 50% (\(H_0: p_M = 0.5\)) or

  • The proportion of male whales is equal to the proportion of female whales (\(H_0: p_M = p_F\))

The null hypothesis is what we are willing to assume is the case until proven otherwise. It is extremely conservative, denying any progress or change! However, this does not mean that we believe the null hypothesis is literally true; we are never seeking to claim that the null hypothesis has actually been proved. Instead, we are only ever looking for evidence against the null hypothesis.

In the words of Ronald Fisher, the father of modern statistics:

"the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis."

(R.A. Fisher (1935), The Design of Experiments. Oliver and Boyd, p.19).

There is a strong analogy with criminal trials in the English – and indeed many other – legal systems: a defendant can be found guilty, but nobody is ever found innocent, only not proven guilty. Similarly, we may find that we reject the null hypothesis, but if we don’t have enough evidence to do so, it doesn’t mean that we can accept it as the truth. It is simply a working assumption until something better comes along.

The Alternative hypothesis

The alternative hypothesis (\(H_1\)) is the statement that is tested against the null hypothesis. It claims that there is a real effect in the population, rather than one produced by chance. When we have enough evidence against the null hypothesis, we reject it and accept the alternative.

Often, the alternative hypothesis is the same as the research hypothesis. In other words, it is the claim that you expect or hope will be true.

The alternative hypothesis is the complement of the null hypothesis. Null and alternative hypotheses are exhaustive, meaning that together they cover every possible outcome. They are also mutually exclusive, meaning that only one can be true at a time.

Alternative hypotheses often include phrases such as “an effect”, “a difference”, or “a relationship”. When alternative hypotheses are written in mathematical terms, they always include an inequality (usually '≠', but sometimes '>' or '<'). As with the null hypotheses, there are many acceptable ways to phrase an alternative hypothesis.

In the example of the whales, we could state the null hypothesis as: the proportion of male whales is 50%, or the proportion of male whales is equal to the proportion of female whales. The alternative hypothesis, then, will be one of the following three options:

  • \(H_1: p_M \ne 0.5\) or \(H_1: p_M \ne p_F\)

  • \(H_1: p_M \gt 0.5\) or \(H_1: p_M \gt p_F\)

  • \(H_1: p_M \lt 0.5\) or \(H_1: p_M \lt p_F\)

We will have to choose one of the possible alternative hypotheses, according to what we would like to demonstrate.

The p-value

Let's focus on just the final year of our data, 1962, and take a look at the results:

Year   Sex     n    Percentage
1962   Male    409  56.8%
1962   Female  311  43.2%

We observed 720 whales in this year, of which 409 (56.8%) were male.

This feels like quite a large difference – but the crucial question is whether this difference is large enough to provide evidence against the null hypothesis, or whether we might expect to see such a difference purely by chance from time to time.

If our null hypothesis were true, then the probability of any randomly observed whale being male would be 50%, the same as obtaining heads when tossing a coin. However, across 720 coin flips, we wouldn’t expect to observe exactly 50% heads and 50% tails every time — sometimes there would be more, and sometimes less.

We could run a simulation of 10,000 experiments, each consisting of 720 coin flips, to get an idea of the distribution and how much variability around 50% we would expect to see occurring purely by chance.
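Below is a minimal sketch of such a simulation in R (the seed is arbitrary, and the exact counts in the output will vary with it):

set.seed(123)                      # arbitrary seed, for reproducibility
n_sims  <- 10000                   # number of simulated 'experiments'
n_flips <- 720                     # flips (whales observed) per experiment
# number of 'males' in each experiment, assuming P(male) = 0.5
males <- rbinom(n_sims, size = n_flips, prob = 0.5)
# in how many experiments do we see 409 or more males?
table(males >= 409)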

## 
## FALSE  TRUE 
##  9996     4

Out of the 10,000 simulations, assuming equal proportions of males and females, we saw an outcome of 409 or more males only 4 times.

So, is it possible that, if the true probability of male and female whales is equal (0.5), we could observe 409 males in a random sample of 720 whales? The answer is YES, it is possible, although very unlikely — in this simulation, it happened just 0.04% of the time.

Instead of running 10,000 simulations, we could have systematically worked through all the possible outcomes (every possible sequence of males and females in the sample) and calculated a theoretical probability under the null hypothesis. Each outcome would generate an observed proportion of male whales, and plotting these would produce a smoother distribution than the 10,000 simulations. Unfortunately, the number of possible sequences is so large that it cannot be enumerated with current means within a person's lifetime.

Fortunately, though, we don’t have to perform these calculations, since the probability distribution for the proportions under the null hypothesis is based on the binomial distribution. From this, we can determine the probability of a number of successes out of n trials for a given probability of success — 0.5 in our case.

The figure below shows the probability density function (A) and the cumulative probability distribution function (B). In both cases, the value 409 is highlighted with a vertical line to show how likely it is to obtain 409 successes or more in 720 trials, where the probability of success is 0.5. This probability is 0.00011, or 0.011%.
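Rather than reading the probability off the figure, we can obtain it from the binomial distribution in R:

# P(X >= 409) = 1 - P(X <= 408), with X ~ Binomial(720, 0.5)
pbinom(408, size = 720, prob = 0.5, lower.tail = FALSE)

This returns the small tail probability quoted above (of the order of 0.01%).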

This probability is the so-called p-value, one of the most prominent concepts in statistics as practised today. A formal definition: the p-value is the probability of obtaining a result at least as extreme as the one we observed, assuming the null hypothesis is true. As the p-value decreases, the strength of evidence against the null hypothesis increases: it starts to seem as though our observations are not compatible with the null hypothesis.

A less formal description that is often used: the p-value is the probability of being wrong when rejecting the null hypothesis to accept the alternative. As we will see at the end of this workbook, this shorthand should be treated with caution, since the p-value is computed assuming the null hypothesis is true.

One-sided or two-sided test

The p-value is the probability of obtaining a result at least as extreme as the observed result if the null hypothesis were really true. But what do we mean by extreme?

The p-value we obtained, 0.011%, is one-tailed, corresponding to what is known as a one-sided test: the test that uses the alternative hypothesis \(H_1: p_M \gt 0.5\). This is because we are only considering how likely it is that we would observe such an extreme value (56.8%) in the direction where males are more common than females.

However, if we had observed the reverse – and our data contained whales of which 56.8% were female – then this would also have led us to suspect that the null hypothesis does not hold. Therefore, we should also calculate the probability of observing 56.8% or more of either male or female whales. This is known as the two-tailed p-value, which corresponds to the two-sided test. This test uses the alternative hypothesis \(H_1: p_M \ne 0.5\). The two-tailed probability is double the one-tailed probability, 0.022% or 0.00022, since the binomial distribution with p = 0.5 is perfectly symmetrical, as we saw in the previous section.
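In R, given the symmetry, the two-tailed probability is simply twice the one-tailed one:

# two-tailed p-value: double the upper-tail binomial probability
2 * pbinom(408, size = 720, prob = 0.5, lower.tail = FALSE)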

Statistical significance

In our example of the proportion of male and female whales, we had the following:

  • Null hypothesis: the proportion of male whales is 50%.

  • Alternative hypothesis: the proportion of male whales is not equal to 50%.

  • We obtained a sample of n=720 whales.

  • We observed the sex and measured the proportion of male whales in the sample: 56.8%.

  • We obtained the p-value = 0.00022, which is the probability of obtaining a value at least as extreme as this if the null hypothesis is true. This is the two-sided test, the one that corresponds to the alternative hypothesis stated above.

While there is a very small chance that we could have obtained this percentage if the null hypothesis were really true, it seems very unlikely. So, the question now is whether we think the p-value is small enough for us to consider that the difference is statistically significant, and, therefore, the alternative hypothesis to be true.

In other words, we are asking whether the difference between our estimate and the population value under the null hypothesis (a proportion of males of 0.5) is real, i.e. not equal to zero.

It seems extremely plausible that the difference between the proportion of males observed in the sample and the null hypothesis value cannot be explained by random variation alone. It is therefore much more likely to have occurred because the population value is not equal to the null hypothesis value, and the sample data provide evidence for this.

Since we have incomplete information about the population (we only have the sample), we can always be wrong when making statements like the following: "the proportion of male whales is significantly different from 0.5". However, we control the risk of wrongly rejecting a true null hypothesis through the significance level we apply to the p-value.

Using the analogy of criminal trials in the English legal system, the sample data provide the evidence against the innocence of the defendant, and the p-value is the probability of being wrong if the defendant is declared guilty. Even if we had observed 720 male whales, there would still be a small (extremely small!) probability of that happening if the proportion of males were genuinely 50%.

So, we need to define a significance level, denoted by α, which is the critical threshold value for the p-value, below which we deem that we have sufficient evidence in our case against the null hypothesis.

Therefore:

  • if the p-value ≤ α, the result is statistically significant, and we reject the null hypothesis

  • if the p-value > α, the result is not statistically significant, and we cannot reject the null hypothesis

The most commonly used significance level (α) is 0.05 or 5%, but in some fields, a level of 0.1 (10%) or 0.01 (1%) is more conventional, and there is a strong movement in favour of setting the threshold at 0.005 (https://www.nature.com/articles/s41562-017-0189-z). Therefore, some hypothetical results may be significant at one threshold but not at another.

However, we must acknowledge that this threshold is arbitrary. For example, if we use α = 0.05, a p-value of 0.0499 is essentially no different from a p-value of 0.0501, but we would make a completely different decision if we strictly followed the significance level rule. P-values well above or well below α lead us to a straightforward decision, whereas p-values close to α usually lead to an inconclusive decision.

Hypothesis testing

There are a large number of different 'hypothesis tests' (see the section on Statistical tests), each of which allows us to address a slightly different set of hypotheses. However, the general process in all of them follows the same steps:

1. Set up a question in terms of a null hypothesis and an alternative hypothesis that we want to test. We usually use the notation H₀ for the null hypothesis and H₁ for the alternative hypothesis.

2. Choose a test statistic that estimates something which, if it turns out to be extreme enough, would lead us to doubt the null hypothesis. You will not actually have to calculate this statistic yourself, as the relevant test statistic has already been defined for each (standard) statistical test.

3. Collect your data and compute the test statistic.

4. Generate the sampling distribution of this test statistic, assuming the null hypothesis is true.

5. Obtain the p-value by comparing your test statistic with the sampling distribution. The p-value is the probability of observing such an extreme statistic if the null hypothesis is true. When setting up the alternative hypothesis we decided whether to use a one-sided or two-sided p-value. In most tests, we will use the two-sided p-value.

6. Declare the result significant if the p-value is below some critical threshold, α.

There is a good chance that if you have been taught statistics at school or undergraduate level, you may have had to calculate many of these steps by hand and been asked to consult long tables of numbers containing test statistics and critical thresholds.

Thankfully, this process has been obsolete for at least 50 years (even if it is still regularly taught to students!), as we now have computers to do it for us. The "only" things the researcher has to do are steps 1 and 6 above, those that require making decisions.

It is important to emphasise that the exact p-value is conditional not only on the truth of the null hypothesis, but also on all other assumptions underlying the statistical model, such as the absence of systematic bias, the independence of the observations, and the distribution of the variables. We’ll discuss some of these assumptions later.

Confidence Interval for hypothesis testing

Although the confidence interval is a tool used in estimation, we can also use it for hypothesis testing or to support it. Instead of generating the sampling distribution under the null hypothesis for the test statistic and comparing the test statistic with this distribution to obtain the p-value (steps 4 and 5 in the hypothesis testing process above), we can generate the confidence interval for the statistic, and observe whether the null hypothesis value lies within its limits or outside them.

If the null hypothesis value lies within the confidence interval limits, we cannot reject it and the results are not significant. Conversely, if the value lies outside the confidence interval limits, the result is significant and there is sufficient evidence to reject the null hypothesis.

Using a confidence level of 95% and a significance level of 5% will lead to the same results, because they are complementary. The same applies to a confidence level of 99% and a significance level of 1%. If a 99% confidence interval does not include the null hypothesis, then we can reject it with a p-value of less than 0.01.

The figure below shows the 95% confidence interval for the mean length of the whales in 1949. Two dotted lines show two null hypotheses we can test: 13.0 and 13.2. The value 13.0 (blue dotted line) is included within the limits of the 95% confidence interval, so we cannot reject this null hypothesis. However, 13.2 (red dotted line) lies outside the limits, so we would reject this null hypothesis, and the result is significant at the α = 0.05 level. You can check these results with the interactive 1-sample t-test in the next section.
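In R, this interval can be obtained from t.test() (a sketch, assuming the 1949 lengths are in Whale_49$length.m., the data frame used in the output below):

ci <- t.test(Whale_49$length.m., conf.level = 0.95)$conf.int
cat("95% Confidence Interval: [", ci[1], ",", ci[2], "]\n")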

## 95% Confidence Interval: [ 12.8789 , 13.13862 ]

Statistical tests

In this section, we are going to discuss some of the most commonly taught tests that could be used to answer questions about our data. We are not going to go into the mathematical details; we will focus on the type of test and things to consider when running it. There are plenty of resources on these tests that you can easily find (see the External Links and Resources section at the end of the workbook).

Before any formal hypothesis test, always explore your data and look at summary statistics and plots showing the relationship you are interested in.

Below, I'm going to describe in some detail the simplest test, the 1-sample t-test, and then provide a short list of tests used to test specific hypotheses.

1-sample t-test

The 1-sample t-test is used to compare the population mean of one numeric variable, \(\mu\), to a specific value: \[H_0: \mu = \mu_0\] where \(\mu_0\) is a numeric value.

The test statistic (this will be computed by the statistical software) is \[ t=\frac{\bar{x}- \mu_0}{\frac{s}{\sqrt{n}}}\] where

  • \(\bar{x}\) is the sample mean

  • \(\mu_0\) is the null hypothesis value to be tested

  • s is the sample standard deviation

  • n is the sample size.

This t statistic is compared with its sampling distribution under the null hypothesis (a Student's t with n-1 degrees of freedom), and a p-value is obtained. A decision is then made based on the p-value and the level of significance we wish to set. This, too, will be computed by the statistical software.

Let's look at our whales from a different year, this time 1949. The summary statistics are:

Year   n    Mean Length (m)  SD Length (m)
1949   193  13.01            0.91

And perhaps we want to test the null hypothesis that \(\mu_0\) = 13.0. Maybe we have reviewed some scientific literature explaining that this is a standard size for whales, and we are interested to know if our sample matches the value.
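Before asking software to do it, we can compute the test statistic by hand from the rounded summary values above (because of the rounding, the result differs slightly from the software output further below, which uses the unrounded mean):

xbar <- 13.01; s <- 0.91; n <- 193; mu0 <- 13.0   # rounded values from the table
t_stat <- (xbar - mu0) / (s / sqrt(n))
t_stat                                            # about 0.15
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
p_value                                           # about 0.88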

The figure below shows the distribution of the sample lengths, with the sample mean in red, and the null hypothesis value in blue.

Using any statistical software, we can then run this as a one-sample t-test.
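In R, this is a single call (it matches the output below):

t.test(Whale_49$length.m., mu = 13)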

## 
##  One Sample t-test
## 
## data:  Whale_49$length.m.
## t = 0.133, df = 192, p-value = 0.8943
## alternative hypothesis: true mean is not equal to 13
## 95 percent confidence interval:
##  12.87890 13.13862
## sample estimates:
## mean of x 
##  13.00876

Then, the t-statistic is t = 0.133, and the p-value = 0.894, so we cannot reject the null hypothesis based on this evidence.

In the tool below you can see what happens to the test, as we change our hypothetical null hypothesis value. The dotted red lines represent the chosen confidence interval around our sample mean.

You can test different null hypotheses and see how the results change.

Note that, although the null hypothesis may change, the limits of the confidence interval do not, unless you change the level of confidence.

The 1-sample t-test is based on some assumptions:

  • the sample mean, \(\bar{x}\), is normally distributed. For moderately large sample sizes this is usually the case (a common rule of thumb is n > 30, although fewer observations may suffice when the data are roughly symmetric), due to the Central Limit Theorem, which states that the sampling distribution of the mean will be approximately normal, as long as the sample size is large enough, regardless of the distribution of the data. As we will have just one sample mean, we will just have to assume that this is the case.

  • The individual observations are independent. This is a very strong assumption, and we will find it in most of the other 'simple' hypothesis tests. It essentially means that one measurement does not depend on any of the other measurements. When measuring the length (or observing the sex) of a whale, the previously measured whales provide no information about this one: if the previous whale was male, or unusually long, does that give any clue about the sex or length of the current whale? But think about the process of sampling whales across a whole year: there is a good chance we may end up observing the same whale on more than one occasion. If we were able to tag and track the whales, we would be confident this was not happening; whether that was possible in 1949 is perhaps more of a challenge!

Other examples of non-independence: the concentration of a pollutant downstream in a river is not independent of the concentration upstream; today's temperature is not independent of yesterday's; and samples obtained from a clustered process, where samples from the same cluster are not independent of each other.

2-sample t-test

To test whether two population means are the same, we use the same family of tests we used for one sample's mean: the t-test. Testing that two values (the two population means) are the same is equivalent to testing that their difference equals zero: \[H_0: \mu_1 - \mu_2 = 0\] where \(\mu_1\) and \(\mu_2\) are the two population means.

Our software of choice will do the hard work for us and we will obtain a p-value that will allow us to make a decision regarding the two population means.

We have to adapt the assumptions we needed in the 1-sample t-test, and add one (the equal-variance assumption):

  • the difference of sample means, \(\bar{x}_1-\bar{x}_2\), is normally distributed.

  • The two populations being compared have the same variance. We can test this: there are different tests, and we are going to use the F-test (see below), although sometimes a plot can be enough to judge whether this assumption holds.

  • The individual observations are independent, and the two samples are independent of each other.

Example: we want to test whether the mean length of male and female whales was the same in 1949. We set the null hypothesis as:

  • \(H_0: \mu_M - \mu_F = 0\)
Sex     n    Mean Length (m)  SD Length (m)
Male    135  12.90            0.78
Female  58   13.25            1.15

Using any statistical software, we can then run this as a two-sample t-test.
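In R, the formula interface does this in one call (it matches the output below; note that R's default is the Welch version of the test, which does not assume equal variances):

t.test(length.m. ~ sex, data = Whale_49)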

## 
##  Welch Two Sample t-test
## 
## data:  length.m. by sex
## t = -2.1203, df = 80.201, p-value = 0.03707
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -0.67766119 -0.02148313
## sample estimates:
##   mean in group Male mean in group Female 
##             12.90370             13.25328

The p-value = 0.03707. Therefore, we can reject the null hypothesis if we set the significance level at α = 0.05, but we could not if we set it at α = 0.01.

Comparing two variances: F-test

This is a test that we can perform along with the two-sample t-test, to make sure that the assumption of equal variances holds.

F-test to compare two variances:

  • \(H_0: \sigma_1^2 = \sigma_2^2\)

In the example we used for the two-sample t-test, the figure showing the histograms of the male and female sub-samples' lengths does not show any clear difference in the variability of the data for the two groups. Let's, however, perform the test:

  • \(H_0: \sigma_M^2 = \sigma_F^2\)
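In R, var.test() performs this F-test (it matches the output below):

var.test(length.m. ~ sex, data = Whale_49)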
## 
##  F test to compare two variances
## 
## data:  length.m. by sex
## F = 0.45578, num df = 134, denom df = 57, p-value = 0.0002297
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2872370 0.6950402
## sample estimates:
## ratio of variances 
##           0.455779

The F-test p-value = 0.0002297. We therefore reject the null hypothesis that the two variances are equal, so the assumption of equal variances does not hold. Note, however, that the test we ran above is the Welch two-sample t-test (R's default), which does not assume equal variances, so its result is not invalidated.

Non-parametric tests

We can test whether the centre of a population is a given value, or whether two distributions are equal, without any assumption about the distribution of the variable or its mean. For each parametric test such as the t-test (it is parametric because we make assumptions about the population distribution), there is a non-parametric test that we can use when we cannot (or do not want to) make any assumption about the population distribution. They can always be used, even when the parametric test would also be valid; however, they usually have less power, so their results are usually less conclusive.

The non-parametric equivalent of the t-test is the Wilcoxon test. It is based on the ranks of the observations instead of their actual values.
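In R, the calls mirror those of t.test() (a sketch, reusing the 1949 whale data from above):

# one-sample: is the centre of the length distribution equal to 13 m?
wilcox.test(Whale_49$length.m., mu = 13)
# two-sample: do males and females have the same length distribution?
wilcox.test(length.m. ~ sex, data = Whale_49)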

Comparing one proportion to a value or two proportions

The tests to compare proportions are based on the normal approximation to the binomial distribution, which is accurate for reasonably large samples, so there are no further distributional assumptions to be made.

Testing that one population proportion is equal to a value \(p_0\), means setting the null hypothesis: \[H_0: p = p_0\]

Testing that two population proportions are equal means setting the null hypothesis: \[H_0: p_A = p_B\] Example: is the proportion of males equal to 50%?

  • \(H_0: p_M = 0.5\)

Or the proportion of males and females is equal:

  • \(H_0: p_M = p_F\)
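In R, prop.test() covers both cases via the normal approximation (a sketch; the first call uses the 1962 counts from earlier, while the counts in the second call are purely illustrative):

# one proportion: H0: p_M = 0.5, with 409 males out of 720 whales
prop.test(409, 720, p = 0.5)
# two proportions: H0: p_A = p_B, given successes x out of n in each group
prop.test(x = c(45, 60), n = c(100, 120))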

Comparing the means of more than two groups: ANOVA and Kruskal-Wallis test

To test whether k > 2 population means are equal, we will use the ANOVA or the Kruskal-Wallis test. ANOVA is an acronym for ANalysis Of VAriance, because it compares the variability among the \(k\) sample means with the variability within the samples.

For the ANOVA, the null hypothesis to test is: \[H_0: \mu_A = \mu_B = \mu_C = ... = \mu_K \].

Our software of choice will do the work, and we will obtain a p-value that will allow us to make a decision regarding whether the k population means are all equal or at least one of them is different.

For the ANOVA, we will have the same assumptions that we had for the 2-sample t-test:

  • the sample means, \(\bar{x_i}\) are normally distributed.

  • The k populations being compared have the same variance. We can test this; however, we will usually check it graphically, and therefore we will not get a p-value out of it.

  • The individual observations are independent.

The non-parametric alternative is the Kruskal-Wallis test. We only need the assumption of independence to perform this test.

Both tests, ANOVA and Kruskal-Wallis, test the null hypothesis that all k population means are equal. When we reject the null hypothesis, we have evidence that not all the means are equal, but this can mean almost anything: k-1 means may be equal and one different from all the others, or all k means may differ from one another, or anything in between. To better understand what rejecting the null hypothesis means in an ANOVA or Kruskal-Wallis test, we use multiple comparisons: comparing each mean with all the others. The number of comparisons to perform grows quite quickly with k: if k=3 we need 3 pairwise comparisons, if k=4 we need 6, and if k=5, 10 comparisons are needed.

We usually set a significance level α as a threshold for the p-value, below which we reject the null hypothesis. For example, if we set α = 0.05, the probability of detecting a difference just by chance (incorrectly rejecting a true null hypothesis) is 5%. However, this is valid only when we perform one single test. When we compare groups multiple times, the probability of finding at least one difference just by chance increases with the number of comparisons: for k=4 groups (6 pairwise comparisons), the probability of finding at least one difference just by chance is 26.5%, rising to 40.1% for k=5 groups (10 comparisons). The multiple comparison problem also applies to confidence intervals: the probability that at least one 95% confidence interval does not contain the true population value, when obtaining 10 simultaneous 95% confidence intervals, is 40.1%!
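We can reproduce these numbers with one line of R: the probability of at least one false positive in m independent comparisons, each at α = 0.05, is 1 - 0.95^m:

m <- c(1, 6, 10)   # number of independent comparisons
1 - 0.95^m
## [1] 0.0500000 0.2649081 0.4012631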

Therefore, when we reject the null hypothesis that all k population means are equal, we cannot simply perform all the pairwise comparisons using a 2-sample t-test for each pair to understand which population mean differs from which. We need a method that allows the simultaneous pairwise comparison of all k population means, as sketched below.
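In R, one standard method is Tukey's 'Honest Significant Differences' applied after fitting an ANOVA. The sketch below assumes a hypothetical data frame df with a numeric column y and a grouping factor group:

fit <- aov(y ~ group, data = df)   # one-way ANOVA: H0: all group means equal
summary(fit)                       # overall F-test and p-value
TukeyHSD(fit)                      # simultaneous pairwise comparisons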

p-value and sample size

We have already discussed the relationship between the significance level, the p-value, and the confidence interval. In Module 2, we examined the relationship between the width of the confidence interval and the sample size, and we saw that the width of the confidence interval is a function of \(1/\sqrt{n}\), where n is the sample size.

Therefore, if there is a true difference (whether between two population means, between k population means, or between the population mean and a specific value)...

  • We are more likely to detect a difference by rejecting the null hypothesis as the sample size increases.

  • At the same time, for a given sample size, the smaller the difference we wish to detect, the more difficult it will be to reject the null hypothesis.

  • In addition, the variability of the data also plays a role in hypothesis testing: given a sample size and a difference we wish to detect, the greater the variability of the data, the more difficult it will be to reject the null hypothesis.

  • Finally, if all of the above are fixed, the minimum required sample size depends on the sampling scheme used. It will be smallest for a simple random sampling design and larger for other designs, with the extent of the increase depending on the specific design employed.

Therefore, we can determine our sample size as the smallest that will allow us to reject the null hypothesis, given the minimum difference we would like to detect. To do this, we need to define this minimum detectable difference and obtain an estimate of the population variance.
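In R, power.t.test() performs this calculation. A sketch with illustrative values: a minimum detectable difference of 0.35 m (roughly the male-female difference seen in 1949) and a standard deviation of 0.9 m (close to the 1949 value):

power.t.test(delta = 0.35, sd = 0.9, sig.level = 0.05, power = 0.8)
# reports the minimum n per group for a two-sample t-test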

We can also determine the minimum sample size required to detect a difference in proportions. The key difference is that the minimum sample size depends on the proportion \(p\), being maximal at 0.5 and decreasing as it approaches the limits of 0 or 1. Therefore, we need an initial estimate of the proportion \(p\) that we wish to estimate and/or test.
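The corresponding R function for two proportions is power.prop.test(). A sketch, using the observed 56.8% against 50% as illustrative values:

power.prop.test(p1 = 0.5, p2 = 0.568, sig.level = 0.05, power = 0.8)
# reports the minimum n per group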

Given all the assumptions that need to be made, determining the minimum sample size can be challenging. A common recommendation is to use the entire available budget for data collection.

What p-values can and cannot do

We have a coin that we assume is fair. We flip it several times and observe the result: number of heads in n flips.

The null hypothesis is that the coin is fair, \(H_0: p_H = 0.5\).

We flip the coin 10 times and we obtain 10 heads. This is our sample.
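In R, this is a one-sample proportion test; the call below is consistent with the output that follows:

prop.test(10, 10, p = 0.5)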

## 
##  1-sample proportions test with continuity correction
## 
## data:  10 out of 10, null probability 0.5
## X-squared = 8.1, df = 1, p-value = 0.004427
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.6554628 1.0000000
## sample estimates:
## p 
## 1

The p-value is small enough (0.004) for us to confidently reject the null hypothesis. Right? Well, maybe not!

Now, let’s carry out the experiment in a slightly different way: 1,024 people are given a fair coin, and they flip it. Those who get heads remain, and those who get tails leave the room. We expect about 50% to stay and 50% to leave – approximately 512 people. Then we repeat the process a second time, a third, a fourth, and so on, up to a tenth time. We expect just one person to still be in the room at the end of this process. If you repeat your experiment many times, you will probably end up rejecting your null hypothesis, even if it is true.
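We can quantify this in R. Each person has a probability of 0.5^10 = 1/1024 of getting ten heads in a row, so with 1,024 people we expect one 'significant' result even though every coin is fair:

1024 * 0.5^10             # expected number of people left: 1
1 - (1 - 0.5^10)^1024     # P(at least one person gets 10 heads): about 0.63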

As you can see from this example, the p-value can sometimes be misleading. And it's not just that its definition is a bit convoluted. Even if we accept the definition, if you repeat an experiment enough times, you can eventually get a p-value that appears to demonstrate what you wanted to demonstrate – purely by random chance.

There are criticisms regarding the use of p-values; there is an entire approach to statistics, the Bayesian approach, that outright rejects them. In 2016, the American Statistical Association managed to bring together a group of statisticians who agreed on six principles concerning p-values:

  1. P-values can indicate how compatible the data are with a specific statistical model

P-values do this by essentially measuring how surprising the data are, given the null hypothesis.

  2. P-values do not measure the probability that the null hypothesis is true

We have discussed this previously.

  3. Scientific conclusions and decisions should not be based only on whether a p-value passes a specific threshold

The significance thresholds are, as we have seen, arbitrary. A significant result does not necessarily mean that we have proven a discovery.

  4. Proper inference requires full reporting and transparency

In the example above, when 1,024 people each flipped a coin 10 times, we saw that if you repeat the experiment enough times, you will eventually obtain a significant result, even if the null hypothesis is true. We should report our sampling scheme, make data available, and describe the methods used so that others can replicate our experiment and confirm whether there is a real effect or whether our result was due to chance.

  5. A p-value does not measure the size of an effect or the importance of a result

A significant result tells us that the observed difference or effect is unlikely to be due to random chance, but it does not tell us anything about how large or small this difference (or effect) is.

  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

For example, a p-value close to 0.05 offers only weak evidence against the null hypothesis.

Exercises

In this set of exercises, there are a few questions to check that you have followed the key messages from this module, and that you can use them to answer questions about your data. We will discuss the questions in this week's webinar – please make a note of any major challenges you can identify in this process, either linked to the scenario described or ones you have faced, or anticipate facing in the future when conducting hypothesis testing processes. We can then talk through these problems during the session.

Question 1

The picture below comes from the paper 'Cross-country variation in people's connection to nature', Soga, Masashi et al., One Earth, Volume 8, Issue 2, 101194.

You can access the paper here: https://www.cell.com/one-earth/fulltext/S2590-3322(25)00020-X

Interpret the results of the table:

  1. What is the null hypothesis associated with each variable?

Answer:

Since the tests are about the effect of each variable, the null hypothesis is that there is no effect, i.e. that the effect = 0.

  2. What is the alternative hypothesis associated with each variable?

Answer:

The alternative hypothesis is that there is an effect, or that the effect is \(\ne\) 0. This is the usual alternative hypothesis.

  3. How do you interpret each of the p-values in the table?

Answer:

The first two variables have p-values that do not allow us to reject the null hypothesis. All the other p-values allow us to reject the null hypothesis, and what is shown is the 'maximum' level at which they are significant. For example, Gender is significant at the α = 0.05 level, but not at the α = 0.01 level, and Natural disaster risk is significant at α = 0.01 but not at α = 0.005.

  4. Can you provide the (approximate) 95% confidence intervals for the first (Biodiversity) and last (Gender) variables in the table? How do you relate the 95% confidence interval to the significance of the result?

Answer:

The 95% confidence interval is, approximately, the estimate ± 2×SE. For example, for Gender the (approximate) 95% confidence interval would be [0.01; 0.09], which does not include 0; this is consistent with the p-value being < 0.05.

  5. Is the estimate related to the p-value?

Answer:

Not directly: the p-value depends on how far from 0 the ratio \(\frac{|estimate|}{SE}\) is. For example, Education has an estimate that is smaller than that of Natural disaster risk, but its p-value is also smaller, because its standard error (SE) is smaller as well.

Question 2

We want to test the hypothesis that the mean length of humpback whales is equal for males and females. We have data collected in 1949 as a sample, which we can consider to be representative of the population of humpback whales.

  • \(H_0: \mu_M = \mu_F\)
Sex     n    Mean Length (m)  SD Length (m)
Male    135  12.90            0.78
Female  58   13.25            1.15

Choose the alternative hypothesis and see if/how the result of the test changes:

  • \(H_1: \mu_M \ne \mu_F\)
## 
##  Welch Two Sample t-test
## 
## data:  length.m. by sex
## t = -2.1203, df = 80.201, p-value = 0.03707
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -0.67766119 -0.02148313
## sample estimates:
##   mean in group Male mean in group Female 
##             12.90370             13.25328
  • \(H_1: \mu_M \lt \mu_F\)
## 
##  Welch Two Sample t-test
## 
## data:  length.m. by sex
## t = -2.1203, df = 80.201, p-value = 0.01854
## alternative hypothesis: true difference in means between group Male and group Female is less than 0
## 95 percent confidence interval:
##         -Inf -0.07521609
## sample estimates:
##   mean in group Male mean in group Female 
##             12.90370             13.25328
  • \(H_1: \mu_M \gt \mu_F\)
## 
##  Welch Two Sample t-test
## 
## data:  length.m. by sex
## t = -2.1203, df = 80.201, p-value = 0.9815
## alternative hypothesis: true difference in means between group Male and group Female is greater than 0
## 95 percent confidence interval:
##  -0.6239282        Inf
## sample estimates:
##   mean in group Male mean in group Female 
##             12.90370             13.25328
