Introduction
In this second workbook, we will be taking a closer look at sampling and estimation.
Sampling consists of selecting some part of a population to observe, so that one may estimate something about the whole population, infer and test hypotheses about the population characteristics, or fit models that explain or predict some population outcome. In this workbook we are going to discuss some general ideas about sampling and their relationship to estimation. In Workbook 3 we will discuss inference and hypothesis testing, and in Workbooks 4 and 5, model fitting.
As an example of sampling for estimation, in order to estimate the amount of lichen available as food for caribou, a biologist collects lichen from selected small plots within the study area. Based on the dry weight of the sample collected, the available amount of food for the whole region can be estimated. Similarly, to estimate the average weight of a bird species, some individuals are captured, weighed (and released), and the average weight is estimated based on these measurements.
Sampling is the process used to collect the data; estimation uses the sampled data to provide an estimate of an unknown population parameter.
In this session we will cover:
What sampling is and what the desirable properties of a good sampling scheme are
The definition and some examples of the two types of errors that we will find when using sampled data to estimate population characteristics
The basic sampling scheme, the Simple Random Sampling, the estimation of population characteristics under this scheme and the computation of Confidence Intervals
Finally, we will review some of the most common sampling strategies used in environmental sciences.
In the video I will walk through the process of sampling and the characteristics of the estimates we will obtain from it. All estimates should be accompanied by a measure of their precision (or uncertainty), and in the video I discuss three such measures.
In this workbook, all output produced comes from R, and all of the code used can be found on GitHub here: https://github.com/stats4sd/Stats_ReIntro_M2. However, the principles are identical regardless of which data analysis tool is being used. Whichever tool it is, the user will need to be familiar enough with it to generate any sort of summary and to compute precision measures such as a confidence interval!
Data used in this workbook
In this session we are going to use a public dataset produced by the Ocean Biodiversity Information System (OBIS). It contains the historic whale catch records from Australian whaling stations (1925-1978). The full dataset and other details can be found here: https://obis.org/dataset/43e4c2cc-5886-427f-a4a9-d47d2635d246. It contains 38,267 records, but we will use a subset: the data corresponding to the most commonly caught whale, the Humpback (Megaptera novaeangliae), which comprises 23,581 records.
We have also edited it to include only the columns we are interested in. Below, you can explore the data we are going to use in this workbook.
Population and Sample
In statistics, the term “population” has a slightly different meaning from the one given to it in ordinary speech. It does not necessarily refer only to people or animate creatures—such as the population of Britain, for instance, or the dog population of London. Statisticians also speak of a population of objects, events, procedures, or observations, including such things as the quantity of lead in urine, the amount of heavy metals in the liver tissue of wild salmon, or the area burned in wildfires. A population is thus an aggregate of creatures, things, cases, and so on.
We call Population the entire group of individuals to which the findings of the study are to be extrapolated. The individual members of the population whose characteristics are to be measured are the units of the population. The units can be essentially anything: people, individual birds of a given species, ground plots, plants, individuals of an animal species, households, hospitals, schools, or businesses. Examples of characteristics that we can measure in populations include the proportion of males/females, the proportion of a given subspecies (e.g., the proportion of Humpback whales among the population of whales), the average height or diameter of trees, or the average length of the population of Humpback whales.
The population must be fully defined so that those to be included and excluded are clearly spelt out (inclusion and exclusion criteria). There is, however, a trade-off: stricter conditions for inclusion may reduce variability in the population, but they also decrease the generalisability of the results. For example, we can define our population of whales as all whales in the world, or just Humpback whales, or female Humpback whales, or female Humpback whales along the shores of Australia, or … We can define a population as the trees of a given species in a single area of a forest, or in the whole forest, or across a region of a country, or in the whole country, or … The less strict the inclusion criteria, the larger the variability in the population; the stricter the criteria, the more homogeneous the population will be, but also the smaller the population to which we can generalise based on our sample.
Although researchers should clearly define the population they are dealing with, they may not be able to enumerate it exactly. For instance, in ordinary usage, the population of England denotes the number of people within England’s boundaries, perhaps as enumerated in a census. But people die and are born continuously; some people may leave the country, whereas others may come from abroad to live in it. Defining the population is not always straightforward.
Sample
Although we would like to know the value of certain characteristics of the population, we cannot measure all the units, due to time and budget constraints, but also because sometimes the process of measurement affects, or even kills or destroys, the unit. Think of the lichen example above: if we collected all the lichen in an area and dried it, we would have none left! Or if we wanted to study the lead content in the liver of salmon, we would have to kill all the salmon to obtain the results!
Therefore, we usually select a subset (the Sample) from the original population and observe/measure all the units in it, and use this to get an Estimate of the population value or answer questions about the population value. A well-chosen sample will contain most of the information about a particular population characteristic.
Sampling Units
With many populations, it is straightforward to identify the types of units to be sampled and to conceive a list or frame of the units in the population, regardless of the practical problems of obtaining the frame or observing the selected sample. A complete list of the units in the target population would provide an ideal frame from which the sample units could be selected. In our case, the Humpback whale dataset serves as this sampling frame, as it contains the list of all the whales we will consider in our study. In practice, it is often difficult to obtain a list that corresponds exactly to the population of interest.
With many other populations, it is not so clear what the units should be. For example, in a survey of a natural resource or agricultural crop in a region, the region may be divided into a set of geographic units (“plots” or “segments”) and a sample of units may be selected using a map. However, one is free to choose alternative sizes and shapes of units, and such choices may affect the cost of the survey and the precision of the estimates.
The image below is a close-up of the sample tube on the ground corresponding to the seventh sample collected by NASA’s Perseverance rover on Mars. Perseverance has been taking duplicate samples from each rock target the mission selects. The rover currently carries all 18 samples taken so far in its belly, including one atmospheric sample. A key objective of Perseverance’s mission on Mars is astrobiology, including the search for signs of ancient microbial life. The rover will characterise the planet’s geology and past climate, pave the way for human exploration of the Red Planet, and be the first mission to collect and cache Martian rock and regolith (broken rock and dust). https://science.nasa.gov/mission/mars-2020-perseverance/

Let’s try to identify the key concepts in this process:
The population is the area on Mars explored by the rover
The unit is the content of Martian rock and regolith in a single tube collected by the rover
The sample contains 18 units
Basic Ideas of Sampling and Estimation
The procedure by which the sample of units is selected from the population is called Sampling. Sample units are then observed and some characteristic is measured. The sample will consist of these measurements along with other information that may allow the identification of each unit.
We will, then, use this sample data to estimate some population characteristic. However, while the population characteristic remains fixed, its estimate depends on which sample is selected. Therefore, different samples from the same population will produce different estimates of the same population characteristic.
There are two desirable characteristics of the estimates that can be controlled when we design the sampling scheme:
Unbiasedness: the estimates target the true (but unknown) population value. An estimate may not have exactly the same value as the population value, but it will be close to it, or “around” it. This is achieved by obtaining representative samples, selecting the units in the sample randomly (or, at least, with some degree of randomness) using probabilistic sampling methods.
Precision: how close we expect our estimate to be to the true population value; the closer it is, the more precise the estimate. Precision is controlled by the sample size: the larger the sample size, the greater the precision of the estimate.
The picture below shows the two concepts of precision and bias.
Each dot represents an estimate obtained from one sample, and the bull’s eye the population characteristic of interest.
The top row shows biased results, as they are not aimed at the target (the bull’s eye)
The bottom row shows unbiased results, aiming at the right target
The left-hand column shows results with little precision: widely scattered, whether biased or not.
The right-hand column shows results with great precision: tightly clustered, whether biased (around the wrong point) or not (around the bull’s eye).
The problem, however, is that we usually do not know the actual value of the population characteristic, and that we have only one sample!
A quick and easy way to look at this is to obtain the mean of samples produced by rolling a dice n times. Let’s assume the dice is perfectly balanced; if it weren’t, we would obtain biased estimates. When the dice is perfectly balanced, we know what to expect:
the proportion of 1s, 2s, 3s,… should be approximately the same, and approximately 1/6 or 16.67%
the average of the result of rolling N dice will be close to 3.5
Once you select below the number of dice, N, you will obtain a new random sample. We can use this sample to estimate the probability of obtaining a result (for example, a “six”) and the population average. We will use the sample mean as the estimate of the population mean, and the proportion of “sixes” in the sample as the estimate of the probability of obtaining a “six”. By generating repeated samples, you will observe that their averages are usually not equal to the expected mean (3.5): sometimes greater than 3.5, sometimes smaller, and occasionally exactly 3.5. This highlights the concept of an unbiased estimate, as it targets (with some error) the true population value of 3.5, and the concept of precision, as the estimate will rarely match the actual population value, allowing us to observe how far it is from it.
Now, let’s change the sample size, for example n=100 or n=1000. You will observe how the sample mean still varies each time, but deviates less from the true value of 3.5. The same happens with the proportion of “sixes”, the larger the sample size, the closer it is to 16.67%.
The beauty of using simulations is that you already know what you should obtain, and therefore you can compare the results of your simulation with the expected result.
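A simulation along these lines takes only a few lines of code. The workbook’s own code is in R; the sketch below is an equivalent in Python, using the standard library only:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def roll_dice(n, faces=(1, 2, 3, 4, 5, 6)):
    """Roll a fair (or modified) dice n times and return the list of results."""
    return [random.choice(faces) for _ in range(n)]

# As n grows, the sample mean settles near 3.5 and the
# proportion of sixes near 1/6 (16.67%)
for n in (10, 100, 1000, 100000):
    sample = roll_dice(n)
    mean = sum(sample) / n
    prop_six = sample.count(6) / n
    print(f"n={n:6d}  mean={mean:.3f}  proportion of sixes={prop_six:.3f}")
```

Running it a few times with different seeds reproduces the behaviour described above: small samples wander around 3.5, large samples hug it.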
What happens if you roll an unfair dice, one that has two sixes, so that one of the values one to five is no longer on the dice? If the population average result is 3.5 for a fair dice, it will be different for this unfair dice. The proportion of sixes with a fair dice is 1/6 = 16.67%; this will change with the unfair dice. Roll it a few times and see what happens. Can you guess which value was removed to introduce the second six? You can also observe how the precision increases as the sample size increases.
You may have observed that the mean for the unfair dice tends towards 3.8 as the number of rolls increases. This tells us that the number removed from the dice to include the second six was the four, just in case you plan on using this tool to help you cheat at dice games!
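You can check the 3.8 figure directly: if the four is replaced by a second six, the faces become {1, 2, 3, 5, 6, 6}, whose average is 23/6 ≈ 3.83. A small Python sketch (illustrative; the workbook’s own code is in R):

```python
import random

random.seed(1)
unfair_faces = (1, 2, 3, 5, 6, 6)  # the four replaced by a second six

# Theoretical mean of the unfair dice: 23/6
expected_mean = sum(unfair_faces) / len(unfair_faces)
print(f"expected mean = {expected_mean:.3f}")

# A large simulated sample lands close to that value,
# and the proportion of sixes is now close to 2/6
rolls = [random.choice(unfair_faces) for _ in range(100000)]
print(f"simulated mean = {sum(rolls) / len(rolls):.3f}")
print(f"proportion of sixes = {rolls.count(6) / len(rolls):.3f}")
```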
Sampling and non-sampling errors
One sample is selected at random, but other potential samples could have been selected, which might have produced different results. The difference between a statistic derived from a sample and the (unknown) population value is caused by what are known as the sampling and non-sampling errors. These errors are inevitable and, despite the name “error”, they do not necessarily mean that we have done something wrong or made a mistake.
The sampling error comes from the fact that only a part of the population is included in the sample, and therefore the estimate will (most likely) differ from the population value. The sampling error affects the precision of the estimate. We will always have sampling error; it is inevitable, but we know how to handle it. See the Estimation section below to understand what this means.
The non-sampling errors come from all other fronts:
the sample is not representative, for example, in a fish survey some selected sites may not be observed due to rough weather conditions; sites farthest from shore are most likely to experience such weather problems,
not measuring all the sample units, for example a non-response in a survey or an animal or plant that dies.
measuring the characteristic with measurement or recording error. For example, in the Humpback dataset, there are 31 whales with length = 0, or misrecording a whale with actual length = 12.79m as 17.29m.
when dealing with humans, response bias is a source of non-sampling error. It happens when conditions or factors cue respondents to provide inaccurate or false answers during surveys or interviews. For example, if a professor asks their students whether they have ever cheated on exams in their class, the students are unlikely to respond truthfully; instead, their responses are likely to be biased toward what they think is expected of them.
The non-sampling error affects the accuracy of the estimates, and is the cause of bias.
Non-sampling errors are unavoidable, but they can be reduced by being careful during design and data collection stages:
At the planning stage, all potential non-sampling errors are listed and steps taken to minimise them are considered. You should establish a sensible data quality assurance system.
Critically review the data collected and attempt to resolve queries as soon as they arise.
Document sources of non-sampling errors so that the results presented can be interpreted meaningfully.
If data are collected by others, ensure you understand the data collection process and the quality assurance system that was put in place before you decide to use the data.
Designing a sample using a scientific approach can help to minimise sampling error and create estimates that are precise and unbiased.
Probabilistic and non-probabilistic sampling
Sampling methods can be categorised into two very broad classes on the basis of how the sample was selected: probabilistic and non-probabilistic sampling.
In a probabilistic sampling scheme, every element in the population has a known, non-zero probability of being included in the sample. A non-probabilistic sampling plan doesn’t have this feature.
In probabilistic sampling, because every element has a known chance of being selected, we can obtain unbiased estimates of the population characteristics of interest, and we can also know their precision.
Non-probabilistic sampling, on the other hand, does not have this feature, and we don’t have a firm method of assessing either the reliability or the validity of the resulting estimates.
In the coming sections we are going to describe some of the most popular sampling methods in environmental sciences.
Sampling frame
In probabilistic sampling, the probability of any element appearing in the sample must be known. To accomplish this we need some list of all the population units from which the sample can be selected. This list is called a Sampling Frame and should have the property that every element in the population has some chance of being selected in the sample by whatever method is used to select elements from the sampling frame.
Ideally, a sampling frame should be a complete list of all members of the target population. In reality, however, this is rarely the case in the environmental sciences. In some situations it is impossible to devise a sampling frame for the study population. Suppose we are interested in the birds that cross an area (say a valley) during their spring migration; in this case, we won’t be able to finish the list until the seasonal migration is complete. But even when a sampling frame is not available, random sampling can sometimes be applied, for example using systematic sampling. This will be discussed below in the Sampling strategies in environmental sciences section.
Basic Sampling: Simple Random Sampling
In the previous sections we have seen what happens when we take several samples from the same population. However, in a research study we can usually get just one sample, so we have to make the most of it.
Simple Random Sampling (SRS) is the probabilistic sampling design in which every unit in the population has exactly the same probability of being selected. It is the most efficient sampling design, as it provides the most precise estimates of the population characteristic for a given sample size, and these estimates are unbiased.
The winning numbers of the lottery, the result of rolling a dice or tossing a coin are examples of simple random sampling, as we know beforehand that all the possible outcomes have exactly the same probability. We can extract a Simple Random Sample of whales from our dataset by allocating the same probability of being selected to each whale.
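Drawing a simple random sample from a frame is a one-liner in most tools. As an illustrative Python sketch (the frame of whale identifiers below is hypothetical, standing in for the records of the dataset):

```python
import random

random.seed(7)

# Hypothetical sampling frame: one identifier per whale record
frame = [f"whale_{i:05d}" for i in range(1, 23582)]  # 23,581 records

# SRS without replacement: every unit has the same inclusion probability
sample = random.sample(frame, k=400)

print(len(sample))   # 400 distinct units
print(sample[:3])    # first few selected units
```

In R the equivalent would be a single call to `sample()` on the data frame’s row indices.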
Unfortunately, it is not always possible to use this sampling plan. We will discuss later, in the Sampling strategies in environmental sciences section, other options when SRS is not possible or economically inefficient.
Estimation
In a Simple Random Sampling scheme, the best estimate for the
population mean is the sample mean
population variance is the sample variance
population proportion (or probability) is the sample proportion.
The estimate is best because it is unbiased and with maximum precision.
These estimates provide a single value for the population characteristic; they are known as point estimates. If the SRS has been conducted properly, we can expect no biases. However, as we saw previously in the samples obtained by rolling a dice several times, they are not (in general) equal to the population value. Sometimes they will be greater than it, sometimes smaller, and in very few cases, just by random chance, exactly equal to it.
We, therefore, would like to know how precise our estimate is, how close it is to the population value. So we add to the point estimate a measure of its precision. There are different ways of computing this degree of precision (or uncertainty):
standard error (usually used for means or variances)
margin of error (usually used for proportions)
confidence interval (used for means, variances and proportions).
All of them are related, so if you know one of them you can easily obtain the other two. For example, a possible margin of error is calculated as: Margin of error = 2 × standard error. We will discuss the confidence interval in the next sub-section.
Note that we use the term error to measure the degree of precision/uncertainty in our estimates.
If we obtain k samples of n units each from the same population, and compute the mean of each sample, we will have k estimates of the population mean (and the same applies to the variance or the proportion, when applicable), as their means will vary from one sample to another. Like the series of observations in each sample, the sample of k sample means has a standard deviation. The standard error of the mean of one sample is an estimate of the standard deviation that would be obtained from the means of a large number of samples drawn from that population.
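This statement can be checked by simulation: the standard deviation of many sample means is close to the single-sample formula s/sqrt(n). A sketch using dice rolls as the population (illustrative Python; the workbook’s code is in R):

```python
import random
import statistics

random.seed(3)

n, k = 50, 2000  # sample size, and number of repeated samples

# Draw k samples of size n and record each sample mean
means = []
for _ in range(k):
    sample = [random.randint(1, 6) for _ in range(n)]
    means.append(statistics.mean(sample))

# Empirical SD of the k sample means...
sd_of_means = statistics.stdev(means)

# ...versus the standard error estimated from one single sample
one_sample = [random.randint(1, 6) for _ in range(n)]
se_from_one = statistics.stdev(one_sample) / n ** 0.5

print(f"SD of {k} sample means: {sd_of_means:.3f}")
print(f"SE from a single sample: {se_from_one:.3f}")
```

Both numbers land near the theoretical value 1.708/sqrt(50) ≈ 0.24, illustrating that one sample is enough to estimate the standard error.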
One of the interesting characteristics of the three precision/uncertainty measures above is that they depend on the sample size n. Specifically, they are a function of 1/sqrt(n). Therefore, the larger the sample size, the smaller the uncertainty of the estimates. However, while the gain in precision is large when we increase a small sample by one unit, this gain decreases as the sample size grows.
The figure below shows how the standard error (SE) of the mean varies with the sample size for the sample of results obtained by rolling a dice n times. For very small sample sizes, an increment of the sample size implies a sizable decrease in the SE of the mean, whereas, as the sample size grows, the decrease in the SE of the mean for one extra element in the sample tends to zero.
Since the precision of our estimate depends on the sample size, it can be established at the design stage, before we collect the data, provided that we have some prior information. We will need an estimate of the population
variance/standard deviation to obtain the SE of the mean: SE=std(y)/sqrt(n)
proportion to obtain the SE of a proportion SE=sqrt(p(1-p)/n)
from previous data or knowledge.
It is important to realise that we do not have to take repeated samples in order to estimate the standard error; there is sufficient information within a single sample.
An interesting result of this is that the sample size depends on the precision you would like to achieve (or that the precision of your estimates depends on the sample size), but it does not depend on the population size, provided that your population is large enough. The sample size is a figure, not a percentage!! For small populations (and ‘small’ here means up to a few hundred) a correction applies, so to achieve a certain precision we won’t need sample sizes greater than the population size!
Confidence Interval
A confidence interval (CI) gives an indication of the degree of uncertainty of an estimate and helps to decide how precise a sample estimate is. It specifies a range of values likely to contain the unknown population value. These values are defined by lower and upper limits, and the probability that the Confidence Interval includes the true population value is given by the confidence level.
The width of the interval depends on the actual precision of the estimate and the confidence level used. A greater standard error will result in a wider interval; and the wider the interval, the less precise the estimate will be.
A 95% confidence level is by far the most commonly used. However, it is possible to calculate the CI at any given level of confidence, such as 90% or 99%. But why 95%? It was set by the father of modern statistics himself, Sir Ronald Fisher. In 1925, Fisher picked 95% because the distance from the centre of the Confidence Interval and the upper or lower limit is almost 2 times (1.96 times, to be precise) the standard error of the estimate. This threshold has since persisted for almost a century. It is a good compromise between being almost sure (95%) that it includes the population value, and not being so extremely wide that it wouldn’t be useful for estimation purposes.
For a confidence level of 95%, if we drew 20 random samples and calculated a 95% confidence interval for the mean for each sample using the data in that sample, we would expect that, on average, 19 out of the 20 (95%) resulting confidence intervals would contain the true (and unknown) population value, and 1 in 20 (5%) would not. So, most of the time, but not always. If we increased the confidence level to 99%, wider intervals would be obtained to increase the probability of including the true value, and only one out of 100 times would we get a confidence interval that does not include it.
To calculate confidence intervals around an estimate, we use the standard error (or the margin of error) for that estimate. The estimate and its 95% confidence interval are presented as: the estimate plus or minus the margin of error. The lower and upper 95% confidence limits are given by the sample estimate plus or minus 2 standard errors.
For the purposes of this demonstration, we are going to use two
examples from the Humpback Whale data set. The number of Humpback whales
in the data set is 23,581, large enough to assume that the
characteristics of interest
in the sample of Humpback whales (length and sex) are essentially the
same as those in the population.
Let’s have a look at the data
The two variables we are interested in are the sex of the whale and its length. The table below shows the number and percentage of whales of each sex.
sex | Number | Percentage |
---|---|---|
Male | 14055 | 59.7% |
Female | 9469 | 40.2% |
Hermaphrodite | 1 | 0.0% |
Unknown | 25 | 0.1% |
There are roughly 60% male and 40% female whales. Does this result show that these are, approximately, the true proportions of each sex? Or does it show that it is easier to catch male than female whales? Or does it show that the whalers were more interested in male than in female whales? We don’t know, but for the purpose of this exercise, we will assume that these are the true proportions of male and female whales: 59.7% are male and 40.2% are female. Note also that for 25 whales out of 23,581 (0.1%) the sex could not be identified, and 1 whale was hermaphrodite!
The table below summarises the length of the whales
Statistic | Value |
---|---|
Number of whales | 23,550.00 |
Mean length | 13.30 |
Median length | 13.10 |
Minimum | 5.80 |
Maximum | 19.50 |
Range | 13.80 |
Standard Deviation | 1.40 |
Variance | 2.00 |
Take a look at the graphical display of the length in the interactive plot below
Let’s assume that the 23,581 whales reported in the dataset are the population of whales on the Australian shores, and let’s see what happens if we take samples from it and observe how the sample mean of the length and the proportion of male/female behave as we obtain a new sample. The sampling method used is Simple Random Sampling, each unit has exactly the same probability of being selected.
You can also change the sample size. By increasing it, the sample estimates will be closer and closer to the actual population value. The 95% confidence interval changes the centre (because it is the sample mean), but the width change is very small.
You can also change the confidence level. If you increase it from 95% to, say, 99%, you will observe how the confidence interval gets wider, and if you set the confidence level at 90%, the confidence interval will get narrower.
Now, let’s observe what happens with the 95% Confidence intervals. You can generate samples of size n from the population of whales using a Simple Random Sampling process. For each sample, the 95% Confidence interval will be generated and, until you reset, the confidence intervals are accumulated in the plot.
As the sample size grows, again, the 95% confidence interval will get narrower. For example, if you generate a sample of size n=50 and generate the 95% Confidence interval and then generate a new sample of size n=4*50=200, the width of the new sample’s 95% Confidence interval will be a half of the first one, as the width of the Confidence interval is proportional to 1/sqrt(n).
95% of the 95% Confidence intervals, or 19 out of 20, should include the population mean, that is represented by the horizontal line in the plot. If you keep generating samples and their corresponding 95% CI long enough, you will find that one Confidence Interval does not include the population mean. To identify it easily it is red instead of the usual black. This is a probabilistic statement, so it may happen in your first sample or may not happen until you have generated 30 samples, but in the long run, the proportion of 95% confidence intervals that do not include the population mean will be around 5% or one out of 20. And this will happen no matter how large your sample size is. Sample size affects the precision of the estimate, the width of the 95% confidence interval, but, by its definition, the 95% Confidence interval will contain the true population value for 95% of the random samples.
Let’s suppose we selected a sample of n=400, the mean length is 13.39m, and the 95% Confidence interval [13.257;13.534]. The true average length value, that will usually be unknown, is 13.30m, which is included in the confidence interval. However, there are lots of other values, infinite to be precise, that are included in the confidence interval. According to our sample and the confidence interval we obtained from it, the mean population length of the Humpback whales could be 13.27m, or 13.41m, or 13.50m or any other value within the interval limits. And there is still a small chance, 5%, that the true population mean length is outside the limits, either way. What if we generated the 99% confidence interval instead from the same sample? the 99% CI would be [13.213;13.578]. Now, there is just 1% probability that the true value lies beyond the CI limits, but this comes at a cost: the 99% CI is wider than the 95% CI.
The confidence interval and the confidence levels are both related to the p-value that we are going to present in Workbook 3.
Bootstrap Confidence Interval
The computation of the 95% Confidence Interval as we have seen it comes from the fact that the mean of a sufficiently large sample size (n larger than 5 could be enough) is normally distributed, according to the Central Limit Theorem. The previous statement is also valid for proportions, although we would need larger sample sizes.
There is a range of methods that do not require normality, and can work in any circumstances. These methods are based on resampling: the creation of new samples based on one observed sample. One of such methods is Bootstrapping.
The bootstrap creates multiple resamples (with replacement) from a single set of observations, and obtains the (bootstrap) sampling distribution of the characteristic of interest in the population (mean, proportion, …). It is, somehow, the equivalent of taking many different samples and observe how the characteristic of interest is distributed. Hence the name of bootstrap, the magical idea of pulling oneself up by one’s own bootstrap.
Let’s go back to the whale dataset. We have a sample of 400 whales obtained by Simple Random Sampling. We can, then, resample n=200 units from it with replacement, meaning that we can have the same whale more than once in the first resample. We than compute the estimate of interest: the mean, or the proportion of males. We, then, repeat the process a second time and obtain a second estimate, and then a third, and so on, up to a large number of resamples, for example 5000. These 5000 resample means (or proportions) can be plotted in a histogram, and we can also get the two values that include the central 95% of the resample estimates, leaving the 2.5% most extreme in both sides out. This is the 95% bootstrap Confidence Interval.
The figure below shows the 95% Bootstrap Confidence Interval for the whale length population mean.
The 95% Bootstrap CI computed with the original sample of n=400 is [13.201; 13.587], while the 95% CI computed using the normal distribution is [13.257; 13.534], and the original sample mean is 13.396. We could also compute the 99% bootstrap CI by leaving out the most extreme 0.5% of resample estimates on each side: [13.213; 13.578], or a Confidence Interval for any other confidence level.
Both 95% CIs, the classical one that uses the normal distribution and the bootstrap CI:
• contain the true population value 13.289;
• were obtained from the same original sample of n=400.
The advantage of using a Bootstrap Confidence Interval is that there is no need to assume that our observations, or the underlying population, are normally distributed. It can always be used, even when the data are normally distributed.
Sampling strategies in environmental sciences
In this section, we are going to explore some of the common sampling methods in environmental sciences. The following video will also take you through some of the most relevant features of the methods described below.
Stratified Sampling
Stratified Sampling is a probabilistic sampling method that involves dividing a population into smaller homogeneous subgroups, known as strata, prior to sampling. In stratified random sampling, or stratification, the strata are formed based on shared attributes or characteristics such as soil characteristics, slope, or the age or sex of individuals. Simple random samples are then taken within each stratum. It is sometimes also called proportional random sampling.
Strata should be defined carefully, based on how the attribute you are estimating responds to habitat characteristics that are unlikely to change over the course of the study.
• If there is no obvious reason for differences across your area under investigation, you are probably better off using a simple random sampling procedure.
• Sampling units do not have to be allocated in equal numbers to each stratum. Some options for allocating sub-sample sizes to each stratum are:
equally to each stratum;
in proportion to the size of each stratum;
in proportion to the number of target individuals in each stratum;
in proportion to the amount of variability in each stratum;
and the sampling method used to obtain the subsamples within each stratum can be any that is considered suitable.
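The allocation options above are easy to compute once stratum sizes (and, for the variability option, within-stratum standard deviations) are known. A minimal Python sketch with hypothetical strata (the names, sizes and standard deviations are made up for illustration; the variability option is the classical Neyman allocation):

```python
# Hypothetical strata: (name, stratum size N_h, within-stratum sd S_h)
strata = [("lowland", 5000, 2.0), ("upland", 3000, 4.0), ("riparian", 2000, 1.0)]
n_total = 200  # overall sample size to allocate

N = sum(Nh for _, Nh, _ in strata)

# Option 1: equally to each stratum
equal = {name: n_total // len(strata) for name, _, _ in strata}

# Option 2: in proportion to the size of each stratum
proportional = {name: round(n_total * Nh / N) for name, Nh, _ in strata}

# Option 4: in proportion to variability (Neyman): n_h proportional to N_h * S_h
weight = sum(Nh * Sh for _, Nh, Sh in strata)
neyman = {name: round(n_total * Nh * Sh / weight) for name, Nh, Sh in strata}
```

Note how the more variable "upland" stratum receives a larger share under the variability-based allocation than under the proportional one.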
An example of how we could do the sampling in a fish survey: in location A, there are good conditions and a high density of fish; location B, however, has less favourable conditions and consequently fewer fish. Since we would like enough fish from each site, we may choose to stratify and over-sample location B, to better understand this environmental gradient.
The method is useful when the attribute of interest responds very differently to some clearly defined habitat structure, or when we want to make sure that some areas or characteristics, such as sex or age, are included in the sample. The recommendations within each stratum are the same as for simple random sampling.
Pros of stratified sampling:
It provides a systematic way of obtaining a sample that takes into account the make-up of the population, which leads to stronger research results.
The method is fair for population units as the sample from each stratum can be randomly selected, meaning there is no bias in the process.
As the strata must be exhaustive and mutually exclusive, stratified random sampling removes the chance of overlap between strata.
Lastly, it helps with efficient and accurate data collection. Having a smaller, more relevant sample to work with means a more manageable and affordable research project.
And the Cons:
Researchers may hold prior knowledge of the population’s shared characteristics beforehand, which increases the risk for selection bias when strata are defined.
There is more administration to do to conduct this process, so researchers must include this extra time and order.
When sub-sample sizes are not proportional to stratum sizes, the resulting sample may not be representative of the full population. This can be addressed by the use of sampling weights during the analysis, a further complication.
Once you have the final sample, data analysis becomes more complicated, as it must take the strata into account.
For a fixed total sample size, disproportionately allocated stratified designs can yield overall estimates with lower precision than SRS.
The analysis of data sampled using stratified sampling requires applying weights when estimating something across the whole population. Going back to the fish survey: if we select equal subsamples at the two sites, the probability of selecting a fish from the over-sampled site B will be higher than the probability of selecting a fish from the under-sampled site A.
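The weighting can be illustrated with a small Python sketch of the fish survey (the site sizes and measurements below are invented for illustration): each stratum mean is weighted by the stratum's share of the population, so the over-sampled small site does not dominate the estimate.

```python
# Hypothetical fish survey: equal subsamples from two sites of very
# different size, so a stratum-weighted mean is needed.
sites = {
    "A": {"N": 8000, "sample": [31.0, 29.5, 33.2, 30.8]},  # large site
    "B": {"N": 2000, "sample": [22.1, 24.0, 23.5, 22.9]},  # small, over-sampled
}

N_total = sum(s["N"] for s in sites.values())

# Stratified estimator: weight each stratum mean by N_h / N
weighted_mean = sum(
    (s["N"] / N_total) * (sum(s["sample"]) / len(s["sample"]))
    for s in sites.values()
)

# Naive pooling of all measurements over-represents site B
pooled = [x for s in sites.values() for x in s["sample"]]
unweighted_mean = sum(pooled) / len(pooled)
```

With these made-up numbers the weighted mean sits closer to the large site's mean, while the naive pooled mean is pulled down by the over-sampled small site.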
Cluster sampling
Cluster sampling is another probabilistic sampling technique. It involves dividing a population into clusters or groups, selecting a sample of clusters, and then sampling individuals or units within those clusters. The primary purpose of cluster sampling is to simplify the sampling process while still ensuring a representative sample of the population.
Clusters are formed based on specific characteristics relevant to the research study, such as geographical location, administrative boundaries, or organisational structure. Clusters should be randomly selected from the population to ensure unbiased representation. Randomisation helps minimise selection bias and ensures that every cluster has an equal chance of being included in the sample. Ideally, each cluster should be internally heterogeneous, resembling the population as a whole in the relevant characteristics; this improves sampling efficiency and enhances the representativeness of the sample. Clusters should be independent of each other, to avoid duplication in sampling and to ensure that each cluster contributes unique information to the sample. Overlapping or dependent clusters can lead to biased estimates and undermine the validity of the findings.
The sampled clusters become the Primary Sampling Units (PSU) from which the individuals or units will be selected, becoming the Secondary Sampling Units (SSU), using any of the methods described.
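A two-stage version of this PSU/SSU selection can be sketched in a few lines of Python, using simple random sampling at both stages (the cluster and unit labels below are hypothetical):

```python
import random

random.seed(1)

# Hypothetical frame: 20 clusters (PSUs), each containing 30 unit labels (SSUs)
clusters = {f"cluster_{i}": [f"unit_{i}_{j}" for j in range(30)]
            for i in range(20)}

def two_stage_sample(clusters, n_psu, n_ssu):
    """Stage 1: simple random sample of clusters (PSUs).
    Stage 2: simple random sample of units (SSUs) within each selected PSU."""
    psus = random.sample(sorted(clusters), n_psu)
    return {psu: random.sample(clusters[psu], n_ssu) for psu in psus}

selected = two_stage_sample(clusters, n_psu=5, n_ssu=10)
```

In practice the second stage could use any suitable method, as the text notes; SRS is just the simplest choice for the sketch.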
Cluster Sampling Advantages:
Efficiency: Cluster sampling is often more efficient than other sampling methods, especially when the population is large or geographically dispersed.
Cost-Effectiveness: By sampling entire clusters instead of individual units, you can reduce costs associated with recruitment, data collection, and analysis.
Logistical Feasibility: Cluster sampling simplifies the sampling process, making it easier to manage and execute, particularly in field-based or community-based research.
Cluster Sampling Disadvantages
Increased Variability: Due to the clustering of individuals within clusters, there is a risk of increased variability in the sample estimates compared to simple random sampling. For a given sample size it yields estimates that are less precise than those obtained through Simple Random Sampling.
Complex Analysis: Analyzing cluster sampling data requires specialised statistical techniques to account for the clustered nature of the sample, which can be more complicated than analysing data from simple random samples.
Systematic sampling
Systematic sampling is yet another probability sampling method, in which researchers select members of the population at a regular interval k determined in advance. If the population order is random or random-like, this method will give you a representative sample that can be used to draw conclusions about the population.
The pros of systematic sampling are that it is easy to use and low risk; while it retains an element of randomness, it also introduces some control and process into the selection; and it is resistant to bias, since after the initial starting point is selected, researchers have little control over who or what gets selected. The cons relate to the fact that systematic sampling removes some of the unpredictability from a sample, meaning that a researcher could potentially manipulate the results by choosing a starting point that favours their preferred outcome.
With systematic sampling we still wish to sample a fraction of 1 in k, so if the population size N is unknown, it must be estimated in order to choose k. In this case, the initial study unit is randomly selected from the first k units that become available; after this, every kth consecutive unit is chosen. For example, we might sample the 3rd person entering a healthcare clinic on a certain day, and every 10th person thereafter. The sampling frame is compiled as the study progresses.
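The selection rule is short enough to sketch directly: pick a random start among the first k units, then take every kth unit thereafter (the population here is just a hypothetical numbered frame):

```python
import random

random.seed(7)

def systematic_sample(population, k):
    """Random start among the first k units, then every kth unit thereafter."""
    start = random.randrange(k)
    return population[start::k]

population = list(range(1, 1001))  # hypothetical frame of 1000 units
chosen = systematic_sample(population, k=10)
```

With N = 1000 and k = 10 this yields 100 units, exactly the 1-in-k fraction.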
Transect sampling
Transect sampling is a method used to study the distribution and abundance of organisms along a line or pathway. The transect is simply a line that spans the gradient of interest, and then you locate your sample plots along this line (see the figure below). The length of your transect would be determined by the gradient you are sampling. For example, research has shown that the “edge effects” of a forest can be seen up to 100 metres into the forest, so to examine this gradient you would want to use a transect at least 100 metres long.
Sample plots can be located along the transect either in a uniform manner (i.e. every 25 metres) or by “randomly stratifying” them. Random stratification adds an element of randomness to plot selection even when using a transect. To randomly stratify your sample plots, divide the transect into logical subdivisions given the length of your transect (i.e. every 50 metres for a 500-metre transect, or every 10-20 metres for a 100-metre transect) and the pattern you are trying to examine (if you expect rapid change, use more subdivisions; if you expect slow, gradual change, use fewer). Then randomly allocate the sample plots within each subdivision of the transect (see the figure below).
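The randomly stratified placement described above can be sketched as follows: the transect is cut into subdivisions and one plot position is drawn at random within each (the 500 m transect and 50 m subdivisions are the example values from the text):

```python
import random

random.seed(3)

def stratified_transect_plots(length_m, subdivision_m):
    """One randomly placed plot within each subdivision of the transect."""
    positions = []
    for start in range(0, length_m, subdivision_m):
        end = min(start + subdivision_m, length_m)
        positions.append(random.uniform(start, end))  # random point in subdivision
    return positions

# 500 m transect divided every 50 m -> 10 plot positions
plots = stratified_transect_plots(500, 50)
```

Every subdivision is guaranteed a plot, while the exact position within each subdivision remains random.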
Quadrat sampling
Quadrat sampling is a classic tool for the study of ecology, especially biodiversity. It is an important method by which organisms in a certain proportion (sample) of the habitat are counted directly. It is used to estimate population abundance (number), density, frequency and distributions. The quadrat method has been widely used in plant studies. A quadrat is a four-sided figure which delimits the boundaries of a sample plot. However, the term quadrat can be used more widely to include circular plots and other shapes.
The quadrat sampling method has the following assumptions:
The number of individuals in each quadrat is counted.
The size of the quadrats is known.
The quadrat samples are representative of the study area as a whole.
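Given those assumptions, density and total abundance follow from simple arithmetic. A minimal sketch with made-up counts (the quadrat size, study area and counts below are all invented for illustration):

```python
# Hypothetical counts of a plant species in ten 1 m x 1 m quadrats
counts = [4, 7, 2, 5, 6, 3, 8, 4, 5, 6]
quadrat_area_m2 = 1.0
study_area_m2 = 10_000  # assumed 1-hectare study area

# Density = total individuals counted / total area sampled
density = sum(counts) / (len(counts) * quadrat_area_m2)  # individuals per m^2

# Scale up to the whole study area, assuming the quadrats are representative
abundance_estimate = density * study_area_m2
```

The scaling step is exactly where the third assumption matters: if the quadrats are not representative of the study area, the abundance estimate will be biased.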
The pros of quadrat sampling are:
The method is easy to use and inexpensive.
It is suitable for studying plants, slow-moving animals and faster-moving animals with a small range.
It allows the researcher to perform the work directly in the field.
It measures abundance and requires only cheap equipment.
Some of the cons are:
Quadrat sampling is not useful for studying very fast-moving animals, which do not stay within the quadrat boundaries.
There is a bias in favour of slow-moving taxa.
It collects only taxa that are present at the sampling time and not buried too deep in the sediment.
It may underestimate taxonomic richness and assemblage composition.
Some animals may experience harm if the scientist collects the population within the quadrat rather than studying it in the field.
A nice summary of sampling approaches can be found in the figure below.
Purposive sampling
Purposive sampling is a non-probabilistic sampling process in which a researcher selects a specific group of individuals or units for analysis because they have characteristics that are needed in the sample. This method is appropriate when the researcher has a clear idea of the characteristics or attributes they are interested in studying and wants to select a sample representative of those characteristics. So, units are not selected by a random process, but “on purpose”.
It is also known as judgemental sampling, and this method relies on the researcher’s judgement when identifying and selecting the individuals, cases, or sites that can provide the best information to achieve the study’s objectives.
Purposive sampling is best used when you want to focus in depth on relatively small samples. Perhaps you would like to access a particular subset of the population that shares certain characteristics, or you are researching issues likely to have unique cases. It is particularly useful if you need to find information-rich cases or make the most out of limited resources.
Exercises
In this set of exercises there are a few multiple choice questions, to check you have followed the key messages from this module, and then a final question which requires some more thought. We will discuss the final question in this week’s webinar - please note down any major challenges you can think of in this process, either linked to the scenario described, or ones which you have faced or may face in the future when conducting sampling, and we can talk through these problems during the session.
Question 1
| Sex | Percentage | 95% Confidence Interval |
|---|---|---|
| Male | 60.0% | [55.0%, 65.0%] |
The table above shows some results on the percentage of male whales from a sample of 400 whales, which we can consider a representative and random sample of our population of interest: all whales of a particular species.
Question 2
Question 3
A research study will assess the impacts of environmental contamination (heavy metals, POPs…etc) on wildlife along the Kobuk river system, Alaska. The study will involve environmental sampling of the system, as well as wildlife sampling of species thought to be affected, especially salmon. The researcher will then be looking for links between pollutants and effects on wildlife health and on the ecosystem.
Question 4
The best way to gain an understanding of a sampling and estimation method is to carry it out on some real population of interest to you. In the process of carrying out the sampling and obtaining the estimates, think about or discuss with others the following:
a- What practical problems arise in establishing a frame, such as a map or a list of units, from which to select a sample?
b- How would you carry out the sample selection?
c- What special problems arise in observing the units selected?
d- How would you estimate the population characteristics of interest (mean? proportion? …)?
e- How would you estimate the precision of your estimate?
f- Can you think of any way of improving your procedure?
External Links and Resources
Webster, R. and Lark, M. (2013) Field Sampling for Environmental Science and Management (This book provides guidance on the design and analysis of sampling method, backed by sound rationale and statistical theory. It concentrates on design-based sampling for estimates of mean values of environmental properties, emphasizing replication and randomization.)