Statistical significance is a common idea in research and evaluation. In everyday use, ‘significant’ means ‘considerable’, ‘large’ or ‘relevant’. But how should we think about statistical significance?

We can think of statistical significance as a measure of how confident we can be that an observed outcome reflects a real effect rather than chance. As an example, let’s say we want to know whether the average number of M&Ms in a packet bought in the North Island is the same as that in the South Island.

To run the experiment we simply buy up packets of M&Ms in each island (our two samples) and count the number of M&Ms in each packet. From this, we will get two key pieces of information: the average count and how the count varies from packet to packet. Let’s say we sample 100 packets from each island and get an average count of 65 M&Ms in the North Island, and 55 M&Ms in the South. Can we say that there is a true difference in M&M packets between the islands (perhaps they come from different factories)? Put another way, is the result ‘significant’ or could it just be due to chance?

There are two key factors in determining how confident we would be that there is a real difference between the samples: the number of packets we count in each sample and how the count varies. It’s fairly intuitive that the more packets we count, the greater our confidence in the result. Similarly, if the count does not vary much among packets, then we would have greater confidence in the final averages than if the count bounced around a lot.
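To make these two factors concrete, here is a minimal Python sketch of the standard error of the difference between two sample averages, which is the usual yardstick for how much that difference could wobble by chance. The function name and the spread values (standard deviations of 10 and 30 M&Ms per packet) are assumptions for illustration only; the example above does not report how much the counts varied.

```python
import math


def standard_error_of_difference(sd: float, n: int) -> float:
    """Standard error of the difference between two sample means.

    Assumes both samples have the same size n and the same
    standard deviation sd (a simplification for illustration).
    """
    return math.sqrt(sd**2 / n + sd**2 / n)


# Hypothetical spreads and sample sizes; none of these come from real data.
for sd in (10, 30):
    for n in (25, 100, 400):
        se = standard_error_of_difference(sd, n)
        print(f"sd={sd:>2}, n={n:>3}: standard error = {se:.2f}")
```

The standard error falls as the number of packets rises and grows as the counts become more variable, which matches the intuition above.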

To do the maths, we imagine that the samples came from the same factory (in formal language, we call this the *null hypothesis of zero difference*). In this scenario, getting a different average number of M&Ms in the North vs. South Island packets is considered to happen just by chance – i.e. perhaps we just happened to buy packets with fewer M&Ms than average in the South Island and more than average in the North.
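One way to picture the same-factory scenario is to simulate it. The sketch below repeatedly draws two samples of 100 packets from one imaginary factory and records the difference between their averages. The factory average of 60, the spread of 30 and the use of a normal distribution are all assumptions made for illustration, and `simulate_null_differences` is a made-up helper, not part of any library.

```python
import random
import statistics


def simulate_null_differences(mean, sd, n, trials, seed=0):
    """Differences in sample means when both samples share one factory.

    Packet counts are drawn from a normal distribution, which is an
    assumption; real counts are whole numbers and may be distributed differently.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(trials):
        north = [rng.gauss(mean, sd) for _ in range(n)]
        south = [rng.gauss(mean, sd) for _ in range(n)]
        diffs.append(statistics.fmean(north) - statistics.fmean(south))
    return diffs


# Hypothetical factory: 60 M&Ms per packet on average, spread of 30.
diffs = simulate_null_differences(mean=60, sd=30, n=100, trials=2000)

# Share of simulated experiments with a gap of 10 or more in either direction.
share_large = sum(abs(d) >= 10 for d in diffs) / len(diffs)
print(f"Share of same-factory experiments with a gap of 10+: {share_large:.3f}")
```

Under these assumed numbers, a gap of 10 M&Ms between the islands turns up only occasionally when nothing real separates them; a larger assumed spread would make such gaps more common.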

This is the essence of statistical significance – in our case, it is the probability of seeing a difference at least as large as the one we observed if the two samples of M&Ms really were from the same factory (considering the number of packets we measured and the count variation we observed). If the maths shows us that this probability is very low, then we can say with some confidence that the samples actually came from different factories (formally, we *reject the null hypothesis*). The usual rule of thumb in research is a probability (or *p*-value) less than 0.05. This means that, if there were no real difference, we would expect a result this extreme to happen by chance fewer than 1 time in every 20 experiments.
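As a rough illustration of the maths, the sketch below computes a two-sided *p*-value for the M&M example, taking the *p*-value to be the chance of a gap at least this large if both samples came from the same factory. It uses a normal approximation (a z-test), which is one reasonable choice with 100 packets per sample but not the only one. The standard deviation of 30 is an assumption, since the example gives only the averages and sample sizes, and `two_sided_p_value` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def two_sided_p_value(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    # Standard error of the difference between the two sample means.
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    # How many standard errors apart the two averages are.
    z = (mean_a - mean_b) / se
    # Probability of a gap at least this large, in either direction,
    # if the null hypothesis of zero difference were true.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Averages and sample sizes from the example; the spread of 30 is assumed.
p = two_sided_p_value(mean_a=65, mean_b=55, sd_a=30, sd_b=30, n_a=100, n_b=100)
print(f"p = {p:.3f}")
print("Below the 0.05 threshold" if p < 0.05 else "Not below the 0.05 threshold")
```

With this assumed spread the *p*-value lands below 0.05, but a noisier count or fewer packets could push the same 10-M&M gap above the threshold.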

Of course, this threshold is somewhat arbitrary and it’s important to remember that a *p*-value below this threshold doesn’t guarantee that there is a real difference (it could still be due to chance!). Conversely, a *p*-value above 0.05 doesn’t mean there definitely is no difference, just that we haven’t been able to detect one in our particular experiment. Nowadays it is emphasised that all *p*-values should be reported regardless of significance and that we must carefully consider the details of the experiment to give context to these values.

If you’re interested in this topic, here’s a primer that you might enjoy: