This is the first in a series of posts that aims to clarify a few commonly used phrases in Conversion Rate Optimisation (CRO) statistics and debunk a few myths about what you can and can't infer from your test statistics. This article uses examples from the A/B testing tool Optimizely, but the explanations apply to the statistics from any testing tool.
Table of contents:
Part one: confidence intervals & confidence limits in testing
Part two: statistical significance
Ever seen these figures on an A/B split test and wondered what they mean? The values to the right of the conversion rate are what we call 'confidence intervals'. These are the values that, when added to and subtracted from the test conversion rate, give us confidence limits. Confidence limits mark out a range within which we can say, with relative safety, that the 'true' value of the conversion rate lies.
Wait…what?!
Okay, let me put it another way. Assuming we're testing to a 95% confidence level, the confidence limits on 'Variation #1' above mean: "I'm 95% confident that, based on this test, the true value of the conversion rate of Variation #1 lies between 2.87% (which is 3.26% – 0.39) and 3.65% (which is 3.26% + 0.39)."
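If you want a feel for where numbers like these come from, here is a minimal sketch using the standard normal (Wald) approximation. Optimizely's own calculation is more sophisticated, so treat this as an illustration rather than a reproduction of its output; the visitor and conversion counts are made up.

```python
from math import sqrt

def confidence_limits(conversions, visitors, z=1.96):
    """Approximate confidence limits for a conversion rate using the
    normal (Wald) approximation; z=1.96 corresponds to a 95% level."""
    rate = conversions / visitors
    margin = z * sqrt(rate * (1 - rate) / visitors)  # the "+/-" value
    return rate - margin, rate + margin

# Hypothetical numbers: 652 conversions out of 20,000 visitors (about 3.26%)
low, high = confidence_limits(652, 20_000)
print(f"rate: {652 / 20_000:.2%}, limits: {low:.2%} to {high:.2%}")
```

Notice that the margin shrinks as the visitor count grows, which is exactly why sample size matters so much in the next section.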
That’s all well and good but why do we need these?
The reason we need confidence intervals and limits is that it would be impossible to run a conversion optimisation test on the whole population. When we test a sample of the population, we can't assume it will behave in a way that exactly represents the whole population. What we can assume is that the sample provides an estimate of how the whole population would behave, and the confidence interval tells us how precise that estimate is.
So we can estimate how all our users will act, what next?
Now we can start comparing the confidence intervals of the original and the variation. If the two intervals do not overlap (as in our first example above), it is very safe to assume that 'Variation #1' will increase the conversion rate over the original, provided we have tested with a suitable sample size and let the test run for (in our opinion) at least two 'business cycles'.
How can you say that?
Take a look at the illustration below. This is a (very crude) picture of the results. As you can see, the most extreme low (or lower confidence limit, if you're being fancy) of the variation is still higher than the most extreme high of the original. So even in the unlikely event that the original's true value sat at its upper limit and the variation's at its lower limit, the variation would still win. Now, this is a fairly extreme example, but it gives an idea of how you can use confidence intervals to interpret your test data.
*absolutely, in no way, to scale
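As a concrete version of that check, here is a short sketch that compares two intervals in the way described above. The helper name and the numbers are hypothetical.

```python
def intervals_overlap(low_a, high_a, low_b, high_b):
    """Return True if the two confidence intervals share any values."""
    return low_a <= high_b and low_b <= high_a

# Hypothetical limits: original 2.10%-2.50%, variation 2.87%-3.65%
original = (0.0210, 0.0250)
variation = (0.0287, 0.0365)

if not intervals_overlap(*original, *variation) and variation[0] > original[1]:
    print("Even the variation's worst case beats the original's best case.")
else:
    print("The intervals overlap - the intervals alone can't call a winner.")
```

One caveat: overlapping intervals don't prove the two versions perform the same; they just mean the intervals on their own can't settle it.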
The great thing about confidence intervals is that they not only provide an alternative way of visualising your test results, but also give you extra information on the largest and smallest effects you can reasonably expect.
Do you use confidence intervals and limits? If so, leave a comment and let us know how you use them. We’d love to hear from you.
Part two: statistical significance
Probably the most used and most contested CRO term of them all is statistical significance. How many times have you quoted a test running to '95% statistical significance'? 95 times? 100 times? But what does it actually mean? Is it a good thing?
Firstly, before we delve into the glossary, a brief introduction. In a lot of CRO testing, we agree to call a test significant (i.e. the probability of the result being a fluke is low enough for us to accept it) when it reaches 95% statistical significance and 80% statistical power. This will make more sense later on.
So, let's dive straight in. Statistical significance measures how often, when the variation is actually no different from the original, the test will correctly report no difference. So when you test to 95%, you are basically saying: if I ran this test 20 times on two versions that are genuinely identical, 19 times it would (correctly) show no difference.
Wait, what?!
Okay, let me put it another way. Say we have tested the original landing page of a company, which for the sake of argument we'll call Frank's Merchandise. We know that Frank's homepage converts 3% of traffic into sales. Say we then test a variation of the homepage whose true conversion rate is also 3%. Despite the two being identical, if we tested them against each other repeatedly, 5% of the time the test would show that one version was better. This is called a false positive result.
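You can see this with a quick simulation. The sketch below repeatedly runs an 'A/A test' (both versions truly convert at 3%) through a standard two-proportion z-test at the 95% level. The visitor numbers are made up, and the exact false positive rate will wobble around 5% from run to run.

```python
import random
from math import sqrt

def z_test_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-sided two-proportion z-test at roughly the 95% level."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se > z_crit

random.seed(1)
visitors, true_rate, runs = 10_000, 0.03, 1_000
false_positives = 0
for _ in range(runs):
    # Both 'versions' convert at exactly 3% - any declared winner is a fluke
    conv_a = sum(random.random() < true_rate for _ in range(visitors))
    conv_b = sum(random.random() < true_rate for _ in range(visitors))
    if z_test_significant(conv_a, visitors, conv_b, visitors):
        false_positives += 1

print(f"Declared a 'winner' in {false_positives / runs:.1%} of identical A/A tests")
```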
Great!
Or is it? One in 20? Nearly twice as likely as rolling a double six in Monopoly? Put another way, if we tested two versions of exactly the same webpage against each other over and over, 1 out of every 20 tests would say that one variation outperformed the other. Is that good enough for us?
But what’s statistical power, and what’s the difference between the two?
Now, statistical power is kind of the opposite of statistical significance. Statistical power measures how often, when a variation really is different from the original, the test will pick up on it. The industry standard is 80% statistical power, which basically means that if I ran a test where I knew categorically there was a difference in conversion rate between the original and the variation, the test would pick up on that difference 8 times out of 10. The other 2 times out of 10, the test would fail to pick up on the difference even though there is one; missing a real difference like this is called a false negative.
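Power depends on how big the real difference is and how many visitors you test, so here is a rough sketch of how you might estimate it with a normal approximation. The baseline rate, uplift and sample size below are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_original, p_variation, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    using the normal approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p_original + p_variation) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt(p_original * (1 - p_original) / n_per_group
                  + p_variation * (1 - p_variation) / n_per_group)
    diff = abs(p_variation - p_original)
    return NormalDist().cdf((diff - z_crit * se_null) / se_alt)

# Hypothetical example: 3.0% baseline vs a true 3.6% variation,
# with 15,000 visitors in each arm
print(f"Estimated power: {approx_power(0.030, 0.036, 15_000):.0%}")
```

If an estimate like this comes out well below 80%, the usual fixes are to run the test for longer (more visitors) or to test a bolder change (a bigger expected uplift).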
But surely statistical significance and statistical power are industry standards set at 95% and 80% respectively for a reason?
Erm, actually no. There is no deep basis behind 95% for statistical significance or 80% for statistical power. They are simply the values most commonly used in statistics, particularly medical statistics, although a wider range of values is used in practice. One suggestion as to why statistical significance is set higher than statistical power is that a false positive is considered riskier: in medical terms, it is far more damaging to roll out a new drug treatment that is actually less effective than the control than to fail to roll out a more effective one. But in CRO, which would you say is more damaging to a business:
Not implementing a variation that increases conversion (a false negative)
or
Implementing a variation that has no real effect on conversion (a false positive)?
I hope that's given you more of a grounding in what a lot of CRO hypothesis testing is based on, and maybe thrown up a few more questions. I'd like to leave you with two questions to ask about your business:
- Which do you value more: never implementing a variation that makes no real difference, or never missing a variation that would provide results?
- With this in mind, would you change the levels of statistical significance and statistical power you test to?