
A/B Testing Statistics Explained: P-Value, Confidence Intervals and Sample Size

P-values alone won't tell you if your A/B test result is real. Here's what confidence intervals, sample size, and statistical power mean.


Understanding the statistics behind your A/B test results is what separates teams that make reliable decisions from teams that ship false winners. P-value, confidence intervals, sample size, and statistical power are not just technical concepts; they are the tools that tell you whether your experiment result is real or noise. This guide explains each one in plain English and shows you exactly how to use them together when evaluating your results.

Most teams running A/B tests understand the concept. You split traffic, show users two different experiences, and measure which one performs better.

What most teams don't understand is the statistics that tell you whether the result you're seeing is actually real.

This is where false winners come from. Not from bad ideas or wrong hypotheses, but from misreading numbers that look positive but aren't statistically reliable. A result that looks like a 4% lift on day five might be noise. A p-value below 0.05 might not be enough on its own. A confidence interval that crosses zero means you don't have a clear direction yet.

Understanding these concepts doesn't require a statistics degree. It requires knowing what each number is telling you.

What Is Statistical Significance in A/B Testing?

Statistical significance is the measure of confidence that the difference you're seeing between your control and variant is not due to random chance.

When you run an A/B test, both groups will show some natural variation in behavior even if your change had zero effect. Statistical significance tells you how likely it is that the difference you observed would happen by chance alone.

The standard threshold most teams use is 95% confidence, meaning you accept at most a 5% risk of calling a difference real when it is actually due to chance. This is expressed as a significance level of 0.05.

Statistical significance is necessary for a reliable result. But it is not sufficient on its own. You also need to check your confidence interval, your sample size, and whether the lift is practically meaningful.

What Is a P-Value and How Do You Read It?

The p-value is the probability of observing a result as extreme as the one you measured, assuming there is actually no difference between control and variant.

A p-value of 0.03 means there is a 3% probability of seeing a result at least this extreme if there were no real difference; a p-value of 0.08 means an 8% probability. If your significance threshold is 0.05, a p-value of 0.03 clears it and a p-value of 0.08 does not.

The critical mistake most teams make is treating the p-value as the only number that matters. It is not. A p-value below your threshold tells you the result is statistically significant. It does not tell you how large the effect is, whether it is practically meaningful, or whether the lift is consistent across your key user segments.

Always read the p-value in combination with the confidence interval and your minimum detectable effect. Never in isolation.
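
To make this concrete, here is a minimal sketch of how a p-value for a conversion-rate test can be computed with a two-proportion z-test, using only the Python standard library. The visitor and conversion counts are hypothetical illustration numbers, not figures from any real experiment.

```python
# A minimal sketch (hypothetical counts) of a two-sided p-value for a
# conversion-rate A/B test, using a two-proportion z-test and only the
# Python standard library.
from statistics import NormalDist

def two_proportion_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that there is no real difference
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = (p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (p_b - p_a) / se
    # Probability of a result at least this extreme, in either direction
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical readout: 10,000 visitors per variant, 5.0% vs 5.6% conversion
print(round(two_proportion_p_value(500, 10_000, 560, 10_000), 4))  # ~0.058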

What Is a Confidence Interval and Why Does It Matter?

The confidence interval is the range within which the true effect of your change is likely to fall, at your chosen confidence level.

If your experiment shows a 4% lift with a 95% confidence interval of 1.2% to 6.8%, you can say with 95% confidence that the true lift is somewhere between 1.2% and 6.8%. That is a clean, directional result, as the entire range is above zero and above a meaningful threshold.

But if your confidence interval ranges from -0.8% to 4.3%, the lower bound crosses zero. That means the true effect could be negative. Your p-value might be below 0.05, but you do not have a reliable directional result yet. The test needs to run longer.
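
As an illustration, here is a minimal sketch of a 95% confidence interval for the absolute difference between two conversion rates, again with hypothetical counts. Note how a positive point estimate can still come with an interval that dips below zero.

```python
# A minimal sketch (hypothetical counts) of a 95% confidence interval for the
# absolute difference between two conversion rates, using an unpooled
# standard error and the Python standard library.
from statistics import NormalDist

def diff_confidence_interval(conversions_a, visitors_a, conversions_b, visitors_b,
                             confidence=0.95):
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    se = (p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # 1.96 for 95% confidence
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(500, 10_000, 560, 10_000)
print(f"95% CI for the absolute lift: {low:+.4f} to {high:+.4f}")
print("crosses zero, keep running" if low < 0 < high else "clear direction")
```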

This is one of the most common causes of false winners in A/B testing: teams see a positive point estimate and a significant p-value, miss the fact that the confidence interval crosses zero, and ship a variant that produces no real improvement in production. The how to read A/B test results guide covers exactly how to use confidence intervals as part of your winner declaration checklist.

What Is Sample Size and Why Does It Matter Before You Start?

Sample size is the number of users each variant needs to be exposed to before your results are statistically reliable.

This is not a number you check at the end of the experiment. It is a number you calculate before you start, based on three inputs.

Your baseline conversion rate is the current performance of the control, before any change. Your minimum detectable effect (MDE) is the smallest lift you consider worth detecting, typically expressed as a relative percentage. Your desired statistical power is the probability that your test will detect a real effect if one exists, typically set at 80%.

These three inputs determine how many users you need in each variant before your results mean anything. If you stop the test before reaching that number, you are making a decision on unreliable data, regardless of what the p-value shows.
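
Here is a minimal sketch of that pre-launch calculation, based on the standard normal-approximation formula for a two-variant test. The 5% baseline rate and 5% relative MDE are hypothetical inputs chosen for illustration, not recommendations.

```python
# A minimal sketch of the pre-launch sample size calculation for a two-variant
# test, based on the standard normal-approximation formula. The baseline rate
# and MDE below are hypothetical inputs, not recommendations.
from math import ceil
from statistics import NormalDist

def required_sample_size(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Users needed per variant to detect a relative lift of relative_mde."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)        # conversion rate if the MDE is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # controls the false negative rate
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical inputs: 5% baseline conversion, 5% relative MDE, 80% power
print(required_sample_size(0.05, 0.05))  # roughly 122,000 users per variant
```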

The 8-step A/B testing framework covers sample size calculation as a mandatory step before launch. Skipping it is one of the most consistent reasons teams end up with inconclusive results or false winners.

What Is Statistical Power?

Statistical power is the probability that your test will detect a real effect if one actually exists.

At 80% power, the standard setting, your test has an 80% chance of producing a statistically significant result when your variant truly outperforms control. The flip side is a 20% chance of a false negative: concluding there is no difference when there actually is one.

Lower power means more false negatives. Higher power requires larger sample sizes and longer test durations. The tradeoff is between sensitivity and speed. Most teams set power at 80% and accept the tradeoff, but knowing what the number means is essential for interpreting inconclusive results correctly.
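
You can see the tradeoff directly by reusing the hypothetical required_sample_size() sketch from the sample size section: raising power from 80% to 90% increases the users needed per variant by roughly a third for the same baseline and MDE.

```python
# Illustrative only: how the power setting changes required sample size,
# reusing the hypothetical required_sample_size() sketch defined above.
for power in (0.80, 0.90):
    n = required_sample_size(0.05, 0.05, power=power)
    print(f"power = {power:.0%}: {n:,} users per variant")
```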

What Is Minimum Detectable Effect?

The minimum detectable effect is the smallest lift your experiment is designed to reliably detect.

If you set your MDE at 5%, your experiment is designed to detect lifts of 5% or greater. A real lift of 2% might exist but your test will not reliably surface it at that MDE. You would need a larger sample size to detect smaller effects.

This matters when evaluating results because the confidence interval needs to sit entirely above your MDE, not just above zero, for you to have a result worth shipping. A 1.2% lift with an MDE of 5% is not a clean result even if it is statistically significant. The effect may be real, but it is too small to justify the cost of shipping and maintaining the variant.

How These Concepts Work Together

These statistics are not independent checks. They are a system. Here is how they connect in practice.

Before the experiment starts: calculate your required sample size based on your baseline conversion rate, MDE, and desired power. This tells you how long to run the test and when you are allowed to evaluate results.

During the experiment: do not peek at results and make decisions. Use sequential testing if your team cannot resist checking; it adjusts the significance threshold dynamically so early peeks do not inflate your false positive rate.

When evaluating results: check the p-value against your significance threshold, check the confidence interval to confirm it sits entirely above zero and above your MDE, verify that the lift holds across key user segments, and confirm that guardrail metrics are clean. All of these together, not any one in isolation, constitute a reliable result.
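
One way to keep that checklist honest is to encode it. The sketch below bundles the checks into a single hypothetical function; the segment and guardrail checks are represented as booleans you would fill in from your own analysis, and none of the names are from any specific tool.

```python
# A minimal sketch of the evaluation checks described above, bundled into one
# hypothetical function. Segment and guardrail checks are passed in as booleans
# you would determine from your own analysis; nothing here is a library API.
def is_reliable_winner(p_value, ci_low, ci_high, mde_absolute, alpha=0.05,
                       segments_consistent=True, guardrails_clean=True):
    significant = p_value < alpha          # clears the significance threshold
    clear_direction = ci_low > 0           # interval sits entirely above zero
    meaningful = ci_low >= mde_absolute    # entire interval above the MDE
    return all([significant, clear_direction, meaningful,
                segments_consistent, guardrails_clean])

# Hypothetical readout: significant p-value, but the interval dips below the MDE
print(is_reliable_winner(p_value=0.03, ci_low=0.004, ci_high=0.028,
                         mde_absolute=0.01))  # False: not a shippable winner
```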

This is the evaluation process that prevents false winners. Teams that skip any part of it are not running experiments. They are running expensive guesses with extra steps.

Why Getting This Wrong Is Expensive

Shipping a false winner is not just a missed opportunity. It is an active cost. Engineering time spent building and maintaining a variant that does not improve the metric. Analytics time spent trying to understand why the production numbers do not match the experiment results. Strategic decisions made on the basis of data that was never reliable.

The good news is that all of these mistakes are preventable with the right process. Understanding where A/B tests fail most often starts with the statistics -- and the statistics are learnable.

Watch the 10-minute breakdown on YouTube

Want to run experiments that produce results you can trust?

If you want to build an experimentation program with the right statistical foundations, tooling, and evaluation process in place, book a call.

Book a 30-minute call with Gregor

FAQ

What is a p-value in A/B testing?

A p-value is the probability that your observed result happened by chance, assuming there is no real difference between control and variant. A p-value below your significance threshold (typically 0.05) means the result is statistically significant -- but it should always be read alongside the confidence interval and sample size, not in isolation.

What is a good confidence interval for an A/B test?

A confidence interval that sits entirely above zero confirms a positive directional result. For the result to be worth shipping, the entire confidence interval should also sit above your minimum detectable effect. If the interval crosses zero, the test needs to run longer regardless of the p-value.

How do you calculate sample size for an A/B test?

Sample size is calculated from three inputs: your baseline conversion rate, your minimum detectable effect (the smallest lift worth detecting), and your desired statistical power (typically 80%). Most experimentation tools and free online calculators can generate this number before you start the experiment.

What is statistical power in A/B testing?

Statistical power is the probability that your test will detect a real effect if one exists. At 80% power, there is a 20% chance of a false negative -- concluding no difference exists when one actually does. Higher power requires larger sample sizes. The standard setting of 80% is a deliberate tradeoff between sensitivity and test duration.

What is the minimum detectable effect in A/B testing?

The minimum detectable effect (MDE) is the smallest lift your experiment is designed to reliably surface. Setting a lower MDE requires a larger sample size. Setting a higher MDE allows you to run shorter tests but means you will miss smaller real effects. Your MDE should reflect the minimum improvement that would justify shipping and maintaining the variant.

What causes false winners in A/B tests?

False winners are almost always caused by peeking at results before reaching the target sample size, relying on p-values without checking confidence intervals, ignoring segment-level variation, or failing to account for multiple hypothesis testing when running A/B/n tests. A structured evaluation checklist prevents all of these.

