
How to Design Your First A/B Test: An 8-Step Framework for Beginners

Running your first A/B test? Here is the complete 8-step framework for designing, running, and analyzing experiments that actually work.


Designing your first A/B test requires more than splitting traffic and comparing numbers. To get results you can actually trust, you need a clear hypothesis, the right metrics, a calculated sample size, and a structured process for analyzing and documenting what you find. This guide walks through the complete 8-step framework for designing, running, and analyzing your first A/B test, and building an experimentation practice your team can scale.

Prefer to watch? See the full framework on YouTube or download the free playbook to use as a reference.

Most teams approach their first A/B test the wrong way.

They have an idea, they build two versions, they split traffic, they check which number is higher, and they ship the winner. It feels data-driven. It is not.

Without a documented hypothesis, you do not know what you are testing. Without the right primary metric, you are measuring the wrong thing. Without a calculated sample size, you are making decisions on data that is not statistically reliable. And without documentation, the learning disappears the moment the experiment ends.

The 8-step framework below fixes all of that. It is the same process used to run experimentation programs for growth-stage SaaS companies, and it works whether you are running your first test or your fiftieth.

Step 1: Identify What to Test

The starting point is not an idea. It is a signal.

Good experiment ideas come from four sources. Analytics data reveals where users are dropping off in your funnels: these drop-off points are your highest-opportunity areas for testing. User feedback from customer interviews, support tickets, and reviews surfaces the pain points users are telling you about directly. Heatmaps and session replays identify UX and UI friction points that are invisible in standard conversion data. And business goals give you direction: if your objective is to increase trial-to-paid conversion, every experiment should connect back to it.

Start by listing the biggest friction points across these four sources. Do not test randomly. Test where the data tells you impact is most likely.

Step 2: Prioritize Using the ICE Framework

Once you have a list of potential tests, you need a way to decide which one to run first. The ICE framework gives you a consistent, objective scoring method.

Score each experiment idea on three dimensions, each out of 10. Impact is how large the improvement could be if the hypothesis is correct. Confidence is how certain you are that the hypothesis will produce a positive result, based on the evidence you have. Ease is how simple the test is to implement given your current engineering and design resources.

Multiply the three scores together. The highest-scoring experiments go to the top of your roadmap. This removes opinion from prioritization and replaces it with a structured, repeatable process.
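
If your backlog lives in a spreadsheet or a script, the scoring is easy to automate. Below is a minimal Python sketch; the idea names and scores are invented for illustration.

```python
# Minimal ICE prioritization sketch. The ideas and scores below are made up;
# plug in your own backlog.
ideas = [
    {"name": "Shorten signup form",        "impact": 8, "confidence": 6, "ease": 7},
    {"name": "Green CTA button on mobile", "impact": 5, "confidence": 7, "ease": 9},
    {"name": "Redesign pricing page",      "impact": 9, "confidence": 4, "ease": 3},
]

for idea in ideas:
    # ICE score = Impact x Confidence x Ease, each rated out of 10
    idea["ice"] = idea["impact"] * idea["confidence"] * idea["ease"]

# Highest-scoring experiments go to the top of the roadmap
for idea in sorted(ideas, key=lambda i: i["ice"], reverse=True):
    print(f'{idea["ice"]:>4}  {idea["name"]}')
```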

Step 3: Write a Strong Hypothesis

A hypothesis is not a description of what you are changing. It is a prediction of what will happen, why it will happen, and how you will measure it.

The formula is: if we make this change, then this outcome will occur, because of this evidence.

A weak hypothesis looks like: changing the button color will increase conversion. There is no direction, no magnitude, and no rationale.

A strong hypothesis looks like: if we change the CTA button from blue to green and increase the size by 20%, then click-through rate will increase by at least 5%, because user interviews showed that customers were not noticing the current button on mobile screens.

The strong version tells you exactly what you are testing, what result you expect, and why you believe the change will work. It also makes the test interpretable: when you come back to evaluate the results, you have a clear standard to measure against.

Step 4: Choose the Right Metrics

Every experiment needs one primary metric and three to five secondary metrics.

The primary metric is the single most important measure of whether your hypothesis is correct. It should be directly tied to your business goal and sensitive enough to detect a meaningful change within a reasonable sample size. Revenue per visitor, conversion rate, add-to-cart rate, and activation rate are common primary metrics depending on what you are testing.

Secondary metrics serve two purposes. Guardrail metrics sit upstream or downstream of your primary metric and ensure the variant is not improving your primary metric at the cost of something else. Diagnostic metrics help you understand what is driving the result. Choosing these before the experiment starts is what separates rigorous experimentation from post-hoc rationalization.
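
One lightweight way to commit to your metrics up front is to write the plan down as a small structured object before the variants are built. The sketch below is illustrative only; the metric names are examples, not a prescribed taxonomy.

```python
# A minimal metric plan recorded before launch (example names, not a standard).
metric_plan = {
    "primary": "trial_to_paid_conversion_rate",
    "guardrails": [
        "average_revenue_per_user",   # the lift should not come at revenue's expense
        "support_ticket_rate",        # the change should not confuse users
    ],
    "diagnostics": [
        "pricing_page_view_rate",     # helps explain why the primary metric moved
        "checkout_start_rate",
    ],
}
```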

If you are unsure how to evaluate these metrics once results come in, the A/B testing statistics guide covers p-values, confidence intervals, and sample size in plain English.

Step 5: Calculate Your Sample Size

This is the step most beginners skip, and it is the most common reason experiments produce unreliable results.

Sample size is the number of users each variant needs to be exposed to before your results are statistically reliable. You calculate it before the experiment starts using four inputs: your baseline conversion rate (the current performance of the control), your minimum detectable effect (the smallest lift you consider worth detecting), your desired confidence level (typically 95%), and your desired statistical power (typically 80%).

Amplitude's sample size calculator is one of the best available tools for this. Enter your inputs and it tells you how many users per variant you need and approximately how long the test needs to run based on your current traffic.

If the required duration comes out to three months and you only have two weeks of patience, the answer is not to run the test anyway. The answer is to raise your MDE (accept that you will only detect larger effects) or change your primary metric to one that reaches significance faster.
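
If you want to sanity-check what a calculator gives you, the standard two-proportion approximation is straightforward to reproduce. The Python sketch below uses only the standard library; the baseline and MDE values are invented, and any given tool may report slightly different numbers depending on the exact formula it uses.

```python
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_relative, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)          # expected variant rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# 4% baseline conversion, looking for at least a 10% relative lift
print(sample_size_per_variant(baseline=0.04, mde_relative=0.10))  # roughly 39,500 per variant
# Doubling the MDE shrinks the requirement to roughly a quarter of that:
print(sample_size_per_variant(baseline=0.04, mde_relative=0.20))
```

The second call shows the trade-off described above: accepting a larger minimum detectable effect is what makes an otherwise months-long test feasible.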

Want the full framework as a reference document?

The A/B test design playbook covers all 8 steps with worked examples, hypothesis templates, and a documentation framework you can use immediately.

Download the free playbook

Step 6: Launch and Run the Test

Once your sample size is calculated and your variants are built, launch the test in stages rather than all at once.

Start by rolling out to 25% of your target users. Monitor for technical issues such as broken events, unexpected behavior, or data not flowing correctly. If everything looks healthy, expand to 50%, then 75%, then 100%. This staged rollout reduces the risk of a broken variant affecting your entire user base.
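
One common way to make a staged rollout behave well is deterministic hashing, so the same user stays in or out of the test as the percentage grows. This is a generic sketch, not a description of any particular testing tool; the salt and function names are made up.

```python
import hashlib

def rollout_bucket(user_id: str, salt: str = "exp-cta-color-v1") -> float:
    """Map a user to a stable number in [0, 1) so rollout percentages are sticky."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 2**32

def in_experiment(user_id: str, rollout_pct: float) -> bool:
    # Expanding rollout_pct from 0.25 to 0.50 keeps every previously
    # included user in the test and only adds new ones.
    return rollout_bucket(user_id) < rollout_pct

# Usage: widen the rollout over time without reshuffling users
for user in ["u_101", "u_102", "u_103"]:
    print(user, in_experiment(user, rollout_pct=0.25))
```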

Once at full traffic, wait. Do not check results daily and make decisions based on early data. Early results are statistically unreliable and will frequently reverse as more data comes in. This is called the peeking problem -- and it is one of the most consistent causes of false winners in A/B testing. The guide to reading A/B test results covers exactly how to avoid it.
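
You can see the peeking problem for yourself with a short simulation: even in an A/A test with no real difference between variants, checking for significance every day and stopping at the first p < 0.05 declares a "winner" far more often than the 5% a single look would allow. This is an illustrative sketch with invented traffic numbers, not analysis code to reuse.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test on the pooled proportion."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(7)
simulations, days, users_per_day, true_rate = 500, 20, 250, 0.05
false_winners = 0

for _ in range(simulations):
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(days):                      # "check the dashboard every day"
        n_a += users_per_day
        n_b += users_per_day
        conv_a += sum(random.random() < true_rate for _ in range(users_per_day))
        conv_b += sum(random.random() < true_rate for _ in range(users_per_day))
        if two_proportion_p_value(conv_a, n_a, conv_b, n_b) < 0.05:
            false_winners += 1                 # stopped early on a random fluke
            break

print(f"False winner rate with daily peeking: {false_winners / simulations:.0%}")
# With no real difference at all, this typically lands well above 5%.
```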

The exception is a serious negative result. If a guardrail metric drops significantly, pause the experiment and investigate before continuing.

Step 7: Analyze Your Results

When the experiment has run for its full duration and reached its target sample size, evaluate the results using a consistent framework.

Check statistical significance first. If your p-value is below 0.05 and your confidence level is at or above 95%, the result clears the significance threshold. If not, the result is inconclusive; do not ship based on it.

Check the confidence interval. For a positive result, the interval for the difference should sit entirely above zero. If the lower bound crosses zero, extend the test; a reported p-value below 0.05 paired with a confidence interval that still crosses zero is not a clean result.

Check statistical power. Power at or above 80% confirms the test had enough sensitivity to detect a real effect. If power is low, a negative result does not mean the hypothesis was wrong; it may mean the test was underpowered.

Check segment consistency. Does the lift hold across new users and returning users? Across mobile and desktop? An overall positive result that is driven entirely by one segment needs to be understood before a shipping decision is made.
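
To make the first two checks concrete, here is a minimal standard-library sketch that computes the p-value and 95% confidence interval for a conversion-rate difference from raw counts. The counts are invented, and a dedicated stats tool (or the statistics guide linked above) remains the better reference for edge cases.

```python
from math import sqrt
from statistics import NormalDist

# Invented example counts: (conversions, users) for control and variant
conv_a, n_a = 1_180, 24_000
conv_b, n_b = 1_302, 24_000

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# Significance: two-sided z-test on the pooled proportion
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = diff / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval around the observed difference
se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_crit = NormalDist().inv_cdf(0.975)
ci_low, ci_high = diff - z_crit * se_diff, diff + z_crit * se_diff

print(f"relative lift: {diff / p_a:+.1%}  p-value: {p_value:.3f}")
print(f"95% CI for the absolute difference: [{ci_low:+.4f}, {ci_high:+.4f}]")
# Ship only if the p-value clears 0.05 AND the interval sits entirely above zero.
```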

Step 8: Document Everything

Documentation is what transforms a single experiment into a compounding knowledge base.

Every experiment should be documented with the same structure: the hypothesis you started with, the primary and secondary metric results, the time period the test ran, any segment-level observations, the decision you made (ship or kill), and the reason for that decision.
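
If you want this log to be machine-readable as well as human-readable, a simple structured record works fine. The sketch below is one possible shape; the field names and example values are suggestions, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    """One entry in the experiment log; field names are illustrative."""
    hypothesis: str
    primary_metric_result: str
    secondary_metric_results: list[str]
    start_date: str
    end_date: str
    segment_notes: list[str] = field(default_factory=list)
    decision: str = "kill"            # "ship" or "kill"
    decision_rationale: str = ""

record = ExperimentRecord(
    hypothesis="Green, 20% larger CTA -> +5% CTR (users missed the button on mobile)",
    primary_metric_result="CTR +6.2%, p = 0.01",
    secondary_metric_results=["Bounce rate flat", "Revenue per visitor +1%"],
    start_date="2024-03-01",
    end_date="2024-03-21",
    segment_notes=["Lift concentrated on mobile; desktop flat"],
    decision="ship",
    decision_rationale="Primary metric cleared significance; guardrails unaffected.",
)
```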

This documentation serves two purposes. In the short term, it keeps your team aligned on what was tested and why decisions were made. In the long term, it becomes the raw material for generating new hypotheses: patterns emerge across experiments that point directly to the next test worth running.

Teams that skip documentation run the same experiments twice, lose the context behind past decisions, and never build the compounding learning base that makes experimentation programs genuinely valuable.

5 Common A/B Testing Mistakes to Avoid

Stopping too early. Impatience is the most common source of false winners. Run tests to their full duration and sample size, with no exceptions unless a guardrail metric triggers a pause.

Testing too many things at once. Running multiple changes in a single variant makes it impossible to know which change drove the result. Test one thing at a time.

Running without a hypothesis. A test without a hypothesis is not an experiment; it is a guess with extra steps. Write the hypothesis before you build anything.

Choosing the wrong primary metric. The primary metric should be tied directly to your business goal and sensitive enough to reach significance in a reasonable timeframe. A metric that takes six months to move is not a useful primary metric.

Not documenting results. Two quarters from now, your team will not remember what they tested or why they made the decisions they made. Document every experiment regardless of whether it won or lost.

Download the free A/B test design playbook

The full 8-step playbook gives you hypothesis templates, a sample size reference, metric selection guidance, and a documentation framework: everything you need to run your first experiment properly.

Download the free playbook

Want to build a scalable experimentation program?

If you want to go beyond your first test and build an experimentation practice your whole team can run independently, have a look at our program.

Read the program

FAQ

How do you design an A/B test from scratch?

Start by identifying what to test using analytics data, user feedback, heatmaps, and business goals. Write a structured hypothesis using the formula: if we make this change, then this outcome will occur, because of this evidence. Choose a primary metric tied to your business goal, calculate your required sample size before launching, and document the results regardless of outcome.

What is the ICE framework for A/B test prioritization?

ICE stands for Impact, Confidence, and Ease. Score each experiment idea out of 10 on all three dimensions and multiply them together. The highest-scoring experiments go to the top of your roadmap. It is a fast, consistent way to prioritize without letting opinions or seniority drive the decision.

How do you write a good A/B test hypothesis?

A strong hypothesis follows the formula: if we make this change, then this outcome will occur, because of this evidence. It specifies what is changing, what metric you expect to move, by how much, and why you believe the change will work. A weak hypothesis says "changing the button color will increase conversion." A strong one says "if we change the CTA from blue to green and increase its size by 20%, click-through rate will increase by at least 5%, because user interviews showed the current button was not visible on mobile."

What is minimum detectable effect in A/B testing?

The minimum detectable effect (MDE) is the smallest improvement your test is designed to reliably detect. A smaller MDE means higher sensitivity but requires a larger sample size and longer test duration. Set your MDE based on what improvement would actually be worth shipping and maintaining -- not the smallest possible number.

How long should you run an A/B test?

Run the test until it reaches its target sample size and has captured at least one full business cycle -- including any weekly patterns like weekday versus weekend behavior. The required duration comes from your sample size calculation. Stopping early because results look positive is the most common cause of false winners.

What should you document after an A/B test?

Document the hypothesis, primary and secondary metric results, test duration, segment-level observations, the decision made (ship or kill), and the reason for that decision. This documentation becomes the foundation for generating future hypotheses and keeps your team aligned on what has already been tested.

Related articles

Deep Dive Article, 8 min
How to Send Amplitude Reports to Slack Automatically: A Step-by-Step Workflow
Stop waiting for someone to pull Amplitude data. Here's how to automate weekly reports, metric alerts, and AI summaries straight into Slack.

Deep Dive Article, 5 min
A/B Testing Statistics Explained: P-Value, Confidence Intervals and Sample Size
P-values alone won't tell you if your A/B test result is real. Here's what confidence intervals, sample size, and statistical power mean.

Guide, 5 min
How to Read A/B Test Results: A Practical Guide to Declaring a Winner
Most teams read A/B test results wrong. Here's the framework, checklist, and 3 questions you need to answer before you declare a winner.

Get in touch!

Adasight is your go-to partner for growth, specializing in product analytics and marketing strategy. We provide companies with top-class frameworks to thrive.
