A/B testing guide

How to run experiments that actually answer your questions

Many A/B tests are run incorrectly: wrong sample size, peeking at results early, or testing the wrong thing. Here is how to design, run, and interpret A/B tests correctly so you can make confident product decisions.

What A/B testing actually is

An A/B test is a controlled experiment in which you randomly split your user population into two groups. Group A sees the control (the current experience). Group B sees the variant (the change). You measure a predefined metric on both groups and determine whether the difference is statistically significant, i.e., larger than chance alone would plausibly produce.

The random split is what makes A/B testing powerful. It balances everything else (time of day, device type, user segment) across the two groups on average, so the only systematic difference between them is the change you are testing.
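One common way to implement a stable random split is to hash the user ID together with an experiment name. The sketch below is an illustration, not something this guide prescribes: the function name `assign_variant`, the experiment label, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a user to a variant for one experiment.

    Hypothetical helper: the same user always gets the same variant,
    and including the experiment name in the hash keeps assignments
    in different experiments independent of each other.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Usage: the split over many synthetic users should be close to 50/50.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "pricing-page-layout")] += 1
print(counts)
```

Hashing rather than drawing a fresh random number on each visit means a returning user keeps seeing the same experience, which matters for any metric measured over more than one session.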

When to A/B test

A/B testing is a powerful tool — when used in the right situation.

You have a hypothesis about a specific change

A/B tests answer 'does this change improve metric X?' — not open-ended questions. Your hypothesis must name the change, the metric, and the expected direction.

You have enough traffic

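As a rough rule of thumb, you need at least 1,000 users per variant per week; the real requirement depends on your baseline rate and the effect size you want to detect. Below that, most tests will not reach significance in a reasonable time, and you will just be guessing with extra steps.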

The decision will repeat

A/B testing makes sense for decisions you make many times (pricing page layout, onboarding flow, email subject lines). For a one-off strategic bet, run a qualitative study instead.

When not to A/B test

Traffic is too low

With too little traffic, a realistic effect will not reach significance in any reasonable time, and the result will be mostly noise. Running the test anyway and stopping when the number looks good is worse than not running it at all.

The change is strategic, not tactical

Whether to enter a new market, pivot the business model, or sunset a product — these are strategic decisions. They should not be decided by an experiment. Use competitive analysis, customer discovery, and leadership judgment.

You need to know why

A/B tests tell you what happened. They do not tell you why. If conversion dropped 4% in the variant, the test will not explain the reason. Pair experiments with session recordings, user interviews, and surveys.

Sample size and statistical significance

The most common A/B testing mistake is running a test for too short a time and peeking at the results. If you check results daily and stop the test as soon as you see significance, your false positive rate is far above 5%: you are fooling yourself into a confident but wrong decision.

Step 1: Calculate required sample size before you start

Use a power calculator (statsig.com has a free one). You will need three inputs: your baseline conversion rate, the minimum detectable effect (the smallest improvement you would care about), and the desired statistical power (80% is the conventional choice; it means you have an 80% chance of detecting a real effect of that size if one exists). Most calculators also let you set the significance level, which usually defaults to 5%.
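To see what such a calculator does with those inputs, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions with a two-sided test. The function name and the example numbers are assumptions for illustration, and the minimum detectable effect is given here as an absolute difference (0.02 means two percentage points); a real calculator may use a slightly different formula and return a slightly different number.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH group for a two-sided test.

    baseline : control conversion rate, e.g. 0.10
    mde_abs  : minimum detectable effect as an absolute lift, e.g. 0.02
    """
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Usage: detecting a lift from 10% to 12% versus from 10% to 11%.
print(sample_size_per_variant(baseline=0.10, mde_abs=0.02))
print(sample_size_per_variant(baseline=0.10, mde_abs=0.01))
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why choosing the smallest effect you actually care about is the most consequential input.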

Step 2: Set the duration before you start, then do not change it

Based on your sample size calculation and weekly traffic, determine how long the test must run. Commit to that duration before you launch. Run the test for its full predetermined period. Do not stop early — not when the result looks good, and not when it looks bad.
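The duration arithmetic can be sketched as follows. The helper name, the `traffic_share` parameter, and the example figures are hypothetical; rounding up to whole weeks is a choice made in this sketch so that every day of the week is covered equally.

```python
import math


def planned_duration_weeks(n_per_variant: int, weekly_users: float,
                           n_variants: int = 2,
                           traffic_share: float = 1.0) -> int:
    """Whole weeks needed to collect the planned sample.

    weekly_users  : eligible users per week across all variants
    traffic_share : fraction of eligible users enrolled in the test
    """
    weekly_in_test = weekly_users * traffic_share
    if weekly_in_test <= 0:
        raise ValueError("weekly traffic in the test must be positive")
    total_needed = n_per_variant * n_variants
    return math.ceil(total_needed / weekly_in_test)


# Usage: 3,841 users per variant with 2,500 eligible users per week.
print(planned_duration_weeks(3841, 2500))  # -> 4
```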

Why peeking is dangerous

At a 5% significance threshold, if you check results every day for 20 days and stop the moment significance is reached, your chance of a false positive when the change has no real effect rises to roughly 25%. About one in four tests of a change that does nothing would be declared a winner.
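The inflation can be checked with a small Monte Carlo simulation of A/A tests, where there is no real difference between the groups. This is a simplified sketch, not a model of any particular testing tool: it assumes equal traffic every day and treats each day's contribution to the z-statistic as an independent standard normal draw (a large-sample approximation). The function name and parameters are made up for the example, and the exact rate depends on these assumptions.

```python
import math
import random

Z_CRIT = 1.959964  # two-sided critical value for a 5% threshold


def false_positive_rate(n_days: int = 20, peek: bool = True,
                        n_sims: int = 4000, seed: int = 0) -> float:
    """Share of simulated A/A tests that are declared significant.

    peek=True  : check every day and stop at the first significant result
    peek=False : check once, at the end of the planned duration
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        significant = False
        for day in range(1, n_days + 1):
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(day)  # z-statistic on data so far
            is_last_day = day == n_days
            if (peek or is_last_day) and abs(z) > Z_CRIT:
                significant = True
                break
        false_positives += significant
    return false_positives / n_sims


print("single look at the end:", false_positive_rate(peek=False))
print("daily peeking:         ", false_positive_rate(peek=True))
```

Under these assumptions the single-look rate stays near the nominal 5%, while daily peeking pushes it several times higher, even though no individual check uses a looser threshold.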

How to interpret results

Significance is not a binary yes/no. These three concepts tell you what you actually know after a test completes.

p-value

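The probability of seeing a difference at least this large by chance, assuming there is no true underlying difference. A p-value below 0.05 means that, if the change had no effect at all, a result this extreme would occur less than 5% of the time. It is not the probability that the result is a fluke. 0.05 is the conventional significance threshold.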

Watch out

p < 0.05 does not mean the variant is good. It means a difference this large would be unlikely if the change had no effect. Always check whether the effect size is meaningful as well.
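For a conversion metric, the p-value is often computed with a two-proportion z-test. The sketch below shows that calculation under the usual large-sample assumption; the function name and the example counts are hypothetical, and your testing tool may use a different test.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference in conversion rates.

    Uses the pooled standard error, i.e. it assumes (as the null
    hypothesis does) that both groups share one true conversion rate.
    Not meaningful if the pooled rate is exactly 0 or 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Usage: 400 of 4,000 control users convert vs. 470 of 4,000 in the variant.
print(two_proportion_p_value(400, 4000, 470, 4000))
```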

Confidence interval

The range of values within which the true effect likely falls. A result of '+3.2% conversion (95% CI: +0.8% to +5.6%)' tells you the effect is very likely positive, because the interval excludes zero, and gives you a range to plan around.

Watch out

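A wide confidence interval that spans zero (e.g., -1% to +7%) means you do not have enough data to tell a loss from a win. The result is inconclusive, not a basis for a decision. If you need an answer, plan a follow-up test with a larger precomputed sample rather than extending this one until it looks significant.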
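A confidence interval for the difference between two conversion rates can be sketched with the normal (Wald) approximation. The function name and example counts are assumptions, and the interval is expressed as an absolute difference in conversion rate (0.01 means one percentage point), which may not be how your tool reports lift.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                             confidence: float = 0.95) -> tuple:
    """Wald interval for (variant rate - control rate)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each group keeps its own variance.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Usage: the same observed rates with 10x less data give a much wider interval.
print(diff_confidence_interval(400, 4000, 470, 4000))
print(diff_confidence_interval(40, 400, 47, 400))
```

With the larger sample the interval sits entirely above zero; with the smaller one it spans zero, which is the inconclusive case described above.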

Statistical significance vs. practical significance

A change can be statistically significant (unlikely to be random) but practically insignificant (too small to matter). A 0.1% improvement in conversion that required 6 weeks to detect may not be worth shipping.

Watch out

Always ask: is this effect size meaningful for the business? Will it move revenue, retention, or user satisfaction in a way that justifies the engineering cost?

Report results as confidence intervals, not just p-values. A result of “+3.2% conversion (95% CI: +0.8% to +5.6%)” is more useful than “p = 0.03.” The interval tells your team what to expect if you ship — and whether the lower bound of the range is still worth the investment.
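One way to make the lower-bound check explicit is to compare the interval against the smallest effect worth shipping. The sketch below is purely illustrative: the function name, the labels, and the thresholds in the usage example are hypothetical, and the minimum worthwhile effect is a business judgment that the test itself cannot supply.

```python
def interpret_interval(ci_low: float, ci_high: float,
                       min_worthwhile_effect: float) -> str:
    """Classify a result by where its confidence interval sits.

    All three arguments must use the same units (e.g. percent lift).
    """
    if ci_high < 0:
        return "significantly worse"
    if ci_low <= 0:
        return "inconclusive"
    if ci_low >= min_worthwhile_effect:
        return "significant and practically meaningful"
    return "significant, but possibly too small to matter"


# Usage with the interval from the text: +0.8% to +5.6%.
print(interpret_interval(0.8, 5.6, min_worthwhile_effect=0.5))
print(interpret_interval(0.8, 5.6, min_worthwhile_effect=1.0))
print(interpret_interval(-1.0, 7.0, min_worthwhile_effect=0.5))
```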

Keep learning

Learn product metrics

A/B testing and metrics go hand in hand. Once you know what to test, you need to know which numbers actually matter — and which are noise.
