How to Calculate Sample Size for an A/B Test
Sample size calculation is the most important step in planning an A/B test. Running a test without enough visitors risks missing real improvements (a false negative), while running one for too long wastes time and traffic. The formula below tells you exactly how many subjects you need in each variation to reliably detect a given effect.
The Sample Size Formula
For comparing two proportions (e.g., conversion rates), the required sample size per group is:
n = ⌈ (Zα/2 · √(2p̄(1 - p̄)) + Zβ · √(p₁(1 - p₁) + p₂(1 - p₂)))² / (p₁ - p₂)² ⌉
Where:
- n — required sample size per variation (rounded up)
- p₁ — baseline conversion rate (your current control rate)
- p₂ — expected variant conversion rate (baseline + your minimum detectable effect)
- p̄ — pooled proportion, calculated as (p₁ + p₂) / 2
- Zα/2 — z-score for the significance level (1.96 for α = 5%, two-tailed)
- Zβ — z-score for statistical power (0.84 for 80% power)
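The formula above can be sketched in Python. This is a minimal illustration, not a production calculator; the function name and its defaults are my own, with the z-scores fixed at the rounded values used in this article (1.96 and 0.84):

```python
import math

def sample_size_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Required visitors per variation to detect a change from p1 to p2.

    z_alpha: z-score for a 5% significance level, two-tailed (default 1.96)
    z_beta:  z-score for 80% statistical power (default 0.84)
    """
    p_bar = (p1 + p2) / 2  # pooled proportion
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)  # round up

print(sample_size_per_group(0.05, 0.06))  # → 8149
```

The ceiling rounds up to a whole visitor, matching the ⌈ ⌉ brackets in the formula.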
Worked Example
Suppose your landing page converts at 5% and you want to detect a 1 percentage point lift (to 6%) with 80% power and 5% significance (two-tailed).
- Set p₁ = 0.05, p₂ = 0.06, so p̄ = 0.055
- Look up z-scores: Zα/2 = 1.96, Zβ = 0.84
- Plug into the formula: n = ⌈(1.96 × √(2 × 0.055 × 0.945) + 0.84 × √(0.05 × 0.95 + 0.06 × 0.94))² / (0.05 - 0.06)²⌉
- Result: n ≈ 8,149 visitors per variation (16,298 total)
At 500 visitors per day, this test would take about 33 days to complete.
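The z-scores in step 2 need not be looked up in a table: Python's standard library can derive them from the significance level and power. A small sketch using `statistics.NormalDist` (variable names are my own):

```python
from statistics import NormalDist

alpha = 0.05  # significance level (two-tailed)
power = 0.80  # statistical power (1 - beta)

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
z_beta = NormalDist().inv_cdf(power)           # power z-score

print(round(z_alpha, 2), round(z_beta, 2))  # → 1.96 0.84
```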
Understanding the Key Parameters
Baseline Conversion Rate
Your current conversion rate before running the test, typically measured from historical data. For a given relative lift, lower baseline rates generally require larger sample sizes, because the same percentage improvement translates into a smaller absolute difference, which is harder to detect.
Minimum Detectable Effect (MDE)
The smallest improvement you care about detecting. This can be expressed as an absolute change (e.g., +2 percentage points) or a relative change (e.g., +10% lift). Smaller effects require quadratically more samples to detect: halving the MDE roughly quadruples the required sample size. Choose an MDE that is both statistically detectable and practically meaningful for your business.
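The quadratic relationship is easy to check numerically. A quick sketch (the inline computation mirrors the two-proportion formula given earlier in this article, with z-scores 1.96 and 0.84; the function name is my own):

```python
import math

def n_per_group(p1, p2, za=1.96, zb=0.84):
    # Two-proportion sample-size formula with pooled p̄ = (p1 + p2) / 2.
    pb = (p1 + p2) / 2
    s = za * math.sqrt(2 * pb * (1 - pb)) + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(s ** 2 / (p1 - p2) ** 2)

n_1pp = n_per_group(0.05, 0.06)  # 1 percentage point MDE
n_2pp = n_per_group(0.05, 0.07)  # 2 percentage point MDE
print(n_1pp, n_2pp)  # halving the MDE roughly quadruples n
```

The ratio is not exactly 4 because the variance terms also shift with p₂, but it is close.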
Statistical Power (1 - β)
The probability of correctly detecting a real effect. A power of 80% means that if the true effect is at least as large as your MDE, the test will declare significance 80% of the time. Higher power requires more samples. The industry standard is 80%, though some teams use 90% for high-stakes decisions.
Significance Level (α)
The probability of a false positive — declaring a winner when there is no real difference. A significance level of 5% means there is a 1-in-20 chance of a false alarm. Lower significance levels require more samples. The standard is 5% (α = 0.05).
One-Tailed vs. Two-Tailed Tests
A two-tailed test checks whether the variant is significantly different from the control in either direction (better or worse). A one-tailed test only checks one direction. Two-tailed tests are the default because they protect against shipping harmful changes. One-tailed tests require fewer samples but should only be used when you are certain about the direction of the effect.
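The saving from a one-tailed test comes entirely from the smaller critical value: for α = 5%, Zα drops from 1.96 (two-tailed) to about 1.645 (one-tailed). A sketch quantifying the difference for the 5% → 6% scenario (the function and its names are my own, using the article's formula):

```python
import math

def n_per_group(p1, p2, z_alpha, z_beta=0.84):
    # Two-proportion sample-size formula, 80% power by default.
    p_bar = (p1 + p2) / 2
    s = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(s ** 2 / (p1 - p2) ** 2)

n_two = n_per_group(0.05, 0.06, z_alpha=1.96)   # two-tailed, alpha = 5%
n_one = n_per_group(0.05, 0.06, z_alpha=1.645)  # one-tailed, alpha = 5%
print(n_two, n_one)  # the one-tailed test needs roughly 20% fewer visitors
```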
Multiple Variants and the Bonferroni Correction
When testing more than one variant against a control (e.g., A/B/C testing), the chance of a false positive increases with each comparison. The Bonferroni correction addresses this by dividing the significance level by the number of comparisons. For example, in an A/B/C test with α = 5%, each of the 2 comparisons uses an adjusted α of 2.5%. This maintains the overall false positive rate at the desired level.
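In practice the correction simply shrinks α before the z-score lookup, which raises the critical value and therefore the required sample size per comparison. A sketch using `statistics.NormalDist` (variable names are my own):

```python
from statistics import NormalDist

alpha = 0.05
num_comparisons = 2  # A/B/C test: two variants, each compared to the control

alpha_adj = alpha / num_comparisons              # Bonferroni: 0.025 per comparison
z_adj = NormalDist().inv_cdf(1 - alpha_adj / 2)  # two-tailed critical value

print(alpha_adj, round(z_adj, 2))  # → 0.025 2.24
```

The critical value rises from 1.96 to about 2.24, so each comparison needs more visitors than a plain A/B test would.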
Frequently Asked Questions
How many people do I need for an A/B test?
It depends on your baseline conversion rate, the minimum effect you want to detect, and your chosen statistical power and significance level. As a rough guide, detecting a 1 percentage point lift on a 5% baseline at 80% power and 5% significance requires about 8,150 visitors per variation. Use the calculator above to get an exact number for your scenario.
What sample size do I need for 95% confidence?
A 95% confidence level corresponds to a 5% significance level (α = 0.05), which is the industry standard. The required sample size also depends on your baseline rate, minimum detectable effect, and statistical power. For example, with a 10% baseline rate and a 1 percentage point MDE at 80% power, you would need roughly 14,700 visitors per group.
What happens if my sample size is too small?
Running an A/B test with an insufficient sample size means your test is underpowered. This leads to a high risk of false negatives — failing to detect a real effect. It also makes any observed results unreliable and more susceptible to random noise. Always calculate the required sample size before starting your test.
How long should I run my A/B test?
The minimum duration depends on your required sample size and daily traffic. Divide the total required sample size by your daily visitor count. Additionally, it's best practice to run tests for at least one full business cycle (typically 1–2 weeks) to account for day-of-week effects, even if you reach the required sample size sooner.
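The duration calculation is simple division plus a floor for the business cycle. A sketch, assuming a requirement of 8,149 visitors per variation (the formula's output for a 5% → 6% lift at 80% power), 500 total daily visitors, and a 14-day business cycle:

```python
import math

per_variation = 8_149  # assumed requirement (5% -> 6%, 80% power, 5% significance)
variations = 2
daily_visitors = 500   # total traffic entering the test per day

min_days = math.ceil(per_variation * variations / daily_visitors)
days_to_run = max(min_days, 14)  # at least one full business cycle (assumed 14 days)
print(min_days, days_to_run)  # → 33 33
```

Here the sample-size requirement already exceeds the two-week minimum, so it alone determines the duration.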
Should I use a one-tailed or two-tailed test?
Use a two-tailed test in most cases. A two-tailed test detects effects in both directions (improvements and regressions), which is important because a change that you expect to improve metrics can sometimes make things worse. Only use a one-tailed test if you are certain the effect can only go in one direction and you do not care about detecting effects in the opposite direction.
What is the Bonferroni correction and when do I need it?
The Bonferroni correction adjusts the significance level when you test multiple variants against a single control (e.g., an A/B/C test). Without correction, the probability of at least one false positive increases with each additional comparison. The correction divides your significance level by the number of comparisons. For example, an A/B/C test involves 2 comparisons (each variant against the control), so with α = 5% the adjusted α per comparison is 2.5%. This calculator applies the correction automatically when you set more than 2 variations.