A/B Test Calculator

Calculate statistical significance of your experiments

🧪 What is A/B Testing?

A/B testing (also called split testing) is a method of comparing two versions of a webpage, email, or other marketing asset to determine which one performs better. You split your audience randomly between version A (control) and version B (variant), then measure which version achieves more conversions.

The key challenge is determining whether the difference in performance is statistically significant or just due to random chance. This calculator uses a two-proportion z-test to determine if your results are reliable enough to make a decision.

📐 The Math Behind It

Z-Score Formula
Z = (p₁ − p₂) ÷ √[p̄(1 − p̄)(1/n₁ + 1/n₂)]
Where p₁ and p₂ are the control and variant conversion rates, n₁ and n₂ are the sample sizes, and p̄ = (x₁ + x₂)/(n₁ + n₂) is the pooled proportion (total conversions divided by total visitors).

The p-value is the probability of observing a difference at least as large as yours if the two variants actually performed identically. A p-value below your threshold (e.g., 0.05 for 95% confidence) indicates statistical significance.
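As a sketch, the z-test and two-sided p-value above can be computed with only Python's standard library (the function name and the example counts below are illustrative, not part of this calculator):

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-proportion z-test; returns (z, p_value) for a two-sided test."""
    p1 = conv_a / n_a                            # control conversion rate
    p2 = conv_b / n_b                            # variant conversion rate
    p_pool = (conv_a + conv_b) / (n_a + n_b)     # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Example: 10,000 visitors per variant, 500 vs. 560 conversions
z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
# p is just under 0.06 here: close, but not significant at 95% confidence
```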

📏 Sample Size Requirements

The sample size you need depends on your baseline conversion rate and the minimum detectable effect (MDE) you want to measure.

Baseline Rate | 10% MDE | 20% MDE | 50% MDE
1%            | 190,000 | 48,000  | 7,700
3%            | 62,000  | 15,500  | 2,500
5%            | 36,400  | 9,100   | 1,500
10%           | 17,200  | 4,300   | 700
20%           | 7,700   | 1,900   | 310

*Per variant, at 95% confidence and 80% statistical power. MDE = Minimum Detectable Effect (relative change).
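A rough version of this sample-size calculation can be sketched in Python. Note that different calculators use slightly different variance approximations, so the numbers below are in the same ballpark as the table but will not match it exactly:

```python
import math

Z_ALPHA = 1.96   # standard normal quantile for two-sided 95% confidence
Z_BETA = 0.84    # standard normal quantile for 80% power

def sample_size_per_variant(baseline: float, mde: float,
                            z_alpha: float = Z_ALPHA,
                            z_beta: float = Z_BETA) -> int:
    """Approximate visitors needed per variant to detect a relative MDE."""
    p1 = baseline
    p2 = baseline * (1 + mde)    # MDE is relative, e.g. 0.2 means +20%
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# e.g. 5% baseline, 20% relative MDE: roughly 8,000-9,000 per variant
n = sample_size_per_variant(0.05, 0.20)
```

Larger minimum detectable effects need far fewer visitors, which is why the 50% MDE column shrinks so quickly.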

💡 Best Practices

✓ Determine Sample Size First
Calculate required sample size before starting. Running until you see significance leads to false positives.
✓ Run Full Business Cycles
Include weekdays and weekends. Behavior varies by day, so run tests for at least 1-2 full weeks.
✓ Test One Variable
Change only one element at a time. Testing multiple changes makes it impossible to know what worked.
✓ Document Everything
Record hypothesis, metrics, dates, and segments. Build organizational knowledge from every test.

⚠️ Common Mistakes

✗ Peeking at Results
Checking results daily and stopping when significant inflates false positive rate from 5% to 30%+.
✗ Underpowered Tests
Running with too few visitors means you’ll miss real effects and may wrongly conclude “no difference.”
✗ Testing Too Many Variants
Each variant needs traffic. With 5 variants, you need 5x the sample size for the same statistical power.
✗ Ignoring Segments
Overall results may hide that variant B wins for mobile but loses for desktop. Check key segments.
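The "peeking" problem above can be demonstrated with a small A/A simulation (all numbers below are hypothetical): both variants convert at the same rate, so every declared "winner" is a false positive. Checking at several interim points declares significance far more often than a single check at the planned end.

```python
import math
import random

def z_significant(c1: int, n1: int, c2: int, n2: int, z_crit: float = 1.96) -> bool:
    """Two-sided two-proportion z-test at 95% confidence."""
    p_pool = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return False
    return abs((c2 / n2 - c1 / n1) / se) > z_crit

random.seed(42)
SIMS, STAGES, N_PER_STAGE, P = 500, 4, 500, 0.05  # A/A test: both convert at 5%

peeking_fp = fixed_fp = 0
for _ in range(SIMS):
    c1 = c2 = n = 0
    peeked = False
    for _ in range(STAGES):
        c1 += sum(random.random() < P for _ in range(N_PER_STAGE))
        c2 += sum(random.random() < P for _ in range(N_PER_STAGE))
        n += N_PER_STAGE
        if z_significant(c1, n, c2, n):
            peeked = True                 # a "peeker" stops here and ships a winner
    peeking_fp += peeked
    fixed_fp += z_significant(c1, n, c2, n)   # one look at the planned sample size

print(f"peeking: {peeking_fp / SIMS:.1%}, fixed: {fixed_fp / SIMS:.1%}")
```

The fixed-horizon check stays near the nominal 5% false positive rate, while the peeking strategy's rate is substantially higher.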

❓ Frequently Asked Questions

What confidence level should I use?
95% is the industry standard and works for most cases. Use 99% for high-stakes decisions (pricing, major redesigns) where false positives are costly. 90% is acceptable for low-risk tests or when you need faster results and can accept more uncertainty.

Why isn't my test reaching statistical significance?
Three common reasons: (1) Not enough traffic — you need more visitors to detect small differences, (2) Effect is too small — the actual difference between variants may be negligible, (3) High variance — your conversion rate fluctuates a lot. Calculate the required sample size and run until you reach it.

Can I stop a test early once results look significant?
No. Early stopping inflates your false positive rate dramatically. If you've calculated that 10,000 visitors are needed, run until 10,000 — even if results look significant at 3,000. The exception is if you're using sequential testing methods designed for early stopping.

What's the difference between confidence and statistical power?
Confidence (1 minus alpha) is the probability of correctly not declaring a winner when there is no real difference (avoiding false positives). Power (1 minus beta) is the probability of detecting a real effect when it exists (avoiding false negatives). Standard values: 95% confidence, 80% power.

Does a statistically significant result always mean I should ship the variant?
Statistical significance doesn't equal practical significance. A 0.1% conversion lift may be statistically significant with enough data, but not worth the implementation effort. Consider: implementation cost, maintenance burden, and whether the lift is meaningful for your business.

How should I handle multiple metrics?
Define one primary metric before the test starts — that's what determines success. Secondary metrics provide context but shouldn't change your decision. Testing multiple metrics increases false positive risk; apply a Bonferroni correction if you must test several primary metrics.
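As a sketch, the Bonferroni correction mentioned above simply divides the overall alpha across the metrics being tested (the function below is illustrative):

```python
def bonferroni_threshold(alpha: float, num_metrics: int) -> float:
    """Per-metric significance threshold under a Bonferroni correction."""
    return alpha / num_metrics

# Three primary metrics at an overall alpha of 0.05:
threshold = bonferroni_threshold(0.05, 3)   # each metric must reach p < ~0.0167
```

This keeps the chance of any false positive across all metrics at roughly the original alpha, at the cost of requiring stronger evidence per metric.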