A/B Test Calculator

Calculate statistical significance of your experiments

🧪 What is A/B Testing?

A/B testing (also called split testing) is a method of comparing two versions of a webpage, email, or other marketing asset to determine which one performs better. You split your audience randomly between version A (control) and version B (variant), then measure which version achieves more conversions.

The key challenge is determining whether the difference in performance is statistically significant or just due to random chance. This calculator uses a two-proportion z-test to determine if your results are reliable enough to make a decision.

📐 The Math Behind It

Z-Score Formula
Z = (p₁ − p₂) ÷ √[p̄(1 − p̄)(1/n₁ + 1/n₂)]
Where p₁ and p₂ are the control and variant conversion rates, n₁ and n₂ are the sample sizes, and p̄ = (x₁ + x₂)/(n₁ + n₂) is the pooled proportion (total conversions divided by total visitors).

The p-value is the probability of observing a difference at least as large as yours if the two variants actually performed identically. A p-value below your threshold (e.g., 0.05 for 95% confidence) indicates statistical significance.
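As a sketch, the z-test and two-sided p-value above can be computed with only Python's standard library (the function name and the example counts below are illustrative, not part of this calculator):

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-proportion z-test; returns (z, p_value) for a two-sided test."""
    p1 = conv_a / n_a                            # control conversion rate
    p2 = conv_b / n_b                            # variant conversion rate
    p_pool = (conv_a + conv_b) / (n_a + n_b)     # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Example: 10,000 visitors per variant, 500 vs. 560 conversions
z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
# p is just under 0.06 here: close, but not significant at 95% confidence
```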

📏 Sample Size Requirements

The sample size you need depends on your baseline conversion rate and the minimum detectable effect (MDE) you want to measure.

Baseline Rate | 10% MDE | 20% MDE | 50% MDE
1%            | 190,000 | 48,000  | 7,700
3%            | 62,000  | 15,500  | 2,500
5%            | 36,400  | 9,100   | 1,500
10%           | 17,200  | 4,300   | 700
20%           | 7,700   | 1,900   | 310

*Per variant, at 95% confidence and 80% statistical power. MDE = Minimum Detectable Effect (relative change).
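A rough version of this sample-size calculation can be sketched in Python. Note that different calculators use slightly different variance approximations, so the numbers below are in the same ballpark as the table but will not match it exactly:

```python
import math

Z_ALPHA = 1.96   # standard normal quantile for two-sided 95% confidence
Z_BETA = 0.84    # standard normal quantile for 80% power

def sample_size_per_variant(baseline: float, mde: float,
                            z_alpha: float = Z_ALPHA,
                            z_beta: float = Z_BETA) -> int:
    """Approximate visitors needed per variant to detect a relative MDE."""
    p1 = baseline
    p2 = baseline * (1 + mde)    # MDE is relative, e.g. 0.2 means +20%
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# e.g. 5% baseline, 20% relative MDE: roughly 8,000-9,000 per variant
n = sample_size_per_variant(0.05, 0.20)
```

Larger minimum detectable effects need far fewer visitors, which is why the 50% MDE column shrinks so quickly.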

💡 Best Practices

✓ Determine Sample Size First
Calculate required sample size before starting. Running until you see significance leads to false positives.
✓ Run Full Business Cycles
Include weekdays and weekends. Behavior varies by day, so run tests for at least 1-2 full weeks.
✓ Test One Variable
Change only one element at a time. Testing multiple changes makes it impossible to know what worked.
✓ Document Everything
Record hypothesis, metrics, dates, and segments. Build organizational knowledge from every test.

⚠️ Common Mistakes

✗ Peeking at Results
Checking results daily and stopping when significant inflates false positive rate from 5% to 30%+.
✗ Underpowered Tests
Running with too few visitors means you’ll miss real effects and may wrongly conclude “no difference.”
✗ Testing Too Many Variants
Each variant needs traffic. With 5 variants, you need 5x the sample size for the same statistical power.
✗ Ignoring Segments
Overall results may hide that variant B wins for mobile but loses for desktop. Check key segments.
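The "peeking" problem above can be demonstrated with a small A/A simulation (all numbers below are hypothetical): both variants convert at the same rate, so every declared "winner" is a false positive. Checking at several interim points declares significance far more often than a single check at the planned end.

```python
import math
import random

def z_significant(c1: int, n1: int, c2: int, n2: int, z_crit: float = 1.96) -> bool:
    """Two-sided two-proportion z-test at 95% confidence."""
    p_pool = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return False
    return abs((c2 / n2 - c1 / n1) / se) > z_crit

random.seed(42)
SIMS, STAGES, N_PER_STAGE, P = 500, 4, 500, 0.05  # A/A test: both convert at 5%

peeking_fp = fixed_fp = 0
for _ in range(SIMS):
    c1 = c2 = n = 0
    peeked = False
    for _ in range(STAGES):
        c1 += sum(random.random() < P for _ in range(N_PER_STAGE))
        c2 += sum(random.random() < P for _ in range(N_PER_STAGE))
        n += N_PER_STAGE
        if z_significant(c1, n, c2, n):
            peeked = True                 # a "peeker" stops here and ships a winner
    peeking_fp += peeked
    fixed_fp += z_significant(c1, n, c2, n)   # one look at the planned sample size

print(f"peeking: {peeking_fp / SIMS:.1%}, fixed: {fixed_fp / SIMS:.1%}")
```

The fixed-horizon check stays near the nominal 5% false positive rate, while the peeking strategy's rate is substantially higher.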

❓ Frequently Asked Questions

What confidence level should I use?
95% is the industry standard and works for most cases. Use 99% for high-stakes decisions (pricing, major redesigns) where false positives are costly. 90% is acceptable for low-risk tests or when you need faster results and can accept more uncertainty.

Why isn't my test reaching statistical significance?
Three common reasons: (1) Not enough traffic — you need more visitors to detect small differences, (2) Effect is too small — the actual difference between variants may be negligible, (3) High variance — your conversion rate fluctuates a lot. Calculate the required sample size and run until you reach it.

Can I stop a test early once results look significant?
No. Early stopping inflates your false positive rate dramatically. If you've calculated that 10,000 visitors are needed, run until 10,000 — even if results look significant at 3,000. The exception is if you're using sequential testing methods designed for early stopping.

What's the difference between confidence and statistical power?
Confidence (1 minus alpha) is the probability of correctly not declaring a winner when there is no real difference (avoiding false positives). Power (1 minus beta) is the probability of detecting a real effect when it exists (avoiding false negatives). Standard values: 95% confidence, 80% power.

Does a statistically significant result always mean I should ship the variant?
Statistical significance doesn't equal practical significance. A 0.1% conversion lift may be statistically significant with enough data, but not worth the implementation effort. Consider: implementation cost, maintenance burden, and whether the lift is meaningful for your business.

How should I handle multiple metrics?
Define one primary metric before the test starts — that's what determines success. Secondary metrics provide context but shouldn't change your decision. Testing multiple metrics increases false positive risk; apply a Bonferroni correction if you must test several primary metrics.
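As a sketch, the Bonferroni correction mentioned above simply divides the overall alpha across the metrics being tested (the function below is illustrative):

```python
def bonferroni_threshold(alpha: float, num_metrics: int) -> float:
    """Per-metric significance threshold under a Bonferroni correction."""
    return alpha / num_metrics

# Three primary metrics at an overall alpha of 0.05:
threshold = bonferroni_threshold(0.05, 3)   # each metric must reach p < ~0.0167
```

This keeps the chance of any false positive across all metrics at roughly the original alpha, at the cost of requiring stronger evidence per metric.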